We have 8 nodes running Proxmox 8.1.11.
Our data center required us to vacate a rack that was going to be used for a different service, so we had to move 4 of the 8 machines to another rack.
We migrated all VMs off those 4 nodes (pxnode5-8) onto the other 4 (pxnode1-4) that weren't going to move.
When we shut down and moved those machines we lost quorum, the 4 nodes that stayed up went "offline", and all the VMs suddenly became unresponsive. That was entirely our fault: we forgot that an 8-node cluster needs 5 votes for quorum and broke it the moment we dropped below that, and we never put the moved nodes into maintenance mode either. Still, it doesn't bode well for what happens if machines ever lose power unexpectedly, and right now 7 of the 8 nodes show as offline in the web UI.
I know everyone is going to tell me I should have an odd number of nodes in the cluster, but we do have quorum, as you can see in the output below.
We immediately plugged pxnode5-8 back in, started them up, and got quorum back.
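In hindsight I think we could have kept the four remaining nodes quorate during the move by lowering the expected vote count first. Something like this, as I read the pvecm docs (we did not actually run it):

# On one of the four nodes staying online, BEFORE powering off pxnode5-8:
# temporarily lower expected votes so that 4 of the 8 nodes still form quorum.
pvecm expected 4

# Confirm the new threshold took effect.
pvecm status

# As far as I can tell, votequorum raises expected votes again on its own
# once the moved nodes rejoin, so there is nothing to undo afterwards.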
pvesh get /cluster/status
┌───────────────────────────────┬──────────────────────────┬─────────┬────────────────┬───────┬───────┬────────┬───────┬────────┬─────────┬─────────┐
│ id │ name │ type │ ip │ level │ local │ nodeid │ nodes │ online │ quorate │ version │
╞═══════════════════════════════╪══════════════════════════╪═════════╪════════════════╪═══════╪═══════╪════════╪═══════╪════════╪═════════╪═════════╡
│ cluster │ cxxxxxxx-px │ cluster │ │ │ │ │ 8 │ │ 1 │ 14 │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode1-xxxx-21 │ cxxxxxxx-pxnode1-xxxx-21 │ node │ 169.228.56.220 │ │ 0 │ 1 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode2-xxxx-22 │ cxxxxxxx-pxnode2-xxxx-22 │ node │ 169.228.56.228 │ │ 0 │ 8 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode3-xxxx-21 │ cxxxxxxx-pxnode3-xxxx-21 │ node │ 10.200.0.12 │ │ 0 │ 7 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode4-xxxx-22 │ cxxxxxxx-pxnode4-xxxx-22 │ node │ 169.228.56.223 │ │ 0 │ 3 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode5-xxxx-23 │ cxxxxxxx-pxnode5-xxxx-23 │ node │ 169.228.56.224 │ │ 0 │ 4 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode6-xxxx-24 │ cxxxxxxx-pxnode6-xxxx-24 │ node │ 169.228.56.225 │ │ 0 │ 5 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode7-xxxx-23 │ cxxxxxxx-pxnode7-xxxx-23 │ node │ 169.228.56.226 │ │ 1 │ 6 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode8-xxxx-24 │ cxxxxxxx-pxnode8-xxxx-24 │ node │ 169.228.56.233 │ │ 0 │ 2 │ │ 1 │ │ │
└───────────────────────────────┴──────────────────────────┴─────────┴────────────────┴───────┴───────┴────────┴───────┴────────┴─────────┴─────────┘
pvecm status
Cluster information
-------------------
Name: cxxxxxx-px
Config Version: 12
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sat Nov 2 19:08:39 2024
Quorum provider: corosync_votequorum
Nodes: 8
Node ID: 0x00000006
Ring ID: 1.10de
Quorate: Yes
Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 8
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.200.0.10
0x00000002 1 10.200.0.17
0x00000003 1 10.200.0.13
0x00000004 1 10.200.0.14
0x00000005 1 10.200.0.15
0x00000006 1 10.200.0.16 (local)
0x00000007 1 10.200.0.12
0x00000008 1 10.200.0.11
This is the current state of the cluster:
The Corosync configuration is correct, and all corosync IPs respond in under 1 ms.
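(If anyone wants to reproduce that latency check, something along these lines from each node does it; 10.200.0.16 is just this node's own corosync address, swap in the other addresses in turn.)

# Ping one corosync address from every node and keep only the rtt summary line.
pdsh -R ssh -w 10.200.0.[10-17] 'ping -c 3 -q 10.200.0.16 | tail -n 1'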
If I reboot a machine or restart the pvestatd service, I get a green "OK" state for about 5 minutes, then it goes away again.
A couple more things: port 8006 is reachable on all hosts, firewalls are disabled, Corosync runs on its own network separate from the public network, and the NICs are 10 Gb Ethernet.
I don't know what else to say; I'm at my wits' end trying to understand what's going on, and the logs aren't showing anything. I'm specifically after the logs for the management interface.
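So far I've been going through the standard systemd units below and nothing stands out; if there's a better place to look for the management interface, please point me at it.

# GUI / API stack on a single node:
journalctl -u pveproxy -u pvedaemon --since "-1h"

# The daemon that feeds node and VM status to the GUI (the part that seems to stop):
journalctl -u pvestatd --since "-1h"

# Cluster filesystem and corosync, in case the status data itself is going stale:
journalctl -u pve-cluster -u corosync --since "-1h"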
pveproxy runs on all hosts and shows no errors.
pdsh -R ssh -w 10.200.0.[10-17] pveproxy status
10.200.0.10: running
10.200.0.16: running
10.200.0.12: running
10.200.0.13: running
10.200.0.11: running
10.200.0.17: running
10.200.0.15: running
10.200.0.14: running
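I haven't done the same sweep for the status daemons yet; that's next on my list, roughly like this:

# Check the daemons that feed the GUI status, on every node.
pdsh -R ssh -w 10.200.0.[10-17] 'systemctl is-active pvestatd pve-cluster corosync'

# And confirm the cluster filesystem (pmxcfs) is actually mounted everywhere.
pdsh -R ssh -w 10.200.0.[10-17] 'mountpoint /etc/pve'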
# corosync-cfgtool -s
Local node ID 6, transport knet
LINK ID 0 udp
addr = 10.200.0.16
status:
nodeid: 1: connected
nodeid: 2: connected
nodeid: 3: connected
nodeid: 4: connected
nodeid: 5: connected
nodeid: 6: localhost
nodeid: 7: connected
nodeid: 8: connected
I've literally lost my hair over this. Do I need to remove nodes and add them back to the cluster from the one node that is still showing "OK"?
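For context, my understanding of what that would involve (straight from the cluster docs as I read them; I have not tried it here and would much rather avoid it):

# From a healthy, quorate node, with the node being removed powered off
# (and not allowed to come back up with its old cluster config):
pvecm delnode pxnode5

# Then, after reinstalling that node (or fully wiping its old corosync and
# cluster config), join it back by pointing at an existing member:
pvecm add 10.200.0.16

# pxnode5 / 10.200.0.16 above are just our own names used as an example.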
If this isn't the right sub, my apologies. I will post in another spot.
Thank you.