We have 8 nodes running Proxmox 8.1.11.
Our data center required us to vacate a rack that was going to be used for a different service, so we had to move 4 of the 8 machines to another rack.
We migrated all VMs off those 4 nodes (pxnode5-8) onto the other 4 (pxnode1-4) that weren't going to move.
When we shut down and moved those machines we lost quorum, the 4 nodes that stayed up went "offline", and all the VMs suddenly became unresponsive. That was entirely our fault: we forgot that an 8-node cluster needs 5 votes for quorum and broke it the moment we dropped below that, and we never put the moved nodes into maintenance mode either. Still, it doesn't bode well for what happens if machines ever lose power unexpectedly, and right now 7 of the 8 nodes show as offline in the web UI.
I know everyone is going to tell me I should have an odd number of nodes in the cluster, but we do have quorum, as you can see in the output below.
We immediately plugged pxnode5-8 back in, started them up, and got quorum back.
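In hindsight I think we could have kept the four remaining nodes quorate during the move by lowering the expected vote count first. Something like this, as I read the pvecm docs (we did not actually run it):

# On one of the four nodes staying online, BEFORE powering off pxnode5-8:
# temporarily lower expected votes so that 4 of the 8 nodes still form quorum.
pvecm expected 4

# Confirm the new threshold took effect.
pvecm status

# As far as I can tell, votequorum raises expected votes again on its own
# once the moved nodes rejoin, so there is nothing to undo afterwards.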
pvesh get /cluster/status
┌───────────────────────────────┬──────────────────────────┬─────────┬────────────────┬───────┬───────┬────────┬───────┬────────┬─────────┬─────────┐
│ id │ name │ type │ ip │ level │ local │ nodeid │ nodes │ online │ quorate │ version │
╞═══════════════════════════════╪══════════════════════════╪═════════╪════════════════╪═══════╪═══════╪════════╪═══════╪════════╪═════════╪═════════╡
│ cluster │ cxxxxxxx-px │ cluster │ │ │ │ │ 8 │ │ 1 │ 14 │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode1-xxxx-21 │ cxxxxxxx-pxnode1-xxxx-21 │ node │ 169.228.56.220 │ │ 0 │ 1 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode2-xxxx-22 │ cxxxxxxx-pxnode2-xxxx-22 │ node │ 169.228.56.228 │ │ 0 │ 8 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode3-xxxx-21 │ cxxxxxxx-pxnode3-xxxx-21 │ node │ 10.200.0.12 │ │ 0 │ 7 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode4-xxxx-22 │ cxxxxxxx-pxnode4-xxxx-22 │ node │ 169.228.56.223 │ │ 0 │ 3 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode5-xxxx-23 │ cxxxxxxx-pxnode5-xxxx-23 │ node │ 169.228.56.224 │ │ 0 │ 4 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode6-xxxx-24 │ cxxxxxxx-pxnode6-xxxx-24 │ node │ 169.228.56.225 │ │ 0 │ 5 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode7-xxxx-23 │ cxxxxxxx-pxnode7-xxxx-23 │ node │ 169.228.56.226 │ │ 1 │ 6 │ │ 1 │ │ │
├───────────────────────────────┼──────────────────────────┼─────────┼────────────────┼───────┼───────┼────────┼───────┼────────┼─────────┼─────────┤
│ node/cxxxxxxx-pxnode8-xxxx-24 │ cxxxxxxx-pxnode8-xxxx-24 │ node │ 169.228.56.233 │ │ 0 │ 2 │ │ 1 │ │ │
└───────────────────────────────┴──────────────────────────┴─────────┴────────────────┴───────┴───────┴────────┴───────┴────────┴─────────┴─────────┘
pvecm status
Cluster information
-------------------
Name: cxxxxxx-px
Config Version: 12
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sat Nov 2 19:08:39 2024
Quorum provider: corosync_votequorum
Nodes: 8
Node ID: 0x00000006
Ring ID: 1.10de
Quorate: Yes
Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 8
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.200.0.10
0x00000002 1 10.200.0.17
0x00000003 1 10.200.0.13
0x00000004 1 10.200.0.14
0x00000005 1 10.200.0.15
0x00000006 1 10.200.0.16 (local)
0x00000007 1 10.200.0.12
0x00000008 1 10.200.0.11
This is the current state of the cluster:
The Corosync configuration is correct, and all corosync IPs respond in under 1 ms.
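(If anyone wants to reproduce that latency check, something along these lines from each node does it; 10.200.0.16 is just this node's own corosync address, swap in the other addresses in turn.)

# Ping one corosync address from every node and keep only the rtt summary line.
pdsh -R ssh -w 10.200.0.[10-17] 'ping -c 3 -q 10.200.0.16 | tail -n 1'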
If I reboot a machine or restart the pvestatd service, I get a green "OK" state for about 5 minutes, then it goes away again.
A couple more things: port 8006 is reachable on all hosts, firewalls are disabled, Corosync runs on its own network separate from the public network, and the NICs are 10 Gb Ethernet.
I don't know what else to say; I'm at my wits' end trying to understand what's going on, and the logs aren't showing anything. I'm specifically after the logs for the management interface.
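So far I've been going through the standard systemd units below and nothing stands out; if there's a better place to look for the management interface, please point me at it.

# GUI / API stack on a single node:
journalctl -u pveproxy -u pvedaemon --since "-1h"

# The daemon that feeds node and VM status to the GUI (the part that seems to stop):
journalctl -u pvestatd --since "-1h"

# Cluster filesystem and corosync, in case the status data itself is going stale:
journalctl -u pve-cluster -u corosync --since "-1h"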
pveproxy runs on all hosts and shows no errors.
pdsh -R ssh -w 10.200.0.[10-17] pveproxy status
10.200.0.10: running
10.200.0.16: running
10.200.0.12: running
10.200.0.13: running
10.200.0.11: running
10.200.0.17: running
10.200.0.15: running
10.200.0.14: running
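I haven't done the same sweep for the status daemons yet; that's next on my list, roughly like this:

# Check the daemons that feed the GUI status, on every node.
pdsh -R ssh -w 10.200.0.[10-17] 'systemctl is-active pvestatd pve-cluster corosync'

# And confirm the cluster filesystem (pmxcfs) is actually mounted everywhere.
pdsh -R ssh -w 10.200.0.[10-17] 'mountpoint /etc/pve'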
# corosync-cfgtool -s
Local node ID 6, transport knet
LINK ID 0 udp
addr = 10.200.0.16
status:
nodeid: 1: connected
nodeid: 2: connected
nodeid: 3: connected
nodeid: 4: connected
nodeid: 5: connected
nodeid: 6: localhost
nodeid: 7: connected
nodeid: 8: connected
I've literally lost my hair over this. Do I need to remove nodes and add them back to the cluster from the one node that is still showing "OK"?
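For context, my understanding of what that would involve (straight from the cluster docs as I read them; I have not tried it here and would much rather avoid it):

# From a healthy, quorate node, with the node being removed powered off
# (and not allowed to come back up with its old cluster config):
pvecm delnode pxnode5

# Then, after reinstalling that node (or fully wiping its old corosync and
# cluster config), join it back by pointing at an existing member:
pvecm add 10.200.0.16

# pxnode5 / 10.200.0.16 above are just our own names used as an example.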
If this isn't the right sub, my apologies. I will post in another spot.
Thank you.