Network reconfiguration on a cluster node takes down entire cluster (watchdog timeout)

pwizard

Scenario:

We want to go from a single 2x10G network bond on our PVE 7.4 compute nodes to 2x10G+2x25G (and at a later point upgrade to 8.0)

8x 7.4 nodes in a cluster, 5x "compute nodes" prox10/11/12/13/14 (running the VMs, no local storage for VMs) and 3x "storage nodes" proxstore11/12/13 (Ceph, no VMs running on the nodes)

We added the new 25G NICs to the compute nodes and then evacuated each of the 5 nodes one by one: bulk-migrate its VMs to another node, reconfigure /etc/network/interfaces to add the new bond and move the guest bridge vmbr0 over, then run "ifreload -a" to apply the changes.
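Roughly, the per-node change looked like this (a simplified sketch only - the NIC names, bond mode and bridge options here are placeholders, not our exact config):

Code:
# /etc/network/interfaces (excerpt) - placeholder NIC names and options
auto bond1
iface bond1 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0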

This has actually worked fine for all nodes over the course of 1-2 hours.

1 minute after the last of the 5 nodes was ifreloaded - ALL compute nodes hard-rebooted!

We then thought maybe we overlooked the HA watchdog (the lrm / crm services had NOT been stopped on any of the 8 nodes)
-> but then why would ALL nodes reboot? Should it not only be the node that believes itself to be separated from the others?
Why are quorate members rebooting - all at once?

Also, when such things happened in the past we would get an email saying "node XYZ has been fenced" -> no such email for any node was sent this time.

After rebooting, all nodes came back online; the few VMs that had been configured for HA were restarted automatically on one of the nodes, and we could manually and successfully start all of the other affected VMs.
And now nothing further has happened for 15 hours.

Note also:

As 1 of the 5 nodes (prox14) was not running any VMs anyway, because we had just reconfigured its network, and 1 of the remaining 4 nodes (prox10) did not host a "monitored" VM, only 3 out of the 5 (of 8) nodes were in HA state "active", while the others (including the HA master) were (and are) "idle" - why did all compute nodes reboot, and none of the Ceph nodes?

Are we safe now to migrate VMs and so on? What exactly triggered the meltdown?


The last node, prox14, was ifreloaded at 16:32; the nodes crashed at 16:33:26.
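For anyone who wants to cross-check the attached logs, the window we are looking at can be pulled on each node with something like:

Code:
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux \
  --since "2023-10-31 16:25" --until "2023-10-31 16:40"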

Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.116-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-6
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 

Attachments

  • proxstore13-20231031.txt
  • prox10-20231031.txt
  • prox11-20231031.txt
  • prox12-20231031.txt
  • prox13-20231031.txt
  • prox14-20231031.txt
  • proxstore11-20231031.txt
  • proxstore12-20231031.txt
If we suppose the issue was the ha-manager, how could we disarm and get rid of the current HA configuration?

Remove all VM resources from HA (so nothing is protected anymore and nothing should trigger fencing? Then again, as mentioned above, only 3 out of 5 compute nodes were running protected workloads at the time of the issue) - see the command sketch after this list.

Disarm the services as explained here and keep the services disabled even between host reboots?

Remove all of the nodes via ha-manager delnode so that none are left? Is this even possible or sensible?
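For the first option, the removal itself should presumably just be something like this (example VMID):

Code:
ha-manager config          # list what is currently HA-managed
ha-manager remove vm:101   # repeat for every listed resource (example VMID)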
 
All the nodes seem to have logged something like "Oct 31 16:33:26 prox11 pve-ha-crm[1823]: loop take too long (63 seconds)" - is this a sufficient trigger for the watchdog?

And to "disarm" the loaded gun permanently, is it enough to stop and "systemctl disable" lrm and crm services? Or is it going to be automatically re-enabled every time PVE is updated (or at least with major updates 7.4 -> 8.0?)

I'd prefer it if we simply got rid of the HA feature entirely on our Proxmox cluster - is it as simple as deleting /etc/pve/ha once all nodes have LRM/CRM stopped?
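Before deleting anything I would of course first look at what is actually in there, i.e. something like:

Code:
ls -l /etc/pve/ha/
cat /etc/pve/ha/resources.cfg   # HA resource definitions
cat /etc/pve/ha/groups.cfg      # HA groups, if any are defined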
 

Attachments

  • prox11-ha-20231031.txt
  • prox12-ha-20231031.txt
  • prox13-ha-20231031.txt
  • prox14-ha-20231031.txt
  • proxstore11-ha-20231031.txt
  • proxstore12-ha-20231031.txt
  • proxstore13-ha-20231031.txt
  • prox10-ha-20231031.txt
Hello,

If I understand correctly, you want to set up a 16-node cluster. For such a setup you need 9 votes to be quorate; in your logs I see that you had 7-8 votes from the perspective of a few nodes, which is not enough. In the event that you have two halves of the cluster (8 nodes each), both halves will be out of quorum.

If there is any HA guest running on a node without quorum, the node will fence itself after 2 minutes so the guest can be restored on a node with quorum. Since your nodes see at most 8 votes, none of them has quorum, but I am not sure why you didn't get a message in the logs.
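For reference, the vote count and quorum state as each node sees it can be checked with:

Code:
pvecm status              # expected votes, total votes and quorate flag
corosync-quorumtool -s    # the same information directly from corosync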
 
Hello Maximiliano,

I should've reworded my initial sentence:

8x 7.4 nodes in a cluster, consisting of 5x "compute nodes" prox10/11/12/13/14 (running the VMs, no local storage for VMs) and 3x "storage nodes" proxstore11/12/13 (Ceph, no VMs running on the nodes)

It's only ever been 8 nodes in the cluster: 3 of them Ceph storage nodes, not touched at all last week (one of them, proxstore11, was the elected HA master before the outage happened), and 5 of them compute nodes, which we evacuated in turn in order to change the network config and ifreload them.

So most of the time we would have had 8 out of 8 nodes in quorum, and only during the few seconds while a node's network was being reconfigured would it have been 7/8 in quorum with the single node outside of it. None of them declared themselves "out of quorum" and rebooted at that point, though; only 1 minute after the last node was reconfigured did they all reboot at the same time.
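To rule corosync in or out, I assume something like this should show any membership changes or token timeouts around that window:

Code:
journalctl -u corosync --since "2023-10-31 16:20" --until "2023-10-31 16:40" \
  | grep -Ei 'membership|token|quorum'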

Plan for us:

Code:
# 1) on all nodes:
systemctl stop pve-ha-lrm.service
# 2) then, on all nodes:
systemctl stop pve-ha-crm.service
# 3) finally, on all nodes:
systemctl mask pve-ha-lrm.service; systemctl mask pve-ha-crm.service

Is this update-proof against minor and major upgrades?

Best regards,

Patrick
 
Some more info I just noticed:
as long as there were active HA resources (10 VMs) configured, ha-manager status would show the 3 nodes where the VMs resided as "active" and the other 2 compute nodes and the 3 storage nodes as idle.

Now that I've removed all resources from HA, all nodes report "idle" (as they should), and for the 3 nodes that had been running the formerly-HA VMs, journalctl tells me they logged "watchdog closed (disabled)" at that time:

Nov 02 20:23:45 prox11 pve-ha-lrm[3679875]: missing resource configuration for 'vm:263'
Nov 02 20:34:06 prox11 pve-ha-lrm[1846]: node had no service configured for 60 rounds, going idle.
Nov 02 20:34:06 prox11 pve-ha-lrm[1846]: watchdog closed (disabled)
Nov 02 20:34:06 prox11 pve-ha-lrm[1846]: status change active => wait_for_agent_lock

Nov 02 20:33:32 prox12 pve-ha-lrm[1850]: node had no service configured for 60 rounds, going idle.
Nov 02 20:33:32 prox12 pve-ha-lrm[1850]: watchdog closed (disabled)
Nov 02 20:33:32 prox12 pve-ha-lrm[1850]: status change active => wait_for_agent_lock

The other 5 nodes did not log anything from pve-ha-lrm after the big outage on Oct 31.

Current output of ha-manager status:

root@proxstore12:~# ha-manager status
quorum OK
master proxstore11 (active, Thu Nov 9 20:03:43 2023)
lrm prox10 (idle, Thu Nov 9 20:03:45 2023)
lrm prox11 (idle, Thu Nov 9 20:03:45 2023)
lrm prox12 (idle, Thu Nov 9 20:03:45 2023)
lrm prox13 (idle, Thu Nov 9 20:03:45 2023)
lrm prox14 (idle, Thu Nov 9 20:03:45 2023)
lrm proxstore11 (idle, Thu Nov 9 20:03:45 2023)
lrm proxstore12 (idle, Thu Nov 9 20:03:45 2023)
lrm proxstore13 (idle, Thu Nov 9 20:03:45 2023)

As mentioned in the first post, if there had been a quorum issue I would have expected only one node, prox14, to be fenced and rebooted.
If there had been a larger issue, all watchdog-armed nodes might have rebooted - but that should have been limited to the 3 nodes recognized as "active" - prox11, prox12, prox13. Yet prox10 and prox14 rebooted as well, although *their* LRM should have disarmed the watchdog beforehand, as there was no resource there to protect.
Am I correct in my thinking?

Then what other scenario, what other component of PVE, would even signal an immediate hard shutdown/reboot to 5 separate hardware servers, if not the two possibilities listed above?

I'm wondering about that because it might not help to disable the HA LRM and CRM if there is another "loaded gun" mechanism at play here - but how many such dangerous tripwires does Proxmox have?
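For completeness, this is what I intend to check to see what is (still) holding the watchdog armed - the commands are my best guess, corrections welcome:

Code:
systemctl status watchdog-mux                       # the service that opens /dev/watchdog for the HA stack
fuser -v /dev/watchdog                              # which process currently holds the watchdog device open
lsmod | grep -Ei 'softdog|iTCO_wdt|ipmi_watchdog'   # which watchdog kernel module is loaded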

Thanks,

Patrick
 
