What's the actual, official way to reboot nodes in an HA-enabled cluster?

konaya · Oct 7, 2021

I'm asking because this old answer appears to be incorrect.

We have a cluster with nine nodes. HA is enabled, with three rings. When it's time to upgrade packages, we do the following:

1. Remove a node from all HA groups
2. Wait for it to be fully evacuated
3. apt update && apt upgrade
4. Reboot and wait for it to reappear in the Proxmox UI
5. Add it to the HA groups again
6. Rinse and repeat.

Without exceptions, when we're about halfway through, one node will throw watchdog-mux[2159]: client watchdog expired - disable watchdog updates, fence, and reboot. If we're really unlucky, this then triggers a reboot of all the nodes. So, obviously, there needs to be more to it than "systemd is responsible to shutdown services", because that's apparently not happening gracefully.

So. What's the recommended procedure when one wants to reboot all nodes in sequence?

t.lamprecht · Oct 7, 2021

Hi,

FYI: The "remove from group before reboot and re-add after" part may be obsoleted by using the "migrate" shutdown policy:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_node_maintenance
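
A minimal sketch for checking which shutdown policy a cluster currently uses. It assumes the policy is stored as an "ha: shutdown_policy=..." line in /etc/pve/datacenter.cfg, as the linked documentation describes; the function name is made up for illustration.

```shell
#!/bin/sh
# Print the HA shutdown policy found in a datacenter.cfg-style file.
# Assumption: the setting looks like "ha: shutdown_policy=migrate".
# Prints nothing if no policy is set explicitly.
current_shutdown_policy() {
    cfg="${1:-/etc/pve/datacenter.cfg}"
    sed -n 's/^ha:.*shutdown_policy=\([a-z]*\).*/\1/p' "$cfg"
}
```

Called without an argument on a node, it reads the cluster-wide file. An empty result means no policy is set explicitly, so switching to "migrate" would mean adding or editing that ha: line.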

konaya said:
Without exceptions, when we're about halfway through, one node will throw watchdog-mux[2159]: client watchdog expired - disable watchdog updates, fence, and reboot. If we're really unlucky, this then triggers a reboot of all the nodes. So, obviously, there needs to be more to it than "systemd is responsible to shutdown services", because that's apparently not happening gracefully.

No, that should be it, and fwiw, I use that all the time in my clusters. Maybe some other service or characteristic of the system interferes with the graceful shutdown.

Can you post the journal from such a shutdown, maybe we can figure something out from that?
Also, which Proxmox VE version are you using? pveversion -v
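
One way to collect such a journal, sketched under a few assumptions: the node keeps a persistent systemd journal, and the interesting units are pve-ha-lrm, pve-ha-crm, watchdog-mux, corosync and pve-cluster. The function name and the JOURNALCTL override are only for illustration.

```shell
#!/bin/sh
# JOURNALCTL can be overridden (e.g. with "echo") to preview the command.
JOURNALCTL="${JOURNALCTL:-journalctl}"

# Dump HA-related journal entries between two timestamps.
# $1 = start, $2 = end, in a format journalctl accepts,
# for example "2021-10-06 16:00".
ha_journal() {
    $JOURNALCTL --no-pager --since "$1" --until "$2" \
        -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux \
        -u corosync -u pve-cluster
}
```

For example, ha_journal "2021-10-06 16:00" "2021-10-06 17:32" > ha-journal.txt writes the window to a file that can be attached to a post. Plain journalctl -b -1 shows everything from the previous boot instead.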

konaya · Oct 7, 2021

t.lamprecht said:
Can you post the journal from such a shutdown, maybe we can figure something out from that?

Absolutely. How do I generate the output you need?

t.lamprecht said:
Also, which Proxmox VE version are you using? pveversion -v

From a node which got upgraded:

Code:

proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.4-1-pve: 4.13.4-26
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1

From a node which didn't:

Code:

proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-4 (running version: 6.4-4/337d6701)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-1
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Naturally, we're somewhat reluctant to continue, since we don't want any more downtime.

konaya · Oct 7, 2021

If it's simply the journal on the unexpectedly rebooting node between the end of the planned reboot and the end of the unexpected reboot, this is it:

Code:

Oct  6 17:26:52 proxmox3 pmxcfs[5842]: [status] notice: received log
Oct  6 17:27:00 proxmox3 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:27:01 proxmox3 systemd[1]: pvesr.service: Succeeded.
Oct  6 17:27:01 proxmox3 systemd[1]: Started Proxmox VE replication runner.
Oct  6 17:27:38 proxmox3 [sssd[ldap_child[13813]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Realm not local to KDC. Unable to create GSSAPI-encrypted LDAP connection.
Oct  6 17:27:38 proxmox3 [sssd[ldap_child[13815]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Realm not local to KDC. Unable to create GSSAPI-encrypted LDAP connection.
Oct  6 17:28:00 proxmox3 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:28:01 proxmox3 systemd[1]: pvesr.service: Succeeded.
Oct  6 17:28:01 proxmox3 systemd[1]: Started Proxmox VE replication runner.
Oct  6 17:28:45 proxmox3 kernel: [ 3824.686047] libceph: osd9 (1)REDACTED:6808 socket closed (con state OPEN)
Oct  6 17:29:00 proxmox3 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:29:01 proxmox3 systemd[1]: pvesr.service: Succeeded.
Oct  6 17:29:01 proxmox3 systemd[1]: Started Proxmox VE replication runner.
Oct  6 17:29:01 proxmox3 [sssd[ldap_child[20178]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Realm not local to KDC. Unable to create GSSAPI-encrypted LDAP connection.
Oct  6 17:29:01 proxmox3 [sssd[ldap_child[20230]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Realm not local to KDC. Unable to create GSSAPI-encrypted LDAP connection.
Oct  6 17:29:48 proxmox3 watchdog-mux[2159]: client watchdog expired - disable watchdog updates
Oct  6 17:32:27 proxmox3 systemd-modules-load[9077]: Inserted module 'lp'

konaya · Oct 7, 2021

Same time span, different node:

Code:

Oct  6 17:29:00 proxmox0 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:29:01 proxmox0 systemd[1]: pvesr.service: Succeeded.
Oct  6 17:29:01 proxmox0 systemd[1]: Started Proxmox VE replication runner.
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] link: host: 4 link: 0 is down
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] link: host: 4 link: 1 is down
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] host: host: 4 has no active links
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] host: host: 4 has no active links
Oct  6 17:30:00 proxmox0 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:30:01 proxmox0 corosync[6718]:   [TOTEM ] Token has not been received in 5662 ms
Oct  6 17:30:03 proxmox0 corosync[6718]:   [TOTEM ] A processor failed, forming new configuration: token timed out (7550ms), waiting 9060ms for consensus.
Oct  6 17:30:12 proxmox0 corosync[6718]:   [QUORUM] Sync members[8]: 1 2 3 5 6 7 8 9
Oct  6 17:30:12 proxmox0 corosync[6718]:   [QUORUM] Sync left[1]: 4
Oct  6 17:30:12 proxmox0 corosync[6718]:   [TOTEM ] A new membership (1.359) was formed. Members left: 4
Oct  6 17:30:12 proxmox0 corosync[6718]:   [TOTEM ] Failed to receive the leave message. failed: 4
Oct  6 17:30:12 proxmox0 pmxcfs[6579]: [dcdb] notice: members: 1/5928, 2/5779, 3/6151, 5/3485, 6/6579, 7/3564, 8/4004, 9/4620
Oct  6 17:30:12 proxmox0 pmxcfs[6579]: [dcdb] notice: starting data syncronisation
Oct  6 17:30:12 proxmox0 pmxcfs[6579]: [status] notice: members: 1/5928, 2/5779, 3/6151, 5/3485, 6/6579, 7/3564, 8/4004, 9/4620
Oct  6 17:30:12 proxmox0 pmxcfs[6579]: [status] notice: starting data syncronisation
Oct  6 17:30:12 proxmox0 corosync[6718]:   [QUORUM] Members[8]: 1 2 3 5 6 7 8 9

konaya · Oct 7, 2021

Oh, and of course this might be good to know: 16:25 is when proxmox3 went online after its planned reboot. 17:32 is when it went online after its unexpected reboot.

t.lamprecht · Oct 7, 2021

konaya said:
If it's simply the journal on the unexpectedly rebooting node between the end of the planned reboot and the end of the unexpected reboot, this is it:

We see watchdog-mux working OK here: the clients closed connections, so it has to stop updating the actual watchdog. The real question is why the clients (pve-ha-lrm and pve-ha-crm) stopped updating the watchdog without closing it gracefully, and that happens earlier.

So, logs from further back would be interesting.

konaya · Oct 7, 2021

I ran into the 15000 character limit, so I'm attaching two log excerpts from 16:00 to 17:32 for pve-ha-lrm and pve-ha-crm, respectively. Is that far enough into the past?

konaya · Oct 13, 2021

Was that time span far enough into the past, Thomas?

Waschbüsch · Nov 9, 2021

Hi there, I just stumbled upon this thread having experienced a very similar (if not identical) situation:
I had updated one node (that is running pve-backup and has no VMs) already and had initiated its reboot.
I also started moving VMs from one node to another, in order to perform the update on each of the remaining nodes.
I got client watchdog expired - disable watchdog updates on two nodes which led to all remaining nodes rebooting (three nodes with VMs and one just for pve-backup).
I am using PVE 7, though. If it is helpful, I can post my log stuff here or should I open a separate thread?

fabian · Nov 9, 2021

Waschbüsch said:
Hi there, I just stumbled upon this thread having experienced a very similar (if not identical) situation:
I had updated one node (that is running pve-backup and has no VMs) already and had initiated its reboot.
I also started moving VMs from one node to another, in order to perform the update on each of the remaining nodes.
I got client watchdog expired - disable watchdog updates on two nodes which led to all remaining nodes rebooting (three nodes with VMs and one just for pve-backup).
I am using PVE 7, though. If it is helpful, I can post my log stuff here or should I open a separate thread?

that might have been a variant of https://bugzilla.proxmox.com/show_bug.cgi?id=3672 - do the logs contain lots of 'cpg_join' and/or 'cpg_send' retry messages?
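
A quick way to check for those messages, as a rough sketch: the function below counts lines on standard input that mention cpg_join or cpg_send. The function name is made up, and the pattern is deliberately loose because the exact wording of the log messages is not shown in this thread.

```shell
#!/bin/sh
# Count log lines mentioning cpg_join or cpg_send (reads stdin).
# grep -c prints 0 and exits non-zero when nothing matches,
# so "|| true" keeps the function's exit status clean.
count_cpg_retries() {
    grep -cE 'cpg_(join|send)' || true
}
```

Typical use would be journalctl -b -1 --no-pager | count_cpg_retries, or count_cpg_retries < /var/log/syslog on nodes that still write a syslog file.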

Waschbüsch · Nov 9, 2021

fabian said:
that might have been a variant of https://bugzilla.proxmox.com/show_bug.cgi?id=3672 - do the logs contain lots of 'cpg_join' and/or 'cpg_send' retry messages?

Indeed, they do. Lots of them. Thanks!
I guess I will wait for the corosync and libknet package updates before doing anything else?
Can these be upgraded without risking more downtime? E.g. any order / steps to watch out for?

fabian · Nov 9, 2021

yeah, stopping the HA services (first LRM on all nodes, then CRM on all nodes - this disables HA, but also disarms the watchdog so you don't risk fencing), then upgrading all nodes (this will automatically restart corosync, which will pick up the fixed libknet as well), then starting the HA services again (again first LRM on all nodes, then CRM).

that being said, it should take quite a bit of restarts to trigger again statistically speaking, but of course you might be unlucky / have just the right network setup..

Waschbüsch · Nov 9, 2021

fabian said:
yeah, stopping the HA services (first LRM on all nodes, then CRM on all nodes - this disables HA, but also disarms the watchdog so you don't risk fencing), then upgrading all nodes (this will automatically restart corosync, which will pick up the fixed libknet as well), then starting the HA services again (again first LRM on all nodes, then CRM).

good to know.

fabian said:
that being said, it should take quite a bit of restarts to trigger again statistically speaking, but of course you might be unlucky / have just the right network setup..

I have done that sort of thing lots of times and never had this crash, so it might just have been 'luck'. But as long as there is a fix...

Thanks!!

fabian · Nov 10, 2021

packages are now on pve-no-subscription, please see the following extra information if you have been affected by a full-cluster-crash with the mentioned symptoms (one node rebooting/upgrading/restarting corosync -> cpg_join/cpg_send_message retry log entries followed by watchdog expiring on all nodes):

https://bugzilla.proxmox.com/show_bug.cgi?id=3672

for people who triggered this easily because of their particular load/network situation, it might be advisable to use the following procedure to avoid triggering it again when installing the fixed versions:

stop the HA services (first LRM on all nodes, then CRM on all nodes - this disables HA, but also disarms the watchdog so you don't risk fencing)

then upgrade all nodes (this will automatically restart corosync, which will pick up the fixed libknet as well)

then start the HA services again (again first LRM on all nodes, then CRM) to re-enable HA features
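
The stop/start ordering above could be scripted roughly as follows. This is a sketch, not an official tool: the node names are placeholders, root SSH access to every node is assumed, and pve-ha-lrm and pve-ha-crm are assumed to be the systemd unit names of the LRM and CRM services. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Placeholder node list; replace with your own hostnames.
NODES="${NODES:-proxmox0 proxmox1 proxmox2}"
# RUN defaults to "echo" (dry run). Set RUN="" to really execute.
RUN="${RUN-echo}"

# $1 = "stop" or "start".
# Handles the LRM on all nodes first, then the CRM on all nodes.
ha_services() {
    action="$1"
    for svc in pve-ha-lrm pve-ha-crm; do
        for node in $NODES; do
            $RUN ssh "root@$node" systemctl "$action" "$svc"
        done
    done
}
```

The idea would be to run ha_services stop, upgrade every node, and then run ha_services start. Checking the dry-run output first is cheap insurance, since stopping these services disables HA for the whole cluster.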

Waschbüsch · Nov 10, 2021

I had seen the packages this morning and followed your instructions from yesterday. I could upgrade the remaining nodes without trouble.
One question remains for me: Can you clarify what 'particular load/network' means? Bandwidth? Using the same (physical) network for PVE cluster traffic as well as e.g. ceph?

fabian · Nov 10, 2021

we had reports that triggered with the corosync link on a bond shared with other usage, we had reports where the logs pointed to MTU-related issues, we had other reports where one corosync link was shared with Ceph traffic. the reproducing environment in the end had a severely rate-limited link on one node. the root cause was a bug in knet when it received a data packet before a ping packet in just the right circumstances. since kronosnet traffic is all UDP there are many scenarios in which this could theoretically happen (there are no ordering guarantees, and the part in knet that is responsible for handling the sequence numbers inside the messages was exactly where the bug was).

Waschbüsch · Nov 10, 2021

fabian said:
we had reports that triggered with the corosync link on a bond shared with other usage, we had reports where the logs pointed to MTU-related issues, we had other reports where one corosync link was shared with Ceph traffic. the reproducing environment in the end had a severely rate-limited link on one node. the root cause was a bug in knet when it received a data packet before a ping packet in just the right circumstances. since kronosnet traffic is all UDP there are many scenarios in which this could theoretically happen (there are no ordering guarantees, and the part in knet that is responsible for handling the sequence numbers inside the messages was exactly where the bug was).

Thank you. I do have Ceph and PVE using the same physical 10G network link (although separated by VLAN). However, the bandwidth was not even above 25% around the time the crash occurred.

fabian · Nov 10, 2021

yeah, but it's possible the ping packets were delayed for some reason even with bandwidth to spare - the join message is rather big compared to the heartbeat traffic, so you never know where in the stack they might be treated differently/fragmented/....

Waschbüsch · Nov 10, 2021

fabian said:
yeah, but it's possible the ping packets were delayed for some reason even with bandwidth to spare - the join message is rather big compared to the heartbeat traffic, so you never know where in the stack they might be treated differently/fragmented/....

Makes sense. Thanks for the explanation.
