What's the actual, official way to reboot nodes in an HA-enabled cluster?

I'm asking because this old answer appears to be incorrect.

We have a cluster with nine nodes. HA is enabled, with three rings. When it's time to upgrade packages, we do the following:

  1. Remove a node from all HA groups
  2. Wait for it to be fully evacuated
  3. apt update && apt upgrade
  4. Reboot and wait for it to reappear in the Proxmox UI
  5. Add it to the HA groups again
  6. Rinse and repeat.
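Concretely, one pass looks roughly like this (group and node names are just examples, and the exact ha-manager invocation may differ depending on how your groups are defined):

Code:
# drop the node being upgraded from the HA group by re-setting the node list
ha-manager groupset example-group --nodes "proxmox0,proxmox1,proxmox2"

# wait until no HA resources remain on that node
ha-manager status | grep proxmox3

# upgrade and reboot the node itself
apt update && apt upgrade
reboot

# once it is back in the UI, re-add it to the group
ha-manager groupset example-group --nodes "proxmox0,proxmox1,proxmox2,proxmox3"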
Without exception, when we're about halfway through, one node will throw watchdog-mux[2159]: client watchdog expired - disable watchdog updates, fence, and reboot. If we're really unlucky, this then triggers a reboot of all the nodes. So, obviously, there needs to be more to it than "systemd is responsible to shutdown services", because that's apparently not happening gracefully.

So. What's the recommended procedure when one wants to reboot all nodes in sequence?
 
Hi,

FYI: The "remove from group before reboot and re-add after" part may be obsoleted by using the "migrate" shutdown policy:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_node_maintenance
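If you want to try that, the policy can be set under Datacenter -> Options in the GUI, or directly in the cluster-wide config, roughly like this:

Code:
# /etc/pve/datacenter.cfg
ha: shutdown_policy=migrate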

Without exception, when we're about halfway through, one node will throw watchdog-mux[2159]: client watchdog expired - disable watchdog updates, fence, and reboot. If we're really unlucky, this then triggers a reboot of all the nodes. So, obviously, there needs to be more to it than "systemd is responsible to shutdown services", because that's apparently not happening gracefully.
No, that should be it, and FWIW I use that all the time in my clusters. Maybe some other service or characteristic of the system interferes with the graceful shutdown.

Can you post the journal from such a shutdown, maybe we can figure something out from that?
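For example, exporting the window around the reboot with journalctl should give us what we need (adjust the timestamps to cover the planned and the unexpected reboot):

Code:
journalctl --since "2021-10-06 16:00" --until "2021-10-06 17:35" > journal.txt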
Also, which Proxmox VE version are you using? pveversion -v
 
Can you post the journal from such a shutdown, maybe we can figure something out from that?

Absolutely. How do I generate the output you need?

Also, which Proxmox VE version are you using? pveversion -v

From a node which got upgraded:

Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.4-1-pve: 4.13.4-26
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1

From a node which didn't:

Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-4 (running version: 6.4-4/337d6701)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-1
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Naturally, we're somewhat reluctant to continue, since we don't want any more downtime.
 
If it's simply the journal on the unexpectedly rebooting node between the end of the planned reboot and the end of the unexpected reboot, this is it:

Code:
Oct  6 17:26:52 proxmox3 pmxcfs[5842]: [status] notice: received log
Oct  6 17:27:00 proxmox3 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:27:01 proxmox3 systemd[1]: pvesr.service: Succeeded.
Oct  6 17:27:01 proxmox3 systemd[1]: Started Proxmox VE replication runner.
Oct  6 17:27:38 proxmox3 [sssd[ldap_child[13813]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Realm not local to KDC. Unable to create GSSAPI-encrypted LDAP connection.
Oct  6 17:27:38 proxmox3 [sssd[ldap_child[13815]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Realm not local to KDC. Unable to create GSSAPI-encrypted LDAP connection.
Oct  6 17:28:00 proxmox3 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:28:01 proxmox3 systemd[1]: pvesr.service: Succeeded.
Oct  6 17:28:01 proxmox3 systemd[1]: Started Proxmox VE replication runner.
Oct  6 17:28:45 proxmox3 kernel: [ 3824.686047] libceph: osd9 (1)REDACTED:6808 socket closed (con state OPEN)
Oct  6 17:29:00 proxmox3 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:29:01 proxmox3 systemd[1]: pvesr.service: Succeeded.
Oct  6 17:29:01 proxmox3 systemd[1]: Started Proxmox VE replication runner.
Oct  6 17:29:01 proxmox3 [sssd[ldap_child[20178]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Realm not local to KDC. Unable to create GSSAPI-encrypted LDAP connection.
Oct  6 17:29:01 proxmox3 [sssd[ldap_child[20230]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Realm not local to KDC. Unable to create GSSAPI-encrypted LDAP connection.
Oct  6 17:29:48 proxmox3 watchdog-mux[2159]: client watchdog expired - disable watchdog updates
Oct  6 17:32:27 proxmox3 systemd-modules-load[9077]: Inserted module 'lp'
 
Same time span, different node:

Code:
Oct  6 17:29:00 proxmox0 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:29:01 proxmox0 systemd[1]: pvesr.service: Succeeded.
Oct  6 17:29:01 proxmox0 systemd[1]: Started Proxmox VE replication runner.
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] link: host: 4 link: 0 is down
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] link: host: 4 link: 1 is down
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] host: host: 4 has no active links
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct  6 17:29:58 proxmox0 corosync[6718]:   [KNET  ] host: host: 4 has no active links
Oct  6 17:30:00 proxmox0 systemd[1]: Starting Proxmox VE replication runner...
Oct  6 17:30:01 proxmox0 corosync[6718]:   [TOTEM ] Token has not been received in 5662 ms
Oct  6 17:30:03 proxmox0 corosync[6718]:   [TOTEM ] A processor failed, forming new configuration: token timed out (7550ms), waiting 9060ms for consensus.
Oct  6 17:30:12 proxmox0 corosync[6718]:   [QUORUM] Sync members[8]: 1 2 3 5 6 7 8 9
Oct  6 17:30:12 proxmox0 corosync[6718]:   [QUORUM] Sync left[1]: 4
Oct  6 17:30:12 proxmox0 corosync[6718]:   [TOTEM ] A new membership (1.359) was formed. Members left: 4
Oct  6 17:30:12 proxmox0 corosync[6718]:   [TOTEM ] Failed to receive the leave message. failed: 4
Oct  6 17:30:12 proxmox0 pmxcfs[6579]: [dcdb] notice: members: 1/5928, 2/5779, 3/6151, 5/3485, 6/6579, 7/3564, 8/4004, 9/4620
Oct  6 17:30:12 proxmox0 pmxcfs[6579]: [dcdb] notice: starting data syncronisation
Oct  6 17:30:12 proxmox0 pmxcfs[6579]: [status] notice: members: 1/5928, 2/5779, 3/6151, 5/3485, 6/6579, 7/3564, 8/4004, 9/4620
Oct  6 17:30:12 proxmox0 pmxcfs[6579]: [status] notice: starting data syncronisation
Oct  6 17:30:12 proxmox0 corosync[6718]:   [QUORUM] Members[8]: 1 2 3 5 6 7 8 9
 
Oh, and of course this might be good to know: 16:25 is when proxmox3 went online after its planned reboot. 17:32 is when it went online after its unexpected reboot.
 
If it's simply the journal on the unexpectedly rebooting node between the end of the planned reboot and the end of the unexpected reboot, this is it:
We see the watchdog-mux working OK here: the clients closed their connections, so it has to stop updating the actual watchdog. The real question is why the clients (pve-ha-lrm and pve-ha-crm) stopped updating the watchdog without closing it gracefully, and that happens earlier in the log.

So, more from the past would be interesting.
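For the HA services specifically, something along these lines should pull the relevant history (adjust the time range as needed):

Code:
journalctl -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux --since "2021-10-06 16:00" --until "2021-10-06 17:32"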
 
I ran into the 15000 character limit, so I'm attaching two log excerpts from 16:00 to 17:32 for pve-ha-lrm and pve-ha-crm, respectively. Is that far enough into the past?
 

Attachments

  • crm.log (1,019 bytes)
  • lrm.log (61.7 KB)
Hi there, I just stumbled upon this thread having experienced a very similar (if not identical) situation:
I had updated one node (that is running pve-backup and has no VMs) already and had initiated its reboot.
I also started moving VMs from one node to another, in order to perform the update on each of the remaining nodes.
I got client watchdog expired - disable watchdog updates on two nodes which led to all remaining nodes rebooting (three nodes with VMs and one just for pve-backup).
I am using PVE 7, though. If it is helpful, I can post my log stuff here or should I open a separate thread?
 
Hi there, I just stumbled upon this thread having experienced a very similar (if not identical) situation:
I had updated one node (that is running pve-backup and has no VMs) already and had initiated its reboot.
I also started moving VMs from one node to another, in order to perform the update on each of the remaining nodes.
I got client watchdog expired - disable watchdog updates on two nodes which led to all remaining nodes rebooting (three nodes with VMs and one just for pve-backup).
I am using PVE 7, though. If it is helpful, I can post my log stuff here or should I open a separate thread?
that might have been a variant of https://bugzilla.proxmox.com/show_bug.cgi?id=3672 - do the logs contain lots of 'cpg_join' and/or 'cpg_send' retry messages?
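a quick way to check is something along these lines (units and time range are just an example):

Code:
journalctl -u pve-cluster -u corosync --since yesterday | grep -E 'cpg_join|cpg_send_message'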
 
yeah, stopping the HA services (first LRM on all nodes, then CRM on all nodes - this disables HA, but also disarms the watchdog so you don't risk fencing), then upgrading all nodes (this will automatically restart corosync, which will pick up the fixed libknet as well), then starting the HA services again (again first LRM on all nodes, then CRM).

that being said, it should take quite a few restarts to trigger again, statistically speaking, but of course you might be unlucky / have just the right network setup.
 
yeah, stopping the HA services (first LRM on all nodes, then CRM on all nodes - this disables HA, but also disarms the watchdog so you don't risk fencing), then upgrading all nodes (this will automatically restart corosync, which will pick up the fixed libknet as well), then starting the HA services again (again first LRM on all nodes, then CRM).
good to know.
that being said, it should take quite a few restarts to trigger again, statistically speaking, but of course you might be unlucky / have just the right network setup.
I have done that sort of thing lots of times and never had this crash, so it might just have been 'luck'. But as long as there is a fix... :)

Thanks!!
 
packages are now on pve-no-subscription, please see the following extra information if you have been affected by a full-cluster-crash with the mentioned symptoms (one node rebooting/upgrading/restarting corosync -> cpg_join/cpg_send_message retry log entries followed by watchdog expiring on all nodes):

https://bugzilla.proxmox.com/show_bug.cgi?id=3672

for people who triggered this easily because of their particular load/network situation, it might be advisable to follow the procedure below to avoid triggering it again when installing the fixed versions:

stop the HA services (first LRM on all nodes, then CRM on all nodes - this disables HA, but also disarms the watchdog so you don't risk fencing)

then upgrade all nodes (this will automatically restart corosync, which will pick up the fixed libknet as well)

then start the HA services again (again first LRM on all nodes, then CRM) to re-enable HA features
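in terms of commands that is roughly the following (complete each step on all nodes before moving on to the next):

Code:
# step 1: first stop the LRM on every node, then the CRM on every node
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# step 2: upgrade every node (this restarts corosync and picks up the fixed libknet)
apt update && apt dist-upgrade

# step 3: start the LRM on every node again, then the CRM
systemctl start pve-ha-lrm
systemctl start pve-ha-crm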
 
I had seen the packages this morning and followed your instructions from yesterday. I was able to upgrade the remaining nodes without trouble.
One question remains for me: Can you clarify what 'particular load/network' means? Bandwidth? Using the same (physical) network for PVE cluster traffic as well as e.g. ceph?
 
we had reports that triggered with the corosync link on a bond shared with other usage, we had reports where the logs pointed to MTU-related issues, we had other reports where one corosync link was shared with Ceph traffic. the reproducing environment in the end had a severely rate-limited link on one node. the root cause was a bug in knet when it received a data packet before a ping packet in just the right circumstances. since kronosnet traffic is all UDP there are many scenarios in which this could theoretically happen (there are no ordering guarantees, and the part in knet that is responsible for handling the sequence numbers inside the messages was exactly where the bug was ;)).
 
we had reports that triggered with the corosync link on a bond shared with other usage, we had reports where the logs pointed to MTU-related issues, we had other reports where one corosync link was shared with Ceph traffic. the reproducing environment in the end had a severely rate-limited link on one node. the root cause was a bug in knet when it received a data packet before a ping packet in just the right circumstances. since kronosnet traffic is all UDP there are many scenarios in which this could theoretically happen (there are no ordering guarantees, and the part in knet that is responsible for handling the sequence numbers inside the messages was exactly where the bug was ;)).
Thank you. I do have Ceph and PVE using the same physical 10G network link (although separated by VLAN). However, bandwidth utilization was not even above 25% around the time the crash occurred.
 
yeah, but it's possible the ping packets were delayed for some reason even with bandwidth to spare - the join message is rather big compared to the heartbeat traffic, so you never know where in the stack they might be treated differently/fragmented/....
 
yeah, but it's possible the ping packets were delayed for some reason even with bandwidth to spare - the join message is rather big compared to the heartbeat traffic, so you never know where in the stack they might be treated differently/fragmented/....
Makes sense. Thanks for the explanation.
 
