Hello,
we have a 3-node PVE cluster (7.4-16) with dedicated cluster interfaces. The cluster interfaces are in an LACP bond, connected to two Cisco Nexus switches.
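For reference, the bond is set up roughly like this in /etc/network/interfaces (a sketch only; the interface names eno1/eno2 and the address are placeholders, not our exact values):
Code:
auto bond0
iface bond0 inet static
        address 10.10.10.12/24
        bond-slaves eno1 eno2
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer2+3
# dedicated cluster bond (LACP) to the two Nexus switches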
Over the weekend our entire HA cluster failed. According to the logs of the other servers, the cluster bond of server 2 was flapping the whole time. According to Corosync, the server repeatedly left the cluster briefly and then rejoined. The last messages from server 2 were:
Code:
watchdog-mux[1073]: Client watchdog expired - disable watchdog updates
watchdog-mux[1073]: Client watchdog expired - disable watchdog updates
watchdog-mux[1073]: leave watchdog-mux with active connections
pve-ha-crm[1867]: Unexpected error - cfs-lock 'domain-ha' error: lock request timed out
pve-ha-crm[1867]: Unexpected error - cfs-lock 'domain-ha' error: lock request timed out
pve-ha-crm[1867]: unexpected error - cfs-lock 'domain-ha' error: lock request timed out
kernel: watchdog: watchdog0: watchdog did not stop!
kernel: watchdog: watchdog0: watchdog did not stop!
corosync[1787]: [KNET ] link: host: 1 link: 0 failed
-- boot 4a50155a4009426e88eb43ccb670676c --
After server 2 rebooted (20:43:10), the other two servers also rebooted.
Last logs from server 1 (reboot at 20:48:26):
Code:
corosync[1801]: [KNET ] link: host: 2 link: 0 failed
corosync[1801]: [KNET ] link: host: 2 link: 0 is down
corosync[1801]: [KNET ] link: host: 2 link: 0 has failed
corosync[1801]: [KNET ] rx: host: 2 link: 0 is up
corosync[1801]: [KNET ] link: Reset MTU for link 0 because host 2 has joined
corosync[1801]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
watchdog-mux[1111]: Client watchdog expired - disable watchdog updates
-- boot 0a07ec692a6547d2b8f13e6d268a9277 --
Last logs from server 3 (reboot at 20:48:30):
Code:
corosync[1803]: [TOTEM ] Retransmit List: b c e f 12 13 16 17 1c 1d 1f 20 23 24 27
corosync[1803]: [TOTEM ] Retransmit List: b c e f 12 13 16 17 1c 1d 1f 20 23 24 27
corosync[1803]: [TOTEM ] Retransmit List: b c e f 12 13 16 17 1c 1d 1f 20 23 24 27
corosync[1803]: [KNET ] link: host: 2 link: 0 is down
corosync[1803]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
corosync[1803]: [KNET ] host: host: 2 has no active links
-- boot 93a4f04465f04303b286929154cf8df5 --
We replaced the server's SFPs and there are no more flaps. According to our network department, one SFP had died. The retransmission errors on server 3 are also gone.
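In case it helps anyone debugging similar flapping, this is roughly how the link state can be checked (assuming the cluster bond is named bond0):
Code:
# knet link state per host, as seen by the local corosync
corosync-cfgtool -s
# LACP / slave state of the bond
cat /proc/net/bonding/bond0
# packet and error counters on the bond
ip -s link show bond0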
Can anyone tell me why all three servers rebooted? According to the logs, only server 2 left the quorum, and with two nodes the cluster should still be quorate.
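My understanding of the quorum math (please correct me if I am wrong): with 3 votes, quorum is 3 / 2 + 1 = 2, so two remaining nodes should keep the cluster quorate. A sketch of how we check this (output omitted):
Code:
# expected votes, total votes and quorum of the cluster
pvecm status
# the same information directly from corosync
corosync-quorumtool -s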
I have already looked through about 10-20 forum posts, and none of them explains the cause to me.