[SOLVED] HA Cluster: One node goes down, all other nodes reboot

luis01

Hello,

We have a 3-node PVE cluster (7.4-16) with separate cluster interfaces. The cluster interfaces are in a bond (LACP) and are connected to two Cisco Nexus switches.

Over the weekend our complete HA cluster failed. According to the logs of the other servers, the cluster bond of server 2 was flapping all the time. According to Corosync, the server always left the cluster briefly and then rejoined it. The last messages from the server were:

Code:
watchdog-mux[1073]: Client watchdog expired - disable watchdog updates
watchdog-mux[1073]: Client watchdog expired - disable watchdog updates
watchdog-mux[1073]: leave watchdog-mux with active connections
pve-ha-crm[1867]: Unexpected error - cfs-lock 'domain-ha' error: lock request timed out
pve-ha-crm[1867]: Unexpected error - cfs-lock 'domain-ha' error: lock request timed out
pve-ha-crm[1867]: unexpected error - cfs-lock 'domain-ha' error: lock request timed out
kernel: watchdog: watchdog0: watchdog did not stop!
kernel: watchdog: watchdog0: watchdog did not stop!
corosync[1787]:   [KNET ] link: host: 1 link: 0 failed
-- boot 4a50155a4009426e88eb43ccb670676c --

After the server rebooted (20:43:10), the other two servers also rebooted.
Last logs from server 1 (reboot at 20:48:26)
Code:
corosync[1801]:   [KNET ] link: host: 2 link: 0 failed
corosync[1801]:   [KNET ] link: host: 2 link: 0 is down
corosync[1801]:   [KNET ] link: host: 2 link: 0 has failed
corosync[1801 ]:   [KNET ] rx: host: 2 link: 0 is up
corosync[1801 ]:   [KNET ] link: Reset MTU for link 0 because host 2 has joined
corosync[1801 ]:   [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
watchdog-mux[1111]: Client watchdog expired - disable watchdog updates
-- boot 0a07ec692a6547d2b8f13e6d268a9277 --

Last logs from server 3 (restart at 20:48:30)
Code:
corosync[1803]:   [TOTEM ] Retransmit List: b c e f 12 13 16 17 1c 1d 1f 20 23 24 27
corosync[1803]:   [TOTEM ] Retransmit List: b c e f 12 13 16 17 1c 1d 1f 20 23 24 27
corosync[1803]:   [TOTEM ] Retransmit List: b c e f 12 13 16 17 1c 1d 1f 20 23 24 27
corosync[1803]:   [KNET ] link: host: 2 link: 0 is down
corosync[1803]:   [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
corosync[1803]:   [KNET ] host: host: 2 has no active links
-- boot 93a4f04465f04303b286929154cf8df5 --

We changed the SFPs of the server and now there are no more flaps. According to our network department, one SFP died. The retransmission errors on server 3 are gone as well.

Can anyone tell me why all 3 nodes rebooted? According to the logs, only server 2 left the quorum, and with 2 nodes the cluster should still be working.
I have already looked at about 10-20 forum posts and none of them explains the cause to me.
 
If the link was flapping, it is possible that corosync was not able to properly establish any quorum. Only full logs from all three nodes would tell.
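As a rough way to gauge how bad the flapping was once the full logs are at hand, something like the following could tally the KNET link transitions per host. This is only a sketch: the regular expression is modelled on the `[KNET ]` lines quoted in this thread, and `count_link_transitions` is a hypothetical helper name, not part of any Proxmox or corosync tooling.

```python
import re
from collections import Counter

# Modelled on the corosync lines quoted in this thread, e.g.
# "corosync[1801]:   [KNET ] link: host: 2 link: 0 is down".
# Assumption: only "is down" / "is up" lines are counted as transitions.
LINK_RE = re.compile(
    r"corosync\[\d+\s*\]:\s+\[KNET\s*\]\s+(?:link|rx): "
    r"host: (\d+) link: (\d+) (is down|is up)"
)


def count_link_transitions(lines):
    """Count 'is down' / 'is up' events per (host, link, state)."""
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            host, link, state = int(match.group(1)), int(match.group(2)), match.group(3)
            counts[(host, link, state)] += 1
    return counts


# A few lines taken from the excerpts in the first post.
sample = [
    "corosync[1801]:   [KNET ] link: host: 2 link: 0 is down",
    "corosync[1801 ]:   [KNET ] rx: host: 2 link: 0 is up",
    "corosync[1803]:   [KNET ] link: host: 2 link: 0 is down",
    "corosync[1803]:   [KNET ] host: host: 2 has no active links",
]

counts = count_link_transitions(sample)
for key, n in sorted(counts.items()):
    print(key, n)
```

Feeding it the corosync lines from each node's syslog for the day of the failure should show whether one host's link dominates the transitions; a high count on a single host would be consistent with a failing SFP on that node.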
 
Hello,

thank you for the quick reply.
Attached are the logs from the 3 servers (syslog from the day of the failure).

The hostnames were simply changed to server1, server2 and server3.
 

Attachments: syslog.zip (698 KB)

Yep, that is exactly what was happening (the problems actually started earlier and persisted after the fencing, but with less fallout by pure chance).
 
Ok thanks.
The flapping on server2 and retransmissions on server3 caused the cluster failure and thus the loss of quorum?
 
Yeah, the HA stack requires /etc/pve to be writable, and that requires an established quorum and a working corosync. The flapping in your case was so bad that corosync was not able to stabilize. /etc/pve was never writable when the HA stack needed it to be, so the watchdogs expired and the nodes fenced themselves.
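To make that chain of events a bit more concrete, here is a toy model in Python. It is not how the HA stack is implemented. It only illustrates the reasoning: quorum is a strict majority of votes, and a node whose periodic refresh never lands on a quorate moment will run into the watchdog even if it was quorate part of the time. The function names, the 10 second check interval and the 60 second timeout are assumptions chosen for illustration.

```python
def has_quorum(votes_present, expected_votes):
    """Majority rule: strictly more than half of the expected votes."""
    return votes_present > expected_votes // 2


def first_fence_second(quorate, check_every, watchdog_timeout):
    """Return the second at which the watchdog would expire, or None.

    quorate: one bool per second, True if the node was quorate (and the
    cluster filesystem writable) during that second.
    check_every: hypothetical interval at which the HA service tries to
    refresh its lock and, on success, the watchdog.
    watchdog_timeout: hypothetical seconds without a successful refresh
    before the node fences itself.
    """
    last_ok = 0
    for t in range(1, len(quorate) + 1):
        # A refresh only succeeds if the node is quorate at that moment.
        if t % check_every == 0 and quorate[t - 1]:
            last_ok = t
        if t - last_ok >= watchdog_timeout:
            return t
    return None


# In a 3-node cluster, 2 votes are enough and 1 vote is not.
print(has_quorum(2, 3), has_quorum(1, 3))

# Stable cluster: quorate the whole time, so the watchdog never expires.
stable = [True] * 120
print(first_fence_second(stable, check_every=10, watchdog_timeout=60))

# Flapping link: quorate 5 s, not quorate 5 s, repeating. The node is
# quorate half of the time, but never at the moments a refresh is tried.
flapping = ([True] * 5 + [False] * 5) * 12
print(first_fence_second(flapping, check_every=10, watchdog_timeout=60))
```

In the flapping case the model fences even though two nodes would be enough for quorum on paper, which matches what the logs showed: membership kept changing, so no node could keep its watchdog alive.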
 