Hello,
we have a 3-node PVE cluster (7.4-16) with dedicated cluster interfaces. The cluster interfaces are in an LACP bond, connected to two Cisco Nexus switches.
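For reference, the bond is set up roughly like this in /etc/network/interfaces (a sketch only; the interface names eno1/eno2 and the address are placeholders, not our exact values):
Code:
auto bond0
iface bond0 inet static
        address 10.10.10.12/24
        bond-slaves eno1 eno2
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer2+3
# dedicated cluster bond (LACP) to the two Nexus switches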
Over the weekend our entire HA cluster failed. According to the logs of the other servers, the cluster bond of server 2 was flapping the whole time. According to Corosync, the server repeatedly left the cluster briefly and then rejoined. The last messages from server 2 were:
Code:
watchdog-mux[1073]: Client watchdog expired - disable watchdog updates
watchdog-mux[1073]: Client watchdog expired - disable watchdog updates
watchdog-mux[1073]: leave watchdog-mux with active connections
pve-ha-crm[1867]: Unexpected error - cfs-lock 'domain-ha' error: lock request timed out
pve-ha-crm[1867]: Unexpected error - cfs-lock 'domain-ha' error: lock request timed out
pve-ha-crm[1867]: unexpected error - cfs-lock 'domain-ha' error: lock request timed out
kernel: watchdog: watchdog0: watchdog did not stop!
kernel: watchdog: watchdog0: watchdog did not stop!
corosync[1787]: [KNET ] link: host: 1 link: 0 failed
-- boot 4a50155a4009426e88eb43ccb670676c --
After server 2 rebooted (20:43:10), the other two servers also rebooted.
Last logs from server 1 (reboot at 20:48:26):
Code:
corosync[1801]: [KNET ] link: host: 2 link: 0 failed
corosync[1801]: [KNET ] link: host: 2 link: 0 is down
corosync[1801]: [KNET ] link: host: 2 link: 0 has failed
corosync[1801]: [KNET ] rx: host: 2 link: 0 is up
corosync[1801]: [KNET ] link: Reset MTU for link 0 because host 2 has joined
corosync[1801]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
watchdog-mux[1111]: Client watchdog expired - disable watchdog updates
-- boot 0a07ec692a6547d2b8f13e6d268a9277 --
Last logs from server 3 (reboot at 20:48:30):
Code:
corosync[1803]: [TOTEM ] Retransmit List: b c e f 12 13 16 17 1c 1d 1f 20 23 24 27
corosync[1803]: [TOTEM ] Retransmit List: b c e f 12 13 16 17 1c 1d 1f 20 23 24 27
corosync[1803]: [TOTEM ] Retransmit List: b c e f 12 13 16 17 1c 1d 1f 20 23 24 27
corosync[1803]: [KNET ] link: host: 2 link: 0 is down
corosync[1803]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
corosync[1803]: [KNET ] host: host: 2 has no active links
-- boot 93a4f04465f04303b286929154cf8df5 --
We replaced the server's SFPs and there are no more flaps. According to our network department, one SFP had died. The retransmission errors on server 3 are also gone.
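In case it helps anyone debugging similar flapping, this is roughly how the link state can be checked (assuming the cluster bond is named bond0):
Code:
# knet link state per host, as seen by the local corosync
corosync-cfgtool -s
# LACP / slave state of the bond
cat /proc/net/bonding/bond0
# packet and error counters on the bond
ip -s link show bond0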
Can anyone tell me why all three servers rebooted? According to the logs, only server 2 left the quorum, and with two nodes the cluster should still be quorate.
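My understanding of the quorum math (please correct me if I am wrong): with 3 votes, quorum is 3 / 2 + 1 = 2, so two remaining nodes should keep the cluster quorate. A sketch of how we check this (output omitted):
Code:
# expected votes, total votes and quorum of the cluster
pvecm status
# the same information directly from corosync
corosync-quorumtool -s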
I have already looked through about 10-20 forum posts, and none of them explains the cause to me.