Cluster appears borked after recent corosync update?

rec-zachary

New Member
Nov 22, 2024
2
0
1
Good morning everyone,

I run a 16-node testing Proxmox Cluster, so nothing horribly mission critical. After updating the hosts today and rebooting them, corosync fails to establish any connections across the cluster on all hosts, and is sending around 200 megabytes per second of traffic per host across the network (host to host traffic.) I am baffled as to what is going on. The core switch these are plugged into is reporting high rates of errors and collisions on several ports all connected to different hosts, which has all just started after this recent update.

1775490735237.png

This is what the journal shows. Some hosts error out with tens of thousands of rapidly firing errors, shown below:

1775490766963.png

The corosync configuration is standard as Proxmox would set it up, these were all clustered together via the GUI a couple years ago.

Please let me know if you have any ideas.



Zachary
 
I was able to resolve the issue just now specifically by stopping corosync on all hosts and starting it. Restarting the corosync service results in this behavior as does rebooting the hosts. I'll keep trying to find a cause.
 
How is your network configured? Do you have dedicated network links for corosync, storage and outside traffic?