Corosync Stability issues on large cluster

Mar 16, 2023
Hello

We have a cluster of more than 30 servers. Initially it did NOT have a dedicated Corosync network, as we felt the host network would be sufficient (2 x 10GbE interfaces each). However, we started having Corosync issues once we got to about 30 servers, so we decided to put in a dedicated Corosync network. We added 2 x 48-port 1GbE switches, cabled each host to both switches and set up Corosync to use both interfaces. In other words, we now have a dedicated Corosync network with 2 x 1GbE interfaces per host. This, however, doesn't seem to have made a difference.

We still end up with hosts that seem to "fall out" of the cluster. In other words, they show up as red in the web GUI, and if you log into one of those hosts directly it appears that they have established themselves in their own cluster.

I tried stopping the pve-cluster and corosync services on all hosts, then slowly starting them up one by one. As I start them up it all looks good, with hosts appearing in the cluster as expected, until we get to between 20 and 30 hosts running. Then it starts happening again.

Running Proxmox 7.4 with latest patches.

Any advice?
 
Personally, I never build clusters of more than 20 nodes.

It really depends on network latency and on the CPU frequency available to handle corosync traffic. (Bandwidth is not important, but you really need low latency and no network saturation.)

Also note that sometimes one bad node can impact the whole cluster (a bad NIC driver, a faulty cable, a node much older than the others). Do you have the same node model everywhere? The same CPU generation?

Do you have any corosync retransmit messages in /var/log/daemon.log?
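
If it helps to see when the retransmits cluster, here is a minimal sketch that counts them per hour. It assumes the classic syslog timestamp format ("Mar 16 10:00:01 ...") and that corosync logs these events with the text "Retransmit List"; both are assumptions about your log, so adjust the match string if yours looks different. The function name is made up for this example.

```python
from collections import Counter


def retransmits_per_hour(lines):
    """Count corosync retransmit lines, bucketed by 'Mon DD HH'.

    Assumes traditional syslog timestamps, where the first 9
    characters of a line are month, day and hour.
    """
    counts = Counter()
    for line in lines:
        # Assumed wording of the corosync message; adjust if needed.
        if "corosync" in line and "Retransmit List" in line:
            counts[line[:9]] += 1
    return counts
```

To use it on a node, pass it the open log file, for example `retransmits_per_hour(open("/var/log/daemon.log", errors="replace"))`, and print the resulting counts. Comparing the output across nodes may show whether one host stands out.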


Maybe you could try using SCTP instead of UDP. I had some retransmit errors in the past that were fixed by it.

/etc/pve/corosync.conf

totem {
  ...
  interface {
    knet_transport: sctp
    linknumber: 0
  }
  ...
}


(This needs a full restart of corosync on every node, so be careful to disable HA before making this change.)
 
