Proxmox Cluster of 21 nodes falls out of sync when a single node is added or removed

ThomasBlock
Hi. I have a cluster of 21 nodes, but it seems very unstable. Every time I add a node or reboot a node, somewhere between 2 and 10 other nodes either fall out of sync or instantly reboot as well. Sometimes the whole cluster loses quorum, sometimes just some nodes stay separated. The web interface of all nodes becomes unresponsive.
So if one node has a hardware failure, it cascades into other nodes shutting down as well. By now I have more problems with these shutdowns than with the failing hardware itself.
How can I narrow down the source of the error, and how can I fix it?

The setup: a mix of high-performance AI nodes (Ryzen CPUs) with 2.5 Gbit and Dell rack servers with slower Xeon CPUs but 10 Gbit. There is only one physical network for VM traffic, backups, replication and cluster communication. I know this is not optimal, but there are no other options at the moment. Every replication / backup task is limited to e.g. 1 Gbit, and when I look at the managed MikroTik switch, I see that the network is barely used at the times the cluster problems occur.
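For reference, the bandwidth caps are set roughly like this. Treat it as a sketch from memory, not a verified config: datacenter.cfg bwlimit values are in KiB/s, so about 122000 KiB/s corresponds to roughly 1 Gbit/s, and I am not sure every job type honors every key.

# /etc/pve/datacenter.cfg (sketch: KiB/s, roughly 1 Gbit/s per job type)
bwlimit: default=122000,migration=122000,restore=122000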

It also seems that there are no "random" cluster problems; as I said, this only happens when a node is added or removed.

In the broken state, the pve-cluster logs look like this:

Bildschirmfoto vom 2024-08-21 10-05-36.png

I have now found a procedure that reliably fixes the cluster, but I have to do it manually (a scripted sketch of the whole procedure follows after the list):
- open a terminal to every node
- check whether corosync sees enough members: journalctl -u corosync -f | grep "Sync members"
- on the broken nodes, run systemctl stop pve-cluster && systemctl stop corosync
- if the cluster is out of quorum, do the above on every node
- then, on every stopped node, run systemctl restart corosync && systemctl restart pve-cluster && journalctl -u corosync -f | grep "Sync members"
- wait for the first node to show the correct number of members (this can take a minute), then restart the next node
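For reference, the whole dance can be scripted roughly like this. This is only a sketch, not battle-tested: node names are placeholders, it assumes passwordless root SSH from a workstation to all nodes, and it assumes the corosync log line looks like "Sync members[21]: 1 2 3 ...".

#!/bin/bash
# Sketch of the manual recovery procedure above.
# Assumptions: root SSH to all nodes, corosync log lines of the form
# "... Sync members[21]: 1 2 3 ...".
EXPECTED=21
NODES="pve01 pve02 pve03"   # placeholder -- list all 21 node names here

for n in $NODES; do
    # grab the member count from the last "Sync members" line on this node
    count=$(ssh root@"$n" "journalctl -u corosync --no-pager | grep 'Sync members' | tail -n 1" \
        | grep -o 'Sync members\[[0-9]*\]' | grep -oE '[0-9]+')
    echo "$n: last Sync members count = ${count:-unknown}"

    if [ "${count:-0}" -ne "$EXPECTED" ]; then
        echo "$n looks out of sync, restarting corosync and pve-cluster"
        ssh root@"$n" "systemctl stop pve-cluster corosync; systemctl restart corosync; systemctl restart pve-cluster"
        # joining can take about a minute -- wait before touching the next node
        sleep 60
    fi
done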

In this example I only had to do this in the last terminal; then the cluster went from 19 nodes back to 21 nodes.

Bildschirmfoto vom 2024-08-21 10-11-49.png


My setup:

Manager version: not the same on every node, somewhere between 8.1 and 8.2 (see the quick check below).
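A quick way to see the mismatch across nodes (node names are placeholders, assumes root SSH):

for n in pve01 pve02 pve03; do
    # pveversion prints the pve-manager version of each node
    echo -n "$n: "; ssh root@"$n" pveversion
done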

I attach corosync.conf; I have two rings because some nodes have failover networking.
(I could switch to only one ring, but at this point I am too afraid to change the config.)
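To give an idea of the structure without pasting all 21 nodes, the layout looks roughly like this. Names, addresses and version numbers below are placeholders, abridged to two nodes; note that only some nodes have a ring1 address.

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  # placeholder node with both rings (main + failover network)
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1
    ring1_addr: 10.1.0.1
  }
  # placeholder node with only the main network
  node {
    name: pve02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.2
  }
  # ... remaining 19 nodes
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: mycluster
  config_version: 42
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}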
 

