Hi. I have a cluster of 21 nodes, but it seems very unstable. Every time I add one node or reboot one node, a random number of other nodes (between 2 and 10) will either fall out of sync or instantly reboot as well. Sometimes the whole cluster falls out of quorum, sometimes just some nodes stay separated. The web interface of all nodes becomes unresponsive.
So if one node has a hardware failure, this cascades to other nodes shutting down as well. I now have more problems with these shutdowns than with the failing hardware itself.
How can I narrow down the source of the error? How can I fix it?
The setup: a mix of high-performance AI nodes (Ryzen CPUs) with 2.5 Gbit and Dell rack servers with slower Xeon CPUs, but 10 Gbit. There is only one physical network for VM traffic, backups, replication and cluster communication. I know this is not optimal, but there are no other options at the moment. Every replication / backup task is limited to e.g. 1 Gbit, and when I look at the managed MikroTik switch I see that the network is barely used at the times the cluster problems occur.
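To rule the shared network in or out, I guess I can at least capture the quorum state and the latency on the cluster network while the problem is happening. This is just what I would run on one of the nodes; the target IP is a placeholder for another cluster node:

pvecm status                                          # quorum info, expected vs. actual votes, member list
ping -c 100 -i 0.2 <ip-of-another-node> | tail -n 2   # corosync cares more about latency/jitter than bandwidth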
It also seems that there are no "random" cluster problems - as I said, this only happens when a node is added or removed.
The broken state shows pve-cluster log entries like these:
I have now found a solution which reliably fixes the cluster - but I have to do it manually:
- open a terminal on every node
- check whether corosync sees enough members:
journalctl -u corosync -f | grep "Sync members"
- on the broken nodes, do
systemctl stop pve-cluster && systemctl stop corosync
- if the cluster is out of quorum, do the above on every node
- now run this on every stopped node:
systemctl restart corosync && systemctl restart pve-cluster && journalctl -u corosync -f | grep "Sync members"
- wait for the first node to show the correct number of members (this can take a minute), then restart the next node
In this example I only had to do this in the last terminal; the cluster then went from 19 to 21 nodes.
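Since the manual procedure is error-prone on 21 nodes, I am thinking about scripting the same steps from an admin host over SSH. This is only an untested sketch of what I described above - the node names are placeholders, and I would still check the member count by hand before letting it continue:

#!/bin/bash
# Untested sketch of the manual recovery steps above.
# NODES is a placeholder - fill in the broken nodes (or all nodes if quorum is lost).
NODES="node01 node02 node03"

# 1) stop the cluster filesystem and corosync on every affected node
for n in $NODES; do
    ssh root@"$n" "systemctl stop pve-cluster && systemctl stop corosync"
done

# 2) bring the nodes back one at a time
for n in $NODES; do
    ssh root@"$n" "systemctl restart corosync && systemctl restart pve-cluster"
    # re-run this until the 'Sync members' line shows the expected count, then continue
    ssh root@"$n" "journalctl -u corosync -n 50 | grep 'Sync members'"
    sleep 60
done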
My setup:
Manager version: not the same on every node - between 8.1 and 8.2
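For the versions, I would collect the exact package versions from each node like this (as far as I know, pveversion -v lists both pve-manager and corosync):

pveversion -v | grep -E 'pve-manager|corosync|libknet'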
I attach corosync.conf - I have two rings, as some nodes have failover networking.
(I could switch to only one ring, but at this point I am too afraid to change the config.)
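Regarding the two rings: if it helps with the debugging, I believe corosync itself can report the per-link status on each node:

corosync-cfgtool -s    # status of the local node's links (ring0 / ring1)
corosync-cfgtool -n    # which links are up towards each of the other nodes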