Cluster synchronization is abnormal

wang122132112232

New Member
May 27, 2025
My PVE cluster has 15 nodes. After adding a new node, the cluster shows all nodes as disconnected and synchronization is abnormal. When I log in to the nodes directly, the network on all of them is fine. After restarting the relevant synchronization services and rebooting 5 of the nodes, those 5 nodes synchronize normally again, but they show as abnormal once more after a while.
1752026811955.png
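For reference, restarting the synchronization-related services on a PVE node can be done roughly like this (assuming the standard pve-cluster and corosync units; other services were not touched):
systemctl restart pve-cluster corosync
systemctl status pve-cluster corosync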
 
To assess the situation, it would be helpful if you run these commands on one of the servers and post the output:
cat /etc/pve/corosync.conf
corosync-cfgtool -n
 
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: 10-125-24-33
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.125.24.33
  }
  node {
    name: 10-125-24-49
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.125.24.49
  }
  node {
    name: 10-125-24-51
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.125.24.51
  }
  node {
    name: 10-125-24-53
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.125.24.53
  }
  node {
    name: 10-125-24-54
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.125.24.54
  }
  node {
    name: 10-125-24-55
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.125.24.55
  }
  node {
    name: 10-125-24-84
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.125.24.84
  }
  node {
    name: 10-125-24-85
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.125.24.85
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: yewu1
  config_version: 15
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

corosync-cfgtool -n
Local node ID 1, transport knet
nodeid: 2 reachable
LINK: 0 udp (10.125.24.85->10.125.24.49) enabled connected mtu: 1397

nodeid: 3 reachable
LINK: 0 udp (10.125.24.85->10.125.24.84) enabled connected mtu: 1397

nodeid: 4 reachable
LINK: 0 udp (10.125.24.85->10.125.24.55) enabled connected mtu: 1397

nodeid: 5 reachable
LINK: 0 udp (10.125.24.85->10.125.24.54) enabled connected mtu: 1397

nodeid: 6 reachable
LINK: 0 udp (10.125.24.85->10.125.24.53) enabled connected mtu: 1397

nodeid: 7 reachable
LINK: 0 udp (10.125.24.85->10.125.24.51) enabled connected mtu: 1397

nodeid: 8 reachable
LINK: 0 udp (10.125.24.85->10.125.24.33) enabled connected mtu: 1397

Previously, I rebuilt the cluster: I created a new Proxmox Virtual Environment (PVE) cluster and added 15 nodes to it, with approximately 230 virtual machines in total. After all the nodes were added, I found that cluster synchronization was abnormal and extremely slow. Both checking cluster information and loading the login page were very laggy.
1752116061978.png
The current approach is to split it into two clusters, one of which has 8 nodes. At present, data synchronization in both clusters is normal and they are functioning properly.
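For anyone repeating this, splitting into two clusters follows the usual PVE workflow. As a rough sketch (the cluster name and join address below are only placeholders, and a node must not already belong to a cluster when it is added):
pvecm create yewu2
pvecm add 10.125.25.10
pvecm status
The first command runs on the initial node of the new cluster, the second on each further node (pointing at a node already in the new cluster), and the third verifies quorum and the member list afterwards.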
 
Add more network ports to your nodes and make sure to have two physical interfaces available just for Corosync. Otherwise, any workload on the VMs or the PVE nodes may disrupt Corosync communication. Having two links would also add redundancy in case of link failures.
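As a rough sketch of what a second link could look like in /etc/pve/corosync.conf (the 192.168.100.x addresses are placeholders for a dedicated Corosync subnet; ring0_addr would point at the dedicated network and ring1_addr keeps the existing network as a fallback, and config_version must be increased whenever the file is edited):
node {
  name: 10-125-24-85
  nodeid: 1
  quorum_votes: 1
  ring0_addr: 192.168.100.85
  ring1_addr: 10.125.24.85
}
totem {
  cluster_name: yewu1
  config_version: 16
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
Every node entry would get the same pair of addresses on its own subnets, and writing the change through /etc/pve/corosync.conf distributes it to the whole cluster.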