Cluster Issues

Oct 12, 2025
Trying to get a new cluster set up with 4 identical nodes. I currently have 3 of the nodes set up and working in a cluster. When I add the 4th node, it never fully connects and eventually causes problems, such as all four nodes losing connection. They go grey with the question mark or show a red X. As soon as I power off the 4th node, the other 3 start working correctly again.

When I run pvecm status, the 1st node's Ring ID changes. I've removed the 4th node, wiped it clean and re-added it with the same results.
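
For reference, this is roughly what I'm running on each node to watch the membership change while the 4th node joins (nothing here is specific to my hardware or node names):

    # quorum state, membership and the Ring ID that keeps changing
    pvecm status

    # corosync link status as seen from this node
    corosync-cfgtool -s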

I'm at a loss, any ideas?
 
Clue there ... you wiped it clean. And then you probably rejoined it with the same name ...

Did you delete /etc/pve/nodes/OLD-NODE-YOU-NUKED before rejoining the rebuilt machine?
Did you comment out the old ssh key in /etc/pve/priv/authorized_keys before rejoining the rebuilt machine?
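
If not, something along these lines run from a node that is still healthy in the cluster usually clears the stale identity ("pve4" here is just a stand-in for whatever the rebuilt node is called):

    # remove the stale node directory left behind by the old install
    rm -rf /etc/pve/nodes/pve4

    # then open the cluster-wide key file and comment out the old node's entry
    nano /etc/pve/priv/authorized_keys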

Also, are all the nodes running corosync on the same subnet?
I've had issues like that when I selected the wrong subnet for the cluster join.
 
Hi, stale node entries or mismatched SSH keys can definitely cause cluster sync chaos.

In addition, make sure the new node’s ring0_addr matches the existing subnet in /etc/pve/corosync.conf, and that /etc/hosts across all nodes correctly maps each node’s cluster IP. Any mismatch there will break quorum when the 4th joins.
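
As a rough illustration (the name and addresses below are made up, not taken from your setup), the corosync nodelist entry and /etc/hosts have to agree on every node:

    # /etc/pve/corosync.conf -- nodelist entry for the joining node
    node {
      name: pve4
      nodeid: 4
      quorum_votes: 1
      ring0_addr: 192.168.10.14
    }

    # /etc/hosts -- the same name resolving to the same cluster IP on every node
    192.168.10.14  pve4.example.local  pve4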
 
So I have wiped all the nodes and placed them on a switch all by themselves. The first three nodes were working great in the cluster; as soon as I added the 4th, it all broke. I've tested all of the network cards with no issues, and I've replaced the networking cables as well.
 
What network hardware are you using? Do you have dedicated networks for cluster communication and Ceph/ZFS replication (if you happen to use either of them)?
 
After you wiped all the nodes and reinstalled, is this 4th (fatal) node the same physical server as the 4th node before the wipe?
What if you change the order in which you add the servers?
I mean: does the same server cause the problem, or does the 4th one cause it no matter which piece of hardware is "4th"?
 
These are Dell PowerEdge servers, all with the same hardware. Yes, it all breaks when that same 4th node is added. So I just wiped nodes 2, 3, and 4 and created a cluster with those 3; from the CLI it appears to be working when I run pvecm status, but I cannot get to the web GUI from any of the 3 nodes.
 
Yeah — that behavior pretty much screams Corosync identity or network conflict, not hardware.

When three nodes work fine but adding a fourth causes the whole cluster to fall apart, it's usually one of these:

1. Duplicate nodeid or ring0_addr entries in /etc/pve/corosync.conf (worth double-checking on every node).
2. Stale /etc/pve/nodes/<oldnode> directories or SSH keys left from earlier attempts (quick checks for points 1 and 2 are sketched below).
3. Multicast instability on the switch — try setting transport: udpu to use unicast and restart Corosync on all nodes.

If switching to unicast stabilizes it, you’ve found the culprit — some Dell onboard NICs and smart switches just don’t play nice with multicast clustering traffic.
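
For points 1 and 2, a quick way to check on any node (nothing below is specific to this cluster):

    # repeated nodeid or ring0_addr values stand out here
    grep -E 'nodeid|ring0_addr' /etc/pve/corosync.conf

    # directories for nodes that no longer exist are stale leftovers
    ls /etc/pve/nodes/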
 
  • Like
Reactions: Johannes S
Thanks for the information about the transport. I had been researching and found that this might be the issue. However, I'm having trouble getting it switched over to udpu. After a fresh install, how can I make this change?
 
Unless you set it manually, or this is a cluster that has been upgraded since ancient times, your PVE is using unicast. PVE has not used multicast since PVE 6.x, when Corosync 3.x was introduced with unicast kronosnet.

Post your /etc/pve/corosync.conf from each node and make 100% sure its contents match /etc/corosync/corosync.conf on every node.

Also, check the service logs with journalctl -u corosync.service both during normal operation and during/after adding that fourth node to the cluster. Wondering why you didn't start debugging from there ;)
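
For example, something along these lines on each node (no site-specific names involved):

    # the local copy should have the same content as the cluster-wide one
    diff /etc/pve/corosync.conf /etc/corosync/corosync.conf

    # follow corosync live while the fourth node joins
    journalctl -u corosync.service -f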
 
I was able to set transport: udpu, but I also needed to set crypto_cipher and crypto_hash to none in order for corosync to start back up. By making these changes to the corosync.conf file, I was able to get the 3 nodes connected. I'm currently moving VMs off of the 4th node, and once finished, I will add it to the cluster. Hopefully this works; I will post results.
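
In case it helps anyone later, the totem section ended up looking roughly like this; the cluster name and the config_version value are placeholders for whatever is already in the file, and config_version has to be incremented by one on every manual edit:

    totem {
      # keep your existing cluster_name; "mycluster" is a placeholder
      cluster_name: mycluster
      # increment the existing value by one on every edit; "4" is a placeholder
      config_version: 4
      crypto_cipher: none
      crypto_hash: none
      ip_version: ipv4
      transport: udpu
      version: 2
    }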
 
After wiping the 4th node, reinstalling, and then adding it to the cluster with the transport: udpu setting already in place, it worked great. No issues at all and now my 4 node cluster is working. So it turns out the issue was the switch not being able to handle multicast traffic.

Thank you to all who provided me assistance.