Cluster Issues

Oct 12, 2025
Trying to get a new cluster set up with 4 identical nodes. I currently have 3 of the nodes set up and working in a cluster. When I add the 4th node, it never fully connects and eventually creates issues, such as all four nodes losing connection. They go grey with the question mark or a red X. As soon as I power off the 4th node, the other 3 start working correctly again.

When I run pvecm status, the 1st node's Ring ID changes. I've removed the 4th node, wiped it clean and re-added it with the same results.

I'm at a loss, any ideas?
 
Clue there ... you wiped it clean. And then you probably rejoined it with the same name ...

Did you delete /etc/pve/nodes/OLD-NODE-YOU-NUKED before rejoining the rebuilt machine?
Did you comment out the old ssh key in /etc/pve/priv/authorized_keys before rejoining the rebuilt machine?
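For reference, the cleanup before rejoining might look something like this, run on a surviving cluster member. (`node4` here is a placeholder; substitute the hostname of the machine you wiped.)

```shell
# On an existing cluster node, BEFORE rejoining the rebuilt machine.
# 'node4' is a placeholder for the old hostname of the node you wiped.

# Remove the stale node from the cluster membership, if it's still listed
pvecm delnode node4

# Delete the leftover node directory from the cluster filesystem
rm -rf /etc/pve/nodes/node4

# Drop the old node's SSH key so the rebuilt node's new key doesn't conflict
sed -i '/node4/d' /etc/pve/priv/authorized_keys
ssh-keygen -R node4
```

These steps are destructive, so double-check the hostname before running them.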

Also, are all the nodes running corosync on the same subnet?
I've had issues like that when I selected the wrong subnet for the cluster join.
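A quick way to check which subnet corosync is actually using on each node:

```shell
# Show corosync's local link status and bound addresses on this node
corosync-cfgtool -s

# Inspect the ring addresses configured for each node in the cluster
grep -E 'name:|ring0_addr:' /etc/pve/corosync.conf

# Confirm quorum and current membership
pvecm status
```

If the `ring0_addr` values aren't all on the same subnet, that's your culprit.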
 
Hi, stale node entries or mismatched SSH keys can definitely cause cluster sync chaos.

In addition, make sure the new node’s ring0_addr matches the existing subnet in /etc/pve/corosync.conf, and that /etc/hosts across all nodes correctly maps each node’s cluster IP. Any mismatch there will break quorum when the 4th joins.
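As a rough sketch, the nodelist entries in /etc/pve/corosync.conf should all sit on the same subnet (the names and addresses below are made up for illustration):

```
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.10.11
  }
  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.10.14    # must be on the same subnet as the others
  }
}
```

And each node's /etc/hosts should resolve every cluster hostname to its ring0 address, consistently on all four machines.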
 
So I have wiped all the nodes and placed them on a switch, all by themselves. The first three nodes were working great in the cluster; as soon as I added the 4th, it all broke. I've tested all of the network cards with no issues, and I've replaced the networking cables as well.
 
What network hardware are you using? Do you have dedicated networks for cluster communication and Ceph/ZFS replication (if you use either)?
 
After you wiped all the nodes and reinstalled, is this 4th (fatal) node the same physical server as the 4th node before the wipe?
What if you change the order in which you add the servers?
I mean: is it the same server causing the problem, or whichever one joins 4th, no matter which hardware is "4th"?
 
These are Dell PowerEdge servers, all with the same hardware. Yes, when that same 4th node is added, it all breaks. So I just wiped nodes 2, 3, and 4 and created a cluster with those 3. From the CLI it appears to be working when I run pvecm status, but I cannot get to the web GUI from any of the 3 nodes.