Cluster Issues

Oct 12, 2025
Trying to get a new cluster set up with 4 identical nodes. I currently have 3 of the nodes set up and working in a cluster. When I add the 4th node, it never fully connects and eventually causes problems, such as all four nodes losing connection. They go grey with the question mark or show a red X. As soon as I power off the 4th node, the other 3 start working correctly again.

When I run pvecm status, the 1st node's Ring ID changes. I've removed the 4th node, wiped it clean and re-added it with the same results.
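
For reference, this is roughly what I'm running on each node to watch the membership change while the 4th node joins (nothing here is specific to my hardware or node names):

    # quorum state, membership and the Ring ID that keeps changing
    pvecm status

    # corosync link status as seen from this node
    corosync-cfgtool -s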

I'm at a loss, any ideas?
 
Clue there ... you wiped it clean. And then you probably rejoined it with the same name ...

Did you delete /etc/pve/nodes/OLD-NODE-YOU-NUKED before rejoining the rebuilt machine?
Did you comment out the old ssh key in /etc/pve/priv/authorized_keys before rejoining the rebuilt machine?
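
If not, something along these lines run from a node that is still healthy in the cluster usually clears the stale identity ("pve4" here is just a stand-in for whatever the rebuilt node is called):

    # remove the stale node directory left behind by the old install
    rm -rf /etc/pve/nodes/pve4

    # then open the cluster-wide key file and comment out the old node's entry
    nano /etc/pve/priv/authorized_keys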

Also, are all the nodes running corosync on the same subnet?
I've had issues like that when I selected the wrong subnet for the cluster join.
 
Hi, stale node entries or mismatched SSH keys can definitely cause cluster sync chaos.

In addition, make sure the new node’s ring0_addr matches the existing subnet in /etc/pve/corosync.conf, and that /etc/hosts across all nodes correctly maps each node’s cluster IP. Any mismatch there will break quorum when the 4th joins.
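
As a rough illustration (the name and addresses below are made up, not taken from your setup), the corosync nodelist entry and /etc/hosts have to agree on every node:

    # /etc/pve/corosync.conf -- nodelist entry for the joining node
    node {
      name: pve4
      nodeid: 4
      quorum_votes: 1
      ring0_addr: 192.168.10.14
    }

    # /etc/hosts -- the same name resolving to the same cluster IP on every node
    192.168.10.14  pve4.example.local  pve4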
 
So I have wiped all the nodes and placed them on a switch all by themselves. The first three nodes were working great in the cluster; as soon as I added the 4th, it all broke. I've tested all of the network cards with no issues, and I've replaced the networking cables as well.
 
What network hardware are you using? Do you have dedicated networks for cluster communication and Ceph/ZFS replication (if you happen to use either of them)?
 
After you wiped all the nodes and reinstalled, is this 4th (fatal) node the same physical server as the 4th node before the wipe?
What if you change the order in which you add the servers?
I mean: does the same server cause the problem, or does the 4th one cause it no matter which piece of hardware is "4th"?
 
These are Dell PowerEdge servers, all with the same hardware. Yes, it all breaks when that same 4th node is added. So I just wiped nodes 2, 3, and 4 and created a cluster with those 3; from the CLI it appears to be working when I run pvecm status, but I cannot get to the web GUI from any of the 3 nodes.
 
Yeah — that behavior pretty much screams Corosync identity or network conflict, not hardware.

When three nodes work fine but adding a fourth causes the whole cluster to fall apart, it's usually one of these:

1. Duplicate nodeid or ring0_addr entries in /etc/pve/corosync.conf (worth double-checking on every node).
2. Stale /etc/pve/nodes/<oldnode> directories or SSH keys left from earlier attempts (quick checks for points 1 and 2 are sketched below).
3. Multicast instability on the switch — try setting transport: udpu to use unicast and restart Corosync on all nodes.

If switching to unicast stabilizes it, you’ve found the culprit — some Dell onboard NICs and smart switches just don’t play nice with multicast clustering traffic.
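
For points 1 and 2, a quick way to check on any node (nothing below is specific to this cluster):

    # repeated nodeid or ring0_addr values stand out here
    grep -E 'nodeid|ring0_addr' /etc/pve/corosync.conf

    # directories for nodes that no longer exist are stale leftovers
    ls /etc/pve/nodes/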
 
  • Like
Reactions: Johannes S
Thanks for the information about the transport. I had been researching and found that this might be the issue. However, I'm having trouble getting it switched over to udpu. After a fresh install, how can I make this change?
 
Unless you set it manually, or this is a cluster that has been upgraded since ancient times, your PVE is using unicast. PVE has not used multicast since PVE 6.x, when Corosync 3.x was introduced with unicast kronosnet.

Post your /etc/pve/corosync.conf from each node and make 100% sure its contents match /etc/corosync/corosync.conf on every node.

Also, check the service logs with journalctl -u corosync.service both during normal operation and during/after adding that fourth node to the cluster. Wondering why you didn't start debugging from there ;)
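
For example, something along these lines on each node (no site-specific names involved):

    # the local copy should have the same content as the cluster-wide one
    diff /etc/pve/corosync.conf /etc/corosync/corosync.conf

    # follow corosync live while the fourth node joins
    journalctl -u corosync.service -f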
 
I was able to set transport: udpu, but I also needed to set crypto_cipher and crypto_hash to none in order for corosync to start back up. By making these changes to the corosync.conf file, I was able to get the 3 nodes connected. I'm currently moving VMs off of the 4th node, and once finished, I will add it to the cluster. Hopefully this works; I will post results.
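
In case it helps anyone later, the totem section ended up looking roughly like this; the cluster name and the config_version value are placeholders for whatever is already in the file, and config_version has to be incremented by one on every manual edit:

    totem {
      # keep your existing cluster_name; "mycluster" is a placeholder
      cluster_name: mycluster
      # increment the existing value by one on every edit; "4" is a placeholder
      config_version: 4
      crypto_cipher: none
      crypto_hash: none
      ip_version: ipv4
      transport: udpu
      version: 2
    }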
 
After wiping the 4th node, reinstalling, and then adding it to the cluster with the transport: udpu setting already in place, it worked great. No issues at all and now my 4 node cluster is working. So it turns out the issue was the switch not being able to handle multicast traffic.

Thank you to all who provided me assistance.