Joining new nodes simultaneously destabilizes the cluster and reverts hosts back to standalone

josephnp

New Member
Apr 6, 2026
Hello all,

I'm running into an odd issue when joining new nodes simultaneously to my existing cluster. Doing so seems to destabilize the cluster and revert all the hosts back to standalone machines. Has anyone run into this before and figured out why it happens?

Observed behavior

========== Phase 1 — transient healthy cluster ==========

Cluster briefly forms with all nodes:

Members[5]: 1 2 3 4 5
node has quorum
pmxcfs: starting data synchronisation

========== Phase 2 — corosync instability ==========

Within seconds:

Token loss:

Token has not been received

Messaging failures:

cpg_send_message failed: CS_ERR_TRY_AGAIN

Rapid membership changes

========== Phase 3 — link / membership flapping ==========

Across nodes:

Links dropping:

host X link: down

Repeated membership churn:

Members joined: 1 2 3
Members left: 1 2 3
Failed to receive the leave message
Retransmit List

========== Phase 4 — split / inconsistent state ==========

Nodes disagree on membership:

ignore sync request from wrong member
remove message from non-member

pmxcfs queue buildup and retries:

cpg_send_message retried
dfsm_deliver_queue growing

========== Phase 5 — quorum loss and pmxcfs failure ==========

Cluster partitions:

Members[2]: 4 5
Members left: 1 2 3
node lost quorum

pmxcfs failure:

received write while not quorate
leaving CPG group
quorum_initialize failed
cmap_initialize failed
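As a side note on the `Members[2]: 4 5` line: the quorum loss in Phase 5 follows directly from the vote math. A minimal sketch, assuming the default of one vote per node:

```shell
# Quorum arithmetic for the Phase 5 split (5 expected votes):
# a partition needs floor(votes/2) + 1 votes to stay quorate.
VOTES=5
QUORUM=$(( VOTES / 2 + 1 ))
echo "quorum threshold: $QUORUM votes"

# The partition holding only nodes 4 and 5 has 2 votes, short of 3,
# so pmxcfs on those nodes refuses writes ("received write while not quorate").
PARTITION=2
if [ "$PARTITION" -lt "$QUORUM" ]; then
    echo "partition of $PARTITION nodes is inquorate"
fi
```

This is why both sides of the split end up read-only: neither the {1,2,3} side nor the {4,5} side could hold 3 stable votes once membership started flapping.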
 
/etc/pve actually was deleted somehow during this fiasco, which is why the cluster reverted to standalone nodes. I'm trying to figure out how that happens.

root@host1:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.60.116.92 host1.com host1

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
root@host1:~#
root@host1:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

# Bond member physical interfaces
auto ens3f6
iface ens3f6 inet manual

auto ens3f7
iface ens3f7 inet manual

auto ens3f0
iface ens3f0 inet manual

auto ens3f1
iface ens3f1 inet manual

auto bond0
iface bond0 inet manual
bond-slaves ens3f6 ens3f7
bond-mode active-backup
bond-miimon 100
bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet manual
bridge-ports bond0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-1000

auto vmbr0.616
iface vmbr0.616 inet static
address 10.6.16.92/24
gateway 10.6.16.1

auto bond1
iface bond1 inet manual
bond-slaves ens3f0 ens3f1
bond-mode active-backup
bond-miimon 100
bond-xmit-hash-policy layer3+4

auto vmbr1
iface vmbr1 inet manual
bridge-ports bond1
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-1000
 
/etc/pve actually was deleted somehow
You no longer have a cluster, or at least not a functioning one. Before proceeding to troubleshoot adding a node, do you want to rescue the existing cluster, or start from scratch? If you want to rescue the existing cluster, you need to bring up enough existing members for quorum (which will also repopulate /etc/pve).

Are any cluster members alive and available?
 
Thanks for the reply. This happened last week and we had to recreate the cluster from scratch. I'm interested in understanding how joining 2 new nodes simultaneously ended up destabilizing the cluster and reverting the hosts back to standalone machines via the deletion of /etc/pve. How does that even happen? I looked at the logs but couldn't figure it out.
 
I'm interested in understanding how joining 2 new nodes simultaneously ended up destabilizing the cluster and reverting the hosts back to standalone machines via the deletion of /etc/pve. How does that even happen?
/etc/pve isn't a normal filesystem. It's a special clustered filesystem (pmxcfs) that is kept in a database and gets distributed and synchronized across the cluster in real time. You can read about it here: https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)
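For context, that database normally lives at /var/lib/pve-cluster/config.db (SQLite), so even when /etc/pve is unmounted the data usually still exists on disk. A read-only inspection sketch; the path is the stock location and the `tree` table name reflects my understanding of the pmxcfs schema, so treat both as assumptions and check your system:

```shell
# Inspect the pmxcfs backing store directly (read-only) when /etc/pve
# is not mounted. The 'tree' table holds the file hierarchy.
DB=/var/lib/pve-cluster/config.db
if [ -f "$DB" ]; then
    sqlite3 "$DB" 'SELECT name FROM tree LIMIT 10;'
else
    echo "no pmxcfs database at $DB"
fi
```

If that file is intact on a former member, the cluster state is recoverable; "reverting to standalone" usually means the mount is gone, not the data.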

/etc/pve doesn't have to be deleted to NOT BE MOUNTED: a cluster node out of quorum wouldn't be allowed to mount it. It is possible to force-mount it in local mode, like so:
Code:
# stop the cluster filesystem service and corosync first
systemctl stop pve-cluster
systemctl stop corosync
# mount /etc/pve in local (standalone) mode
pmxcfs -l

Naturally this can only work on a node that has already been joined. Now to your next question: why or how can this happen? In most cases it is due to network problems or misconfiguration, which is why I asked you to post what I did. For your specific configuration, I would advise NOT using any vmbrs for corosync traffic; pick two interfaces (any two NICs, not bonds) and give them their own SEPARATE subnets from the rest of your networks, and use those explicitly as ring0 and ring1. And before you ask: yes, those interfaces can ALSO participate in bonds besides; just keep everything VLAN'ed off.
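A sketch of what that could look like; the interface names and the 10.99.0.x/10.99.1.x subnets are hypothetical placeholders, pick whatever fits your environment:

```
# /etc/network/interfaces -- dedicated ring interfaces,
# no bridges or bonds in the corosync path
auto ens3f2
iface ens3f2 inet static
	address 10.99.0.1/24    # ring0, its own subnet

auto ens3f3
iface ens3f3 inet static
	address 10.99.1.1/24    # ring1, another separate subnet

# corresponding node entry in /etc/pve/corosync.conf
node {
    name: host1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.99.0.1
    ring1_addr: 10.99.1.1
}
```

The point of the separate subnets is that corosync traffic can never be rerouted through the busy vmbr/bond path, and a problem on one ring leaves the other untouched.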

Pay very close attention to MTU: each ring interface must match MTU with all other members of that same ring in the cluster. If you are commingling the interfaces with the bonds, make sure the bonds are not using a LARGER MTU than the interface is.
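One way to sanity-check an MTU end-to-end is a Don't-Fragment ping sized to the interface MTU. A sketch; the peer address 10.99.0.2 and the MTU value are placeholders:

```shell
# Largest unfragmented ICMP payload for a given ring MTU:
# payload = MTU - 20 (IP header) - 8 (ICMP header)
MTU=1500
PAYLOAD=$(( MTU - 28 ))
echo "test with: ping -M do -s $PAYLOAD -c 3 10.99.0.2"
# -M do sets Don't-Fragment, so an MTU mismatch anywhere on the path
# shows up as 'message too long' instead of silent fragmentation.
```

If the ping at that size fails while a smaller one succeeds, some hop (often a bond or switch port) is carrying a smaller MTU than the ring interface claims.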

Before you actually add a node, make sure you have complete confidence (ping, iperf) that the new node has perfect connectivity with the rest of the cluster on each ring. This should be relatively simple to verify, but it is critical. Follow these rules and you should be clustering away happily.
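A minimal pre-join check loop along those lines; the ring0 peer addresses are hypothetical, and iperf3 must already be listening on each peer (`iperf3 -s`):

```shell
# Ping and bandwidth-test every existing ring0 member before joining.
# Failures are reported rather than aborting, so you get the full picture.
PEERS="10.99.0.1 10.99.0.2 10.99.0.3"
for peer in $PEERS; do
    ping -c 3 -W 1 "$peer" > /dev/null 2>&1 || echo "ring0 peer $peer: ping failed"
    iperf3 -c "$peer" -t 5 > /dev/null 2>&1 || echo "ring0 peer $peer: iperf3 failed"
done
```

Run the same loop against the ring1 addresses; only join once both rings come back clean from the new node's point of view.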
 
It's not a great idea to run corosync over an active-backup bond.
see: https://pve.proxmox.com/wiki/Cluster_Manager

At minimum, configure bond-primary to be sure the correct interface is up by default, and verify that all active interfaces are on the same switch.
And if only the NIC of one node goes down and fails over to the second NIC on another switch, it needs to be able to communicate with the other servers' NICs on the other switch.
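For the bond-primary suggestion, a sketch against the bond0 stanza posted earlier (interface names taken from that config; which NIC should be primary depends on your cabling):

```
auto bond0
iface bond0 inet manual
	bond-slaves ens3f6 ens3f7
	bond-mode active-backup
	bond-primary ens3f6    # deterministic: ens3f6 stays active unless it fails
	bond-miimon 100
```

With bond-primary set the same way on every node, all active interfaces land on the same switch by default, which avoids the cross-switch failover asymmetry described above.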