Cluster reboot after adding a new node

Hi all,

I'm running a PVE cluster with 18 nodes and about 750 VMs, which had been running stably for several months.
Last week I added a 19th node. The node joined the cluster, and about 60 seconds later watchdog-mux triggered a simultaneous restart of all 18 existing nodes (the new node remained online).

I can't figure out what happened based on the journal logs. Does anyone have any idea where/how I can determine the root cause of the reboot?
Attached to this thread are log/version files from three nodes in the cluster.

Thanks in advance for any guidance on how to track this down!


For context:
- The cluster is stretched across two local datacenters.
- Both DCs are interconnected via redundant 10 Gbit/s dark fiber (the DWDM link latency is negligible).
- pve-node-101/105 are in DC1 and pve-node-203 is in DC2.
- The 19th node "pve-node-199" was added (via the GUI) to the existing cluster through node "pve-node-101".
 


Hello,

It can happen when the nodes simultaneously lose quorum or stall their cluster stack long enough that the watchdog’s timeout expires.
Try to check (a few example commands for verifying some of these follow after the list):
- Jumbo frames (e.g. MTU 9000) vs. standard (1500): an MTU mismatch can silently drop packets
- Bonding misconfiguration: LACP or failover modes not synced consistently across nodes
- NIC driver quirks: some drivers (e.g. ixgbe, i40e) have watchdog resets or offload bugs
- Routing asymmetry: packets taking different paths in each direction, breaking Corosync's logic
- Firewall interference: UDP ports or multicast being blocked intermittently
- Time sync issues: NTP jumps can cause token timeouts
- DWDM jitter or failover: even brief latency spikes can cause token loss in stretched clusters
- Corosync config inconsistency: duplicate nodeid, ring address collision, wrong bindnetaddr, or a mistyped nodelist
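
For reference, here are a few generic commands to rule out the network and config side. They assume a current PVE setup with knet-based corosync; the peer address is a placeholder you need to adjust:

Code:
# Corosync link and quorum state on each node
corosync-cfgtool -s
pvecm status

# The corosync config should be identical everywhere
md5sum /etc/corosync/corosync.conf /etc/pve/corosync.conf

# Verify a full 1500-byte frame passes without fragmentation
# (1472 bytes payload + 28 bytes ICMP/IP headers = 1500)
ping -M do -s 1472 -c 5 <corosync-ip-of-peer-node>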
 
I’ve checked the network side thoroughly and couldn’t find any issues. Based on what I’ve seen, I suspect this is more likely a Proxmox (pmxcfs) bug than a networking problem.

What I observed
  • Network
    • All nodes are connected via L2 (no routing, no firewall in between).
    • Even nodes on the same switch rebooted. In my understanding, they should still have seen each other and at most gone read-only (split-brain protection) instead of rebooting.
  • DWDM / Failover
    • Switch logs don’t show any failovers. The DC with the majority (11 votes) should have stayed up.
  • Packets / Protocols
    • No packet drops visible.
    • STP and storm control weren’t triggered. Metrics show 0 dropped/errored packets.
  • MTU
    • Set to 1500 everywhere.
  • Time sync
    • All nodes use the same set of local NTP servers. Monitoring didn’t report any drift.
    • I can’t say for sure that the new node had the correct time (a quick way to check this is sketched after this list).
  • Corosync
    • About 50 seconds before watchdog-mux reported “client watchdog expired”, corosync logged the join of the 19th member and “ready to provide service”.
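
For the open time-sync question, something like the following would let me check all nodes in one go (assuming chrony is in use, as on current PVE releases, and SSH access between the nodes; node names need adjusting):

Code:
# Compare clock sync state on every node
for n in pve-node-101 pve-node-105 pve-node-199 pve-node-203; do
    echo "== $n =="
    ssh "$n" 'timedatectl show -p NTPSynchronized; chronyc tracking | grep -E "System time|Leap status"'
done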

My theory
Even if this had been a pure network issue, I would have expected all nodes to stay online and not reboot, with just a small delay (~3s).

What looks more likely to me is that pmxcfs hit a bug: a new node joining (maybe with incorrect time) caused pmxcfs to freeze/lock. With /etc/pve/priv/lock frozen, pve-ha-lrm stopped sending heartbeats, watchdog-mux fired, and the nodes rebooted.
That would also explain why the new node didn’t reboot — its pve-ha-lrm was still idle and didn’t hold a ha_agent_lock.
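
If that theory holds, the only workaround I can think of is to take the HA stack out of the loop while joining a node, so a pmxcfs stall cannot escalate into fencing. A rough sketch, assuming no HA-managed guest needs to fail over during that window:

Code:
# On every node: stop the LRM first, then the CRM, so the watchdog is
# released cleanly and nothing gets fenced while HA is stopped
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# ... join the new node ...

# Re-enable HA afterwards
systemctl start pve-ha-crm
systemctl start pve-ha-lrm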

Problem
Unfortunately, pmxcfs logs are very sparse and hard to interpret, which makes this difficult to prove.
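
One thing I still want to try: pmxcfs has a debug switch (mentioned in the pmxcfs man page) that makes its logging much more verbose. It is noisy, so it should only be enabled temporarily:

Code:
# Enable verbose pmxcfs debug logging
echo "1" > /etc/pve/.debug

# ... reproduce / collect logs ...

# Disable it again
echo "0" > /etc/pve/.debug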
 
Quick update on my findings so far:

On all 18 nodes, the same pattern occurs when the 19th node comes online:
  1. The 19th corosync member comes online.
  2. pmxcfs logs a series of cpg_send_message retries (retry counts range from 10 to 60).
  3. pmxcfs starts data synchronization.
  4. Corosync announces it’s ready to provide service.
  5. pmxcfs receives sync requests.
  6. Corosync logs retransmit lists.
  7. The node reboots.
Does pmxcfs freeze while trying to synchronize? If so, is there a timeout for that process?
Could this behavior be caused by packet loss, or perhaps by invalid time synchronization between the nodes?
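
To line this sequence up against the watchdog on each node, something like the following puts all involved units into one timeline (pmxcfs logs under the pve-cluster unit; the date placeholder needs to be replaced with the actual incident date):

Code:
# Collect the incident window from all involved services in one view
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux \
    --since "YYYY-MM-DD 14:47:00" --until "YYYY-MM-DD 14:50:00" -o short-precise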

If I can’t figure this out, I won’t be able to trust the cluster anymore and may have to split it into smaller 5-node clusters to reduce the risk of a full infrastructure outage over what seems like a small Proxmox issue.


Here are the relevant logs:
Code:
Sep 03 14:47:59 pve-node-203 corosync[5442]:   [KNET  ] link: Resetting MTU for link 0 because host 19 joined
Sep 03 14:47:59 pve-node-203 corosync[5442]:   [KNET  ] host: host: 19 (passive) best link: 0 (pri: 1)
Sep 03 14:47:59 pve-node-203 corosync[5442]:   [QUORUM] Sync members[19]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Sep 03 14:47:59 pve-node-203 corosync[5442]:   [QUORUM] Sync joined[1]: 19
Sep 03 14:47:59 pve-node-203 corosync[5442]:   [TOTEM ] A new membership (1.3f2) was formed. Members joined: 19
Sep 03 14:47:59 pve-node-203 corosync[5442]:   [KNET  ] pmtud: PMTUD link change for host: 19 link: 0 from 469 to 1397
Sep 03 14:47:59 pve-node-203 corosync[5442]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Sep 03 14:48:02 pve-node-203 pmxcfs[1967]: [status] notice: cpg_send_message retry 10
Sep 03 14:48:03 pve-node-203 pmxcfs[1967]: [dcdb] notice: cpg_send_message retry 10
Sep 03 14:48:03 pve-node-203 pmxcfs[1967]: [status] notice: cpg_send_message retry 20
Sep 03 14:48:04 pve-node-203 pmxcfs[1967]: [dcdb] notice: cpg_send_message retry 20
Sep 03 14:48:04 pve-node-203 pmxcfs[1967]: [status] notice: cpg_send_message retry 30
Sep 03 14:48:05 pve-node-203 pmxcfs[1967]: [dcdb] notice: cpg_send_message retry 30
Sep 03 14:48:05 pve-node-203 pmxcfs[1967]: [status] notice: cpg_send_message retry 40
Sep 03 14:48:05 pve-node-203 pmxcfs[1967]: [dcdb] notice: members: 1/3086, 2/3127, 3/11556, 4/1827, 5/1817, 6/2009, 7/1967, 8/437963, 9/437198, 10/10591, 11/1807, 12/8637, 13/9043, 14/10681, 15/10566, 16/10572, 17/2923, 18/10095, 19/4083
Sep 03 14:48:05 pve-node-203 pmxcfs[1967]: [dcdb] notice: starting data syncronisation
Sep 03 14:48:05 pve-node-203 pmxcfs[1967]: [status] notice: members: 1/3086, 2/3127, 3/11556, 4/1827, 5/1817, 6/2009, 7/1967, 8/437963, 9/437198, 10/10591, 11/1807, 12/8637, 13/9043, 14/10681, 15/10566, 16/10572, 17/2923, 18/10095, 19/4083
Sep 03 14:48:05 pve-node-203 pmxcfs[1967]: [status] notice: starting data syncronisation
Sep 03 14:48:05 pve-node-203 corosync[5442]:   [QUORUM] Members[19]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Sep 03 14:48:05 pve-node-203 corosync[5442]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 03 14:48:05 pve-node-203 pmxcfs[1967]: [status] notice: cpg_send_message retried 45 times
Sep 03 14:48:05 pve-node-203 pmxcfs[1967]: [dcdb] notice: cpg_send_message retried 37 times
Sep 03 14:48:05 pve-node-203 pmxcfs[1967]: [dcdb] notice: received sync request (epoch 1/3086/00000042)
Sep 03 14:48:06 pve-node-203 pmxcfs[1967]: [status] notice: received sync request (epoch 1/3086/00000022)
Sep 03 14:48:07 pve-node-203 pvestatd[5623]: status update time (6.570 seconds)
Sep 03 14:48:19 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98
Sep 03 14:48:23 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98      
Sep 03 14:48:26 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98 e5
Sep 03 14:48:29 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98 ed
Sep 03 14:48:33 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98
Sep 03 14:48:36 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98
Sep 03 14:48:40 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98
Sep 03 14:48:43 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98
Sep 03 14:48:46 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98
Sep 03 14:48:50 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98
Sep 03 14:48:53 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98
Sep 03 14:48:56 pve-node-203 corosync[5442]:   [TOTEM ] Retransmit List: 95 96 97 98
Sep 03 14:48:58 pve-node-203 watchdog-mux[1082]: client watchdog expired - disable watchdog updates                                
-- Boot a02ad8afd9e9413e942e38c59806ffc0 --
 