[TOTEM ] Retransmit List ... causing entire HA cluster to reboot unexpectedly.

Jun 6, 2025
After our new cluster deployment was working well, we encountered an issue on two separate occasions over a two-week period: the entire cluster rebooted randomly without notifying us of any errors or fencing off the node that appeared to be causing issues. The logs all point to the watchdog timer expiring across our three-node cluster, forcing a full reboot of all three nodes. We’re puzzled why this is happening, as there are no relevant entries in the system logs for nodes 1–3. I have attached logs for pve2, which are virtually identical to those for all three nodes. On the networking side, there were no connection losses on the ports—each node is uplinked via LACP with 2×10 Gbps links into a dedicated untagged VLAN.


The first Corosync retransmit message appeared at 10:32:18, and that's when things started to go downhill. By 11:23:52, the issues had worsened, until the machine rebooted at 13:37:20. The root connection logs are from Veeam. Our switch handling the Corosync traffic is a MikroTik, and I have also validated the LACP configuration. I've attached the relevant log file.

Thanks for your help!
 


seems like your link is flapping. is the corosync network separate? could you give more details about your network config?
 
Yes, the corosync network is separate; this applies to all 3 nodes. The screenshot is from node 2.

What I find so odd though, is those retransmits occur across those 3 nodes almost at the same time. And if the port is flapping, wouldn't the redundant LACP link take over the faulty link? Even then, wouldn't this cause the node to get fenced off instead of causing a watchdog timer expiration cluster wide?
 
using a bond as a corosync link can cause exactly these symptoms - you are effectively stacking failover and link recovery, which takes a lot longer to converge, and can mean that corosync is not able to converge itself before the link falls apart again.
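If corosync is moved off the bond onto its own interface, that interface would be defined directly in /etc/network/interfaces on each node. A minimal sketch, assuming a spare NIC named eno3 and a dedicated corosync subnet 10.10.10.0/24 (both the NIC name and the subnet are hypothetical, not taken from this thread):

```
# /etc/network/interfaces excerpt (sketch) - NIC name and subnet are hypothetical
auto eno3
iface eno3 inet static
    address 10.10.10.2/24
    # dedicated, non-bonded corosync link; no gateway needed,
    # since corosync traffic stays within the local segment
```

Note that changing the corosync network also means updating the corresponding ring address entries in /etc/pve/corosync.conf, which should be edited carefully while the cluster is quorate.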

the watchdog expiring *is* the node(s) being fenced - when a node loses the quorate partition for too long, its watchdog is no longer updated and it self-fences by rebooting; if that happens to all nodes at once, the whole cluster reboots.
 
In that case, would your recommendation be to use a single, non-bonded interface for corosync? And if so, would it be better to simply "offline" one of the bonded ports on the switch side for each PVE host (so that only one interface per host is up), or to reconfigure the interface in Proxmox as a single, unbonded one?
 
ideally you'd have two physical interfaces dedicated to corosync traffic (1G is enough - it's mostly latency that matters).
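The two-dedicated-interfaces setup maps to two knet links in corosync, which handles failover between them itself, without any bonding. A minimal sketch of the relevant corosync.conf parts, assuming hypothetical addresses 10.10.10.x and 10.10.11.x for the two dedicated networks (only the pve2 entry is shown):

```
# corosync.conf excerpt (sketch) - addresses are hypothetical
totem {
  version: 2
  transport: knet
  # knet monitors each link and fails over on its own,
  # so no bond is needed on the corosync interfaces
}

nodelist {
  node {
    name: pve2
    nodeid: 2
    ring0_addr: 10.10.10.2   # link 0: first dedicated NIC
    ring1_addr: 10.10.11.2   # link 1: second dedicated NIC
  }
  # ...matching entries for the other nodes...
}
```

This avoids the stacked-failover problem described above, because corosync sees both physical paths directly instead of sitting on top of LACP's recovery timing.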
 
Ok. Because each node has 2× 10 Gb onboard NICs. However, could you clarify what you mean by having 2 physical interfaces dedicated to corosync? Is that per node, or for the cluster as a whole? Just want to be sure I fully understand.