[TOTEM ] Retransmit List ... causing entire HA cluster to reboot unexpectedly.

Jun 6, 2025
After our new cluster deployment was working well, we encountered an issue on two separate occasions over a two-week period: the entire cluster rebooted randomly without notifying us of any errors or fencing off the node that appeared to be causing issues. The logs all point to the watchdog timer expiring across our three-node cluster, forcing a full reboot of all three nodes. We’re puzzled why this is happening, as there are no relevant entries in the system logs for nodes 1–3. I have attached logs for pve2, which are virtually identical to those for all three nodes. On the networking side, there were no connection losses on the ports—each node is uplinked via LACP with 2×10 Gbps links into a dedicated untagged VLAN.


The first Corosync retransmit message appeared at 10:32:18, and that's when things started to go downhill. By 11:23:52, the issues had worsened, until the machine rebooted at 13:37:20. The root connection logs are from Veeam. Our switch handling the Corosync traffic is a MikroTik, and I have also validated the LACP configuration. I've attached the relevant log file.

Thanks for your help!
 


seems like your link is flapping. is the corosync network separate? could you give more details about your network config?
 
Yes, the corosync network is separate; this applies to all 3 nodes. The screenshot is from node 2.

What I find so odd though, is those retransmits occur across those 3 nodes almost at the same time. And if the port is flapping, wouldn't the redundant LACP link take over the faulty link? Even then, wouldn't this cause the node to get fenced off instead of causing a watchdog timer expiration cluster wide?
 
using a bond as a corosync link can cause exactly these symptoms - you are effectively stacking failover and link recovery, which takes a lot longer to converge, and can mean that corosync is not able to converge itself before the link falls apart again.
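If corosync is moved off the bond onto its own interface, that interface would be defined directly in /etc/network/interfaces on each node. A minimal sketch, assuming a spare NIC named eno3 and a dedicated corosync subnet 10.10.10.0/24 (both the NIC name and the subnet are hypothetical, not taken from this thread):

```
# /etc/network/interfaces excerpt (sketch) - NIC name and subnet are hypothetical
auto eno3
iface eno3 inet static
    address 10.10.10.2/24
    # dedicated, non-bonded corosync link; no gateway needed,
    # since corosync traffic stays within the local segment
```

Note that changing the corosync network also means updating the corresponding ring address entries in /etc/pve/corosync.conf, which should be edited carefully while the cluster is quorate.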

the watchdog expiring *is* the node(s) being fenced - when a node loses the quorate partition for too long, its watchdog is no longer updated and it self-fences by rebooting; if that happens to all nodes at once, the whole cluster reboots.
 
In that case, would your recommendation be to use a single, non-bonded interface for corosync? And if so, would it be better to simply "offline" one of the bonded ports on the switch side for each PVE host (so that only one interface per host is up), or to reconfigure the interface in Proxmox as a single, unbonded one?
 
ideally you'd have two physical interfaces dedicated to corosync traffic (1G is enough - it's mostly latency that matters).
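The two-dedicated-interfaces setup maps to two knet links in corosync, which handles failover between them itself, without any bonding. A minimal sketch of the relevant corosync.conf parts, assuming hypothetical addresses 10.10.10.x and 10.10.11.x for the two dedicated networks (only the pve2 entry is shown):

```
# corosync.conf excerpt (sketch) - addresses are hypothetical
totem {
  version: 2
  transport: knet
  # knet monitors each link and fails over on its own,
  # so no bond is needed on the corosync interfaces
}

nodelist {
  node {
    name: pve2
    nodeid: 2
    ring0_addr: 10.10.10.2   # link 0: first dedicated NIC
    ring1_addr: 10.10.11.2   # link 1: second dedicated NIC
  }
  # ...matching entries for the other nodes...
}
```

This avoids the stacked-failover problem described above, because corosync sees both physical paths directly instead of sitting on top of LACP's recovery timing.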
 
Ok. Because each node has 2× 10 Gb onboard NICs. However, could you clarify what you mean by having 2 physical interfaces dedicated to corosync? Is that per node, or for the cluster as a whole? Just want to be sure I fully understand.