Cluster retransmit issues

sin3vil · Mar 11, 2026

Greetings!

We have a 3-node cluster that we're trying to add a new member to.

The moment the new member joins there's a huge retransmit storm and cluster starts breaking up, eventually locking the pmxcfs filesystem and triggering watchdogs to reboot nodes. If i disable pve-cluster and corosync services on the new node the remaining 3 nodes almost instantly recover.

The old 3 nodes are exactly the same hardware, with 1G NICs for the corosync network. These are pve1 pve3 and pve4.
The new node is different, including a Broadcom BCM57414 NetXtreme 25G NIC that we have forced to 1G while troubleshooting. This is pve2.
Note that nodeid order is pve1>pve2>pve4>pve3.

Network is working fine.
I can ping and SSH to all nodes from each other.
MTU has been tested to 1500, but we've also set knet_mtu to 1300, just to be safe.

Generally, we're seeing pve2 plus 2 of the other nodes whine about some missing messages in journal logs, constantly. the 4th node, is generally not whining.
However, we'd expect that any missing messages get retransmitted and the journal messages to go away.

At this moment pve1, pve2 and pve3 throw [TOTEM ] Retransmit List: 22 23, while pve4 is silent.
I'm assuming this means that pve1, pve3 and pve4 are missing the messages and adding them to the retransmit_list, sending it off to the next node in the chain. pve2 is probably retransmitting these 3 messages and cleaning up the retransmit_list, before passing it on to pve4 (nodeid 3), which re-adds the messages to the retransmit_list.

What we're seeing with tcpdump is that pve1, pve3 and pve4 have minimal knet traffic, passing the token along, while pve2 is constantly spamming everyone with MTU-sized packets. We have disabled encryption to try and see what kind of traffic it is, but the vast majority of packets are not human readable. Some that are readable seem to be files from /etc/pve (like known_hosts) but I'm pretty sure that pve1, pve3 and pve4 already have those at latest version based on md5sums.

Sniffing on pve1, pve2 traffic is ~6 times more than the other 2 nodes :

Code:

# for i in {2..4}; do timeout 5 tcpdump -i eno1 "host 169.254.0.10$i and port 5405" -w /tmp/dump-pve$i.pcap;done
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
780 packets captured
782 packets received by filter
0 packets dropped by kernel
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
108 packets captured
158 packets received by filter
0 packets dropped by kernel
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
131 packets captured
135 packets received by filter
0 packets dropped by kernel

Eventually the cluster sync falls apart, the ring may get reformed but we'll get retransmits for those new message ids.

All nodes are on pve-manager/8.4.17/c8c39014680186a7 (running kernel: 6.8.12-19-pve)

I'd very much appreciate some guidance here, on how to troubleshoot further, and some high level description on what Proxmox syncronizes via corosync that may be somehow conflicting on the receiving nodes, causing a retransmit loop.

EDIT:
Wanted to add that we've removed and re-joined pve2 multiple times while troubleshooting this, each time cleaning up it's corosync configs, pve-cluster database and remote node folders. We haven't rebuilt the cluster, as apart from pve2, other nodes work fine.

sin3vil · Mar 13, 2026

bump and prayers

sin3vil · Mar 17, 2026

bump and shattered hopes

spirit · Mar 18, 2026

can you send corosync log of each node ? (journalctl -u corosync).

is the nic for corosync link dedicated ? or do you have vm,storage,backup,...running on it too ? (no bandwidth saturation ? )

no spanning tree on the network ? do you use bonding or not ?

sin3vil · Mar 18, 2026

Hello Spirit, thanks for taking the time to reply.

The NIC is, normally, not dedicated. Yes, we know.
However, during aforementioned tests we had completely disabled the linux bridge for VM traffic and was monitoring both on nodes via iftop and the uplink port on the switch, and there was no saturation or even other (VLANed) traffic.
There's no bonding happening any any point. RSTP is enabled on the network in general but disabled for vmbr0 (that corosync sits on) on the proxmox nodes. It's not disabled on the switch ports however.

Note that we have multiple same clusters deployed, same "old" hardware, where a node was replaced with a "new" hardware. Old and new are exact matches with this problematic cluster. Switches and network topology are also the same. The only differences are what VMs are hosted and what exact versions the "new" hardware was joined on, In some cases it was all 7.4.x), in some 8.4.x like this and some had split (latest) 7.4.x + 8.4.x versions that weirdly worked without any fuss.

I've attache the corosync logs. This was from the last time the problematic node was joined, as I don't have it readily available to join back at the moment.
There's not much there apart from the retransmits and some link flaps.

spirit · Mar 19, 2026

About spanning-tree, you should really disable it on your physical switch port of for your proxmox nodes. a spanning tree convergence can happen on host reboot and broke the whole cluster for some second.

you don't need change knet_mtu, it's auto-compute by corosync.

it could be a bug with the broadcom nic on pve2 too. you don't have any error in kernel logs ? (#dmesg or /var/log/kern.log ? )

sin3vil · Mar 20, 2026

Sure, I'll look into spanning tree, but the switch isn't reporting any bridge cost recalcs, so I'm assuming no convergence was made. We also have this same setup in multiple other locations and never had issues with RSTP, even in cases were VMs were accidentally leaking traffic (the whole node went down but did not affect the cluster like this).

knet_mtu was changed cause most online resources pointed to MTU if you're missing messages, so while we confirmed MTU between nodes to be 1500, we still lowered it to a presumed "safe" 1300. I'll revert these changes once we get it actually working.

There are no kernel messages that appear related. NIC seems to be behaving properly. The issue appears to be stemming from the three old nodes, as they request retransmits, pve2 resends them, confirmed via tcpdump on both ends that packets get received on the network level, and continue requesting the same retransmits.
I've also checked 2 other locations that have the same PVE version, kernel and NIC firmware that are working fine.

kd-infradijon · Apr 10, 2026

Hello Sinevil,

We have a similar problem than you. Our previous installations worked fine, using the same equipment and the same connection, but know, we have the same errors than you.

We've tried a variety of troubleshooting steps in every possible direction—both on the system side and the network side—but without success.

Have you found any more concrete leads on your end since then, or are you still experiencing the same issue?

Regards,
IDEZ Ugo

kd-infradijon · Apr 10, 2026

I post on the forum too about this problem

:
- https://forum.proxmox.com/threads/c...tion-problems-proxmox-cluster-3-nodes.182559/

kd-infradijon · Apr 13, 2026

sin3vil said:
Greetings!

We have a 3-node cluster that we're trying to add a new member to.

The moment the new member joins there's a huge retransmit storm and cluster starts breaking up, eventually locking the pmxcfs filesystem and triggering watchdogs to reboot nodes. If i disable pve-cluster and corosync services on the new node the remaining 3 nodes almost instantly recover.

The old 3 nodes are exactly the same hardware, with 1G NICs for the corosync network. These are pve1 pve3 and pve4.
The new node is different, including a Broadcom BCM57414 NetXtreme 25G NIC that we have forced to 1G while troubleshooting. This is pve2.
Note that nodeid order is pve1>pve2>pve4>pve3.

Network is working fine.
I can ping and SSH to all nodes from each other.
MTU has been tested to 1500, but we've also set knet_mtu to 1300, just to be safe.

Generally, we're seeing pve2 plus 2 of the other nodes whine about some missing messages in journal logs, constantly. the 4th node, is generally not whining.
However, we'd expect that any missing messages get retransmitted and the journal messages to go away.

At this moment pve1, pve2 and pve3 throw [TOTEM ] Retransmit List: 22 23, while pve4 is silent.
I'm assuming this means that pve1, pve3 and pve4 are missing the messages and adding them to the retransmit_list, sending it off to the next node in the chain. pve2 is probably retransmitting these 3 messages and cleaning up the retransmit_list, before passing it on to pve4 (nodeid 3), which re-adds the messages to the retransmit_list.

What we're seeing with tcpdump is that pve1, pve3 and pve4 have minimal knet traffic, passing the token along, while pve2 is constantly spamming everyone with MTU-sized packets. We have disabled encryption to try and see what kind of traffic it is, but the vast majority of packets are not human readable. Some that are readable seem to be files from /etc/pve (like known_hosts) but I'm pretty sure that pve1, pve3 and pve4 already have those at latest version based on md5sums.

Sniffing on pve1, pve2 traffic is ~6 times more than the other 2 nodes :

Code:

# for i in {2..4}; do timeout 5 tcpdump -i eno1 "host 169.254.0.10$i and port 5405" -w /tmp/dump-pve$i.pcap;done tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes 780 packets captured 782 packets received by filter 0 packets dropped by kernel tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes 108 packets captured 158 packets received by filter 0 packets dropped by kernel tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes 131 packets captured 135 packets received by filter 0 packets dropped by kernel

Eventually the cluster sync falls apart, the ring may get reformed but we'll get retransmits for those new message ids.

All nodes are on pve-manager/8.4.17/c8c39014680186a7 (running kernel: 6.8.12-19-pve)

I'd very much appreciate some guidance here, on how to troubleshoot further, and some high level description on what Proxmox syncronizes via corosync that may be somehow conflicting on the receiving nodes, causing a retransmit loop.

EDIT:
Wanted to add that we've removed and re-joined pve2 multiple times while troubleshooting this, each time cleaning up it's corosync configs, pve-cluster database and remote node folders. We haven't rebuilt the cluster, as apart from pve2, other nodes work fine.

Hello,

Any news about your problem or not ?
We are still stuck.

Regards,
IDEZ Ugo

kd-infradijon · May 12, 2026

Hello,

We have solved the problem on our side

.

Regards,
IDEZ Ugo

sin3vil · May 12, 2026

Hi @kd-infradijon,

Thanks for replying back.
I can see from your thread that the solution was to force 10G (or 25G) links instead of auto-negotiation?
We didn't have the ability to do this in our case, as the three existing nodes have 1G NICs, but forcing everything to 1G did not solve the re-transmit issue.
Anyway, glad that you found a solution that works for you.

Cluster retransmit issues

sin3vil

Member

sin3vil

Member

sin3vil

Member

spirit

Distinguished Member

sin3vil

Member

Attachments

spirit

Distinguished Member

sin3vil

Member

kd-infradijon

New Member

kd-infradijon

New Member

kd-infradijon

New Member

kd-infradijon

New Member

sin3vil

Member

We value your privacy