Cluster retransmit issues

sin3vil

Member
Mar 11, 2024
Greetings!

We have a 3-node cluster that we're trying to add a new member to.

The moment the new member joins there's a huge retransmit storm and the cluster starts breaking up, eventually locking the pmxcfs filesystem and triggering watchdogs to reboot nodes. If I disable the pve-cluster and corosync services on the new node, the remaining 3 nodes recover almost instantly.

The old 3 nodes are exactly the same hardware, with 1G NICs for the corosync network. These are pve1 pve3 and pve4.
The new node is different, including a Broadcom BCM57414 NetXtreme 25G NIC that we have forced to 1G while troubleshooting. This is pve2.
Note that nodeid order is pve1>pve2>pve4>pve3.

Network is working fine.
I can ping and SSH to all nodes from each other.
Path MTU has been verified at 1500, but we've also set knet_mtu to 1300, just to be safe.
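For reference, the MTU check was the usual don't-fragment ping between nodes (addresses are from our 169.254.0.0/16 corosync net, per the tcpdump loop below):

```shell
# 1472 bytes of ICMP payload + 8 (ICMP) + 20 (IP) = a full 1500-byte frame.
# -M do forbids fragmentation, so an undersized path fails loudly
# instead of silently fragmenting.
ping -c 3 -M do -s 1472 169.254.0.102   # pve2
# A smaller probe that should always pass if the link is sane:
ping -c 3 -M do -s 1272 169.254.0.102
```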

Generally, we're seeing pve2 plus 2 of the other nodes constantly complain about missing messages in the journal logs, while the 4th node generally stays quiet.
However, we'd expect any missing messages to get retransmitted and the journal messages to stop.

At this moment pve1, pve2 and pve3 throw [TOTEM ] Retransmit List: 22 23, while pve4 is silent.
I'm assuming this means that pve1, pve3 and pve4 are missing the messages and adding them to the retransmit_list, sending it off to the next node in the chain. pve2 is probably retransmitting these 3 messages and cleaning up the retransmit_list, before passing it on to pve4 (nodeid 3), which re-adds the messages to the retransmit_list.
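For anyone following along, the per-link and quorum state each node sees can be pulled with the stock corosync tools (run on each node and compare):

```shell
# Per-link status as knet sees it (connected/disconnected per node and link)
corosync-cfgtool -s
# Quorum view and membership list
corosync-quorumtool -s
# Runtime totem (srp) counters, which include message/token statistics
corosync-cmapctl -m stats | grep srp
```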

What we're seeing with tcpdump is that pve1, pve3 and pve4 have minimal knet traffic, passing the token along, while pve2 is constantly spamming everyone with MTU-sized packets. We have disabled encryption to try and see what kind of traffic it is, but the vast majority of packets are not human-readable. Some that are readable seem to be files from /etc/pve (like known_hosts), but I'm pretty sure that pve1, pve3 and pve4 already have the latest versions of those, based on md5sums.

Sniffing on pve1, pve2's traffic is roughly 6 times that of the other 2 nodes:
Code:
# for i in {2..4}; do timeout 5 tcpdump -i eno1 "host 169.254.0.10$i and port 5405" -w /tmp/dump-pve$i.pcap;done
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
780 packets captured
782 packets received by filter
0 packets dropped by kernel
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
108 packets captured
158 packets received by filter
0 packets dropped by kernel
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
131 packets captured
135 packets received by filter
0 packets dropped by kernel
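A quick way to compare those captures side by side, byte counts included (capinfos ships with wireshark; plain tcpdump works too):

```shell
# Packet count per capture without extra tooling:
# tcpdump -r reads a saved pcap, -nn skips name resolution
for i in 2 3 4; do
    echo "pve$i: $(tcpdump -nn -r /tmp/dump-pve$i.pcap 2>/dev/null | wc -l) packets"
done
# With wireshark's capinfos installed, packet counts (-c) and data size (-d):
capinfos -c -d /tmp/dump-pve*.pcap
```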

Eventually cluster sync falls apart; the ring may re-form, but we then get retransmits for the new message IDs.

All nodes are on pve-manager/8.4.17/c8c39014680186a7 (running kernel: 6.8.12-19-pve)

I'd very much appreciate some guidance here on how to troubleshoot further, and a high-level description of what Proxmox synchronizes via corosync that may be somehow conflicting on the receiving nodes, causing a retransmit loop.

EDIT:
Wanted to add that we've removed and re-joined pve2 multiple times while troubleshooting this, each time cleaning up its corosync configs, pve-cluster database and remote node folders. We haven't rebuilt the cluster, as apart from pve2 the other nodes work fine.
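For reference, the cleanup on each re-join followed the usual separate-a-node steps from the PVE admin guide (paths are the stock ones):

```shell
# On pve2: stop cluster services and drop the local cluster state
systemctl stop pve-cluster corosync
pmxcfs -l                      # re-mount /etc/pve in local mode
rm -f /etc/pve/corosync.conf
rm -rf /etc/corosync/*
killall pmxcfs
rm -rf /var/lib/corosync/*
systemctl start pve-cluster

# On a remaining node: remove the departed member and its leftovers
pvecm delnode pve2
rm -rf /etc/pve/nodes/pve2
```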
 
Hello Spirit, thanks for taking the time to reply.

The NIC is, normally, not dedicated. Yes, we know.
However, during the aforementioned tests we had completely disabled the Linux bridge for VM traffic and were monitoring both on the nodes via iftop and on the uplink port of the switch, and there was no saturation or even other (VLAN-tagged) traffic.
There's no bonding happening at any point. RSTP is enabled on the network in general but disabled for vmbr0 (which corosync sits on) on the Proxmox nodes. It's not disabled on the switch ports, however.
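For what it's worth, the node-side STP state can be double-checked like this (vmbr0 being the bridge corosync sits on here):

```shell
# 0 means STP is disabled on the Linux bridge itself
cat /sys/class/net/vmbr0/bridge/stp_state
# Same information via iproute2 (-d shows bridge details)
ip -d link show vmbr0 | grep -o 'stp_state [01]'
```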

Note that we have multiple identical clusters deployed: same "old" hardware, where a node was replaced with "new" hardware. Old and new are exact matches with this problematic cluster. Switches and network topology are also the same. The only differences are which VMs are hosted and which exact versions the "new" hardware was joined on: in some cases all 7.4.x, in some 8.4.x like this one, and some ran a split of latest 7.4.x + 8.4.x that, oddly, worked without any fuss.

I've attached the corosync logs. These are from the last time the problematic node was joined, as I don't have it readily available to join back at the moment.
There's not much there apart from the retransmits and some link flaps.
 

Attachments

About spanning tree: you should really disable it on the physical switch ports for your Proxmox nodes. A spanning-tree convergence can happen on host reboot and break the whole cluster for a few seconds.

You don't need to change knet_mtu; it's auto-computed by corosync.

It could also be a bug with the Broadcom NIC on pve2. Do you see any errors in the kernel logs (dmesg or /var/log/kern.log)?
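A quick way to check the NIC side on pve2 (interface and driver names are examples based on the thread: eno1 and bnxt_en, the driver for the BCM57414; substitute yours):

```shell
# Any driver-level complaints from the Broadcom driver or the interface?
dmesg -T | grep -iE 'bnxt|eno1|link is|timeout|error'
# Non-zero NIC counters only; watch for rx_discards, rx_drops, tx_errors
ethtool -S eno1 | grep -vw 0
# Offload settings worth diffing against a working node, since
# segmentation offloads sometimes misbehave at forced link speeds
ethtool -k eno1 | grep -E 'segmentation|offload'
```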
 
Sure, I'll look into spanning tree, but the switch isn't reporting any bridge cost recalculations, so I'm assuming no convergence occurred. We also have this same setup in multiple other locations and have never had issues with RSTP, even in cases where VMs were accidentally leaking traffic (the whole node went down, but it did not affect the cluster like this).

knet_mtu was changed because most online resources point to MTU problems when messages go missing, so even though we confirmed a 1500 MTU between nodes, we still lowered it to a presumed-safe 1300. I'll revert these changes once we get it actually working.

There are no kernel messages that appear related; the NIC seems to be behaving properly. The issue appears to stem from the three old nodes: they request retransmits, pve2 resends them (confirmed via tcpdump on both ends that the packets arrive at the network level), and they keep requesting the same retransmits.
I've also checked 2 other locations with the same PVE version, kernel and NIC firmware, and they are working fine.