Cluster retransmit issues

sin3vil

Mar 11, 2024
Greetings!

We have a 3-node cluster that we're trying to add a new member to.

The moment the new member joins, there's a huge retransmit storm and the cluster starts breaking up, eventually locking the pmxcfs filesystem and triggering the watchdogs to reboot nodes. If I disable the pve-cluster and corosync services on the new node, the remaining 3 nodes recover almost instantly.

The three old nodes are identical hardware, with 1G NICs for the corosync network; these are pve1, pve3 and pve4.
The new node, pve2, is different hardware, including a Broadcom BCM57414 NetXtreme 25G NIC that we have forced to 1G while troubleshooting.
Note that the nodeid order is pve1 > pve2 > pve4 > pve3.

The network itself seems fine:
I can ping and SSH between all nodes.
An MTU of 1500 has been verified, but we've also set knet_mtu to 1300, just to be safe.
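
For reference, the MTU check was along these lines (a sketch; the ring IPs match the tcpdump capture further down, and the payload arithmetic assumes plain IPv4 ICMP):

```shell
# Verify a 1500-byte path MTU with DF set:
# 1500 - 20 (IP header) - 8 (ICMP header) = 1472 bytes of ICMP payload.
MTU=1500
PAYLOAD=$((MTU - 28))
for i in 1 2 3 4; do
    # -M do sets DF so oversized packets fail instead of fragmenting
    if ping -M do -s "$PAYLOAD" -c 3 -W 1 "169.254.0.10$i" >/dev/null 2>&1; then
        echo "pve$i: MTU $MTU ok"
    else
        echo "pve$i: path MTU below $MTU (or host down)"
    fi
done
```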

Generally, we're seeing pve2 plus two of the other nodes constantly complain about missing messages in the journal logs; the fourth node is generally quiet.
However, we'd expect any missing messages to be retransmitted and the journal messages to go away.

At this moment pve1, pve2 and pve3 throw [TOTEM ] Retransmit List: 22 23, while pve4 is silent.
I'm assuming this means that pve1, pve3 and pve4 are missing the messages and adding them to the retransmit_list, which is sent along to the next node in the ring. pve2 is probably retransmitting the missing messages and clearing the retransmit_list before passing it on to pve4 (nodeid 3), which re-adds the messages to the retransmit_list.
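
To see which message ids keep reappearing, we count them straight from the journal; a sketch with two sample log lines inlined (on a real node you'd pipe `journalctl -u corosync` into the same filter instead):

```shell
# Extract message ids from "[TOTEM ] Retransmit List: ..." lines and count repeats.
# Sample lines inlined for illustration; feed journalctl -u corosync on a node.
cat <<'EOF' | grep -o 'Retransmit List:.*' | tr ' ' '\n' | grep -E '^[0-9a-f]+$' | sort | uniq -c | sort -rn
Mar 11 10:00:01 pve1 corosync[1234]:   [TOTEM ] Retransmit List: 22 23
Mar 11 10:00:02 pve3 corosync[2345]:   [TOTEM ] Retransmit List: 22 23
EOF
```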

What we're seeing with tcpdump is that pve1, pve3 and pve4 have minimal knet traffic, passing the token along, while pve2 is constantly spamming everyone with MTU-sized packets. We disabled encryption to see what kind of traffic it is, but the vast majority of packets are not human-readable. Some that are readable seem to be files from /etc/pve (like known_hosts), but I'm pretty sure that pve1, pve3 and pve4 already have those at the latest version, based on md5sums.
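
For reference, this is roughly how we compared /etc/pve content between nodes, one digest per node (the ssh loop and ring IPs are specific to our setup, and shipping the function over ssh assumes bash on all nodes):

```shell
# One stable digest over all files under a directory; if the digests match
# across nodes, their pmxcfs contents are identical.
tree_digest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 md5sum | md5sum | cut -d' ' -f1 )
}

# On the cluster we run it per node over the ring IPs:
#   for i in 1 2 3 4; do
#       echo "pve$i: $(ssh 169.254.0.10$i "$(declare -f tree_digest); tree_digest /etc/pve")"
#   done
```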

Sniffing on pve1, pve2's traffic is ~6 times that of the other two nodes:
Code:
# for i in {2..4}; do timeout 5 tcpdump -i eno1 "host 169.254.0.10$i and port 5405" -w /tmp/dump-pve$i.pcap;done
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
780 packets captured
782 packets received by filter
0 packets dropped by kernel
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
108 packets captured
158 packets received by filter
0 packets dropped by kernel
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
131 packets captured
135 packets received by filter
0 packets dropped by kernel

Eventually the cluster sync falls apart; the ring may get re-formed, but then we get retransmits for the new message ids.

All nodes are on pve-manager/8.4.17/c8c39014680186a7 (running kernel: 6.8.12-19-pve)

I'd very much appreciate some guidance on how to troubleshoot further, and a high-level description of what Proxmox synchronizes via corosync that might be conflicting on the receiving nodes and causing a retransmit loop.

EDIT:
Wanted to add that we've removed and re-joined pve2 multiple times while troubleshooting, each time cleaning up its corosync configs, pve-cluster database and remote node folders. We haven't rebuilt the cluster, since apart from pve2 the other nodes work fine.
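
For reference, the cleanup on pve2 before each re-join is roughly the documented "separate a node" sequence; sketched here as a dry run (these are our notes, not an authoritative procedure — set DRY_RUN= to actually execute it):

```shell
# Dry run of the pre-rejoin cleanup on pve2.
# DRY_RUN=echo only prints the commands; set DRY_RUN= to execute them.
DRY_RUN=echo
$DRY_RUN systemctl stop pve-cluster corosync
$DRY_RUN rm -rf /etc/corosync/*          # node-local corosync config
$DRY_RUN rm -rf /var/lib/pve-cluster     # pmxcfs database (backs /etc/pve, incl. remote node folders)
$DRY_RUN systemctl start pve-cluster     # comes back standalone, ready for a fresh join
```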