corosync [TOTEM ] Retransmit List...

derringer

Renowned Member
May 17, 2012
I wanted to post about an issue I am currently troubleshooting. It is not causing problems yet, but it implies lag in the cluster communications and so could cause an issue down the road. This is a homelab install where I added a new server to the cluster, so I am still in the validation phase for this new server. This thread can be a little experiment where I will update on what I have found, what I am doing to troubleshoot it, and what I finally do to (hopefully) fix it for production.

First off, this is an onboard Realtek 10GbE NIC on the server, with the newer RTL8127A chipset, using firmware in Linux of "rtl8127a-1_0.0.5 05/14/25".

The NIC is connected straight through to the NIC of the other cluster node where this issue occurs. This is a 1GbE link because the other side is a 1GbE NIC, with no switch, using the simplified corosync mesh/ring network shown in the Proxmox documentation (i.e. this is a 3-node cluster with 2 NIC ports on each node dedicated to ring/mesh-based cluster communications, not using SDN, but the simple method that broadcasts to both NICs in the ring on each of the 3 nodes).
It may also be important to note that there is a secondary corosync failover network (Link 1), which is a switched 10GbE network. So far, this network has never been called on to take over during the above error. (I have a secondary question on that, but will hold it until the end.)

Just to get things out of the way, please don't recommend a different type of SDN or non-ring network, or using the 10GbE switched network, etc., as this is really a proof-of-concept homelab thing where I am trying different setups to see how they work. This one, specifically, is trying to add a 1GbE-connected node to a cluster where the other 2 nodes are direct-connected at 10GbE. And, by the way, it is fully working; I am just seeing a few of the above TOTEM Retransmit entries in the logs from time to time.


So, the specific case where the above error occurs is when the source node initiates a replication to the secondary node. Replications run over the corosync network, as they have on all other nodes in this cluster with this setup for over a year with zero issues. I have seen failovers to the secondary corosync network, which is by design if it ever has to, but not in this error case:

-ZFS replication from the node this error is on (which we will call the 'Source Node') to the secondary node. The Retransmit List log entries generally last 1-10 seconds, depending on how much data ZFS replication must send over the link when replicating.
-These log items do not occur when the secondary node replicates back to this server in either server's logs, so it is specific to this server *sending* to the other.
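
To put numbers on the "1-10 seconds" observation, something like the following could group the Retransmit List entries into bursts and report how long each lasted. This is only a rough sketch: it assumes the log was exported with `journalctl -u corosync -o short-iso`, so each line starts with an ISO timestamp, and the function name `retransmit_bursts` and the 2-second grouping gap are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape (journalctl short-iso style); the hostname, PID and
# sequence numbers in this example comment are hypothetical:
# 2025-01-01T02:00:03+0000 node1 corosync[1234]:   [TOTEM ] Retransmit List: 1a2
LINE_RE = re.compile(r"^(\S+)\s.*\[TOTEM\s*\]\s+Retransmit List:")


def retransmit_bursts(lines, max_gap=timedelta(seconds=2)):
    """Group Retransmit List entries into bursts.

    Returns a list of (first_timestamp, last_timestamp, entry_count) tuples.
    Entries closer together than max_gap are counted as the same burst.
    """
    bursts = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a Retransmit List entry
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S%z")
        if bursts and ts - bursts[-1][1] <= max_gap:
            start, _, count = bursts[-1]
            bursts[-1] = (start, ts, count + 1)  # extend the current burst
        else:
            bursts.append((ts, ts, 1))  # start a new burst
    return bursts
```

Comparing the burst start times against the replication schedule would show whether every burst lines up with a ZFS send from the Source Node.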

Items tried so far:
-I have tried sending back the other direction on that same link, and the retransmits never occur.
-These retransmits never occur on the other link to the third node, in either direction, but that is on different NICs.


Next Steps:
-Will move the link between these two servers to other NICs on each end, as the issue could still theoretically be the secondary target node's NIC on this direct link (this troubleshooting step will narrow the issue to one of the 2 NICs involved in this ring/mesh network).
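
One way to get a bit more evidence out of each NIC swap is to snapshot the kernel's per-interface counters before and after a replication run and see whether errors or drops moved. A minimal sketch, assuming the standard Linux sysfs layout under `/sys/class/net/<iface>/statistics/`; the helper names and the choice of counters are just for illustration:

```python
from pathlib import Path

# Counters read from /sys/class/net/<iface>/statistics/ (standard sysfs files).
COUNTERS = ("tx_packets", "tx_errors", "tx_dropped", "rx_errors", "rx_dropped")


def read_counters(iface, root="/sys/class/net"):
    """Return the current value of each counter for one interface."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def counter_delta(before, after):
    """Return how much each counter changed between two snapshots."""
    return {name: after[name] - before[name] for name in before}
```

Usage would be `before = read_counters("enp5s0")`, run the replication, then `counter_delta(before, read_counters("enp5s0"))`, where `enp5s0` is a placeholder interface name. Non-zero error or drop deltas on one end only would point at that NIC or its driver.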


I'm mostly going to document here how I fix the issue or figure out the problem, so no need for anything but discussion here, especially if you've dealt with similar issues in the past.
For the record, my suspicion is that it is the driver on this 'newish' Realtek 10GbE NIC (the driver in use is r8169, by the way, which from my research was updated with a limited number of lines of code to make it compatible with this new 10GbE version, when the driver used to be 1GbE only).
 
Update on this, for anyone interested:

The issue followed the link when I moved it from the Realtek 10GbE NIC (r8169 driver) to a different NIC (a Realtek 2.5GbE with a different chipset and a different kernel driver), which is also on the motherboard.

Anecdotally, it appeared to get better with the most recent Proxmox kernel and version update on the machine. It was harder to reproduce, but I was eventually able to reproduce it for 7 seconds last night during a scheduled late-night ZFS replication, as noted above. I couldn't reproduce it with full machine migrations where the second node had no ZFS replication 'base': each was a full copy of the VM at 125MB/s on that NIC for 30+ minutes, and the retransmits did not appear during the migration.
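
For context on that 125MB/s figure, a quick back-of-the-envelope check (raw line rate only, ignoring Ethernet and TCP framing overhead; the function names are just for illustration) shows it is essentially the ceiling of a 1GbE link, so the migration kept the link saturated for the whole run:

```python
def line_rate_mb_per_s(link_gbit):
    """Raw line rate in decimal MB/s for a link speed given in Gbit/s."""
    return link_gbit * 1e9 / 8 / 1e6


def data_moved_gb(rate_mb_per_s, minutes):
    """Decimal GB transferred at a steady rate over the given duration."""
    return rate_mb_per_s * minutes * 60 / 1000


print(line_rate_mb_per_s(1))   # 125.0
print(data_moved_gb(125, 30))  # 225.0
```

So sustained saturation of the link on its own does not seem to be enough to trigger the retransmits, which suggests looking at what is different about the ZFS replication traffic.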

Troubleshooting continues... The next step is to add a new NIC, probably an Intel PCIe server NIC, to eliminate the possibility that both Realtek NICs on this box are at fault.