I wanted to post about an issue I am currently troubleshooting. It is not causing problems yet, but it implies lag in the cluster communications and so could cause an issue down the road. This is a homelab install where I added a new server to the cluster, so I am still in the validation phase for this new server. This thread will be a little experiment: I will update it with what I have found, what I am doing to troubleshoot it, and what I finally (hopefully) do to fix it for production.
First off, this is an onboard Realtek 10GbE NIC on the server, using the newer RTL8127A chipset, with Linux firmware "rtl8127a-1_0.0.5 05/14/25".
The NIC is connected directly, with no switch, to the NIC of the other cluster node where this issue occurs. This is a 1GbE link because the other side is a 1GbE NIC. It uses the simplified Corosync mesh/ring cluster network shown in the Proxmox documentation (i.e. this is a 3-node cluster with 2 NIC ports on each node dedicated to ring/mesh cluster communications, not using SDN, but the simple method that broadcasts to both NICs in the ring on each of the 3 nodes).
It may also be important to note that there is a secondary Corosync failover network (Link 1), which is a switched 10GbE network. So far, this network has never been called on to take over during the above error. (I have a secondary question on that, but will hold it until the end.)
Just to get things out of the way, please don't recommend a different type of SDN or non-ring network, or using the 10GbE switched network, etc. This is really a proof-of-concept homelab where I am trying different setups to see how they work. This one, specifically, adds a 1GbE-connected node to a cluster where the other 2 nodes are directly connected at 10GbE. And, by the way, it is fully working; I am just seeing a few of the above TOTEM Retransmit errors in the logs from time to time.
So, the specific case where the above error occurs is when the source initiates a replication to the secondary node (replications run over the Corosync network, as they have on all other nodes in this cluster with this setup for over a year with zero issues; I have seen failovers to the secondary Corosync network, which happen by design when needed, but not in this error case):
-ZFS replication from the node this error is on (which we will call the 'Source Node') to the secondary node. The Retransmit List log entries generally last 1-10 seconds, depending on how much data ZFS replication must send over the link.
-These log entries do not appear in either server's logs when the secondary node replicates back to this server, so the issue is specific to this server *sending* to the other.
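To put numbers on how long each run of Retransmit List entries lasts, something like the sketch below could group the log lines into bursts. It is only a rough sketch: it assumes each log line starts with an ISO 8601 timestamp that Python's `datetime.fromisoformat` accepts, the sample lines are made up for illustration, and the 2-second gap used to separate bursts is an arbitrary choice.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "<ISO timestamp> <host> corosync[pid]: [TOTEM ] Retransmit List: <seq ids>"
LINE_RE = re.compile(r"^(\S+)\s.*\[TOTEM\s*\]\s+Retransmit List:\s*(.*)$")


def parse_retransmits(lines):
    """Return (timestamp, number_of_sequence_ids) for each Retransmit List line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip unrelated log lines
        ts = datetime.fromisoformat(m.group(1))
        events.append((ts, len(m.group(2).split())))
    return events


def group_bursts(events, max_gap=timedelta(seconds=2)):
    """Merge events that are at most max_gap apart into one burst."""
    bursts = []
    for ts, count in sorted(events):
        if bursts and ts - bursts[-1]["end"] <= max_gap:
            bursts[-1]["end"] = ts
            bursts[-1]["lines"] += 1
            bursts[-1]["seqs"] += count
        else:
            bursts.append({"start": ts, "end": ts, "lines": 1, "seqs": count})
    return bursts


# Made-up sample lines, only to show the expected shape of the input.
SAMPLE = [
    "2025-06-01T02:00:03 node1 corosync[1234]:   [TOTEM ] Retransmit List: 1a 1b 1c",
    "2025-06-01T02:00:04 node1 corosync[1234]:   [TOTEM ] Retransmit List: 1b 1c",
    "2025-06-01T02:00:05 node1 example[999]: unrelated log line",
    "2025-06-01T02:15:02 node1 corosync[1234]:   [TOTEM ] Retransmit List: 2f",
]

if __name__ == "__main__":
    for b in group_bursts(parse_retransmits(SAMPLE)):
        duration = (b["end"] - b["start"]).total_seconds()
        print(b["start"], f"{duration:.0f}s", b["lines"], "lines", b["seqs"], "seq ids")
```

Lining the burst start times up against the replication schedule should show whether every burst really coincides with an outgoing replication from the Source Node.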
Items tried so far:
-I have tried sending in the other direction on that same link, and the retransmits never occur.
-These retransmits never occur on the other link to the third node, in either direction, but that is on different NICs.
Next Steps:
-Will move the link between these two servers to other NICs on each end, as the issue could still theoretically be the secondary target node's NIC on this direct link. This troubleshooting step should narrow the issue to one of the 2 NICs involved in this ring/mesh network.
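While swapping NICs, it may also help to compare each interface's error and drop counters before and after a replication run. Below is a minimal sketch that reads the standard Linux counters under `/sys/class/net/<iface>/statistics/` and reports which ones moved. The interface name in the usage comment is a placeholder, and the keyword filter is just my guess at which counters are interesting.

```python
from pathlib import Path


def read_counters(iface, sys_root="/sys/class/net"):
    """Read every numeric counter under <sys_root>/<iface>/statistics into a dict."""
    stats_dir = Path(sys_root) / iface / "statistics"
    counters = {}
    for f in sorted(stats_dir.iterdir()):
        try:
            counters[f.name] = int(f.read_text().strip())
        except (OSError, ValueError):
            continue  # ignore anything unreadable or non-numeric
    return counters


def changed_counters(before, after, keywords=("err", "drop", "collisions")):
    """Return {name: delta} for error-like counters that changed between snapshots."""
    deltas = {}
    for name, new_value in after.items():
        if not any(k in name for k in keywords):
            continue
        delta = new_value - before.get(name, 0)
        if delta != 0:
            deltas[name] = delta
    return deltas


# Usage (the interface name "enp1s0" is a placeholder, not from my setup):
#   before = read_counters("enp1s0")
#   ... trigger a replication from the Source Node and wait for it to finish ...
#   after = read_counters("enp1s0")
#   print(changed_counters(before, after))
```

If the counters on one end climb during the bursts and the other end stays clean, that would point at a specific NIC before any cables get moved.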
I'm mostly going to document here how I fix the issue or figure out the problem, so no need for anything but discussion here, especially if you've dealt with similar issues in the past.
For the record, my suspicion is the driver for this 'newish' Realtek 10GbE NIC. The driver in use is r8169, which, from what I have read, was originally 1GbE-only and was updated with a limited number of lines of code to make it compatible with this new 10GbE chip.
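To record exactly which driver and firmware a NIC is running before and after any change, the output of `ethtool -i <iface>` can be parsed into a dict, roughly as sketched here. The sample text is illustrative only: the driver and firmware strings are the ones mentioned in this post, while the version and bus-info values are placeholders.

```python
import subprocess


def parse_ethtool_info(text):
    """Parse 'key: value' lines as printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_info(iface):
    """Run `ethtool -i <iface>` and return the parsed fields (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-i", iface], capture_output=True, text=True, check=True
    )
    return parse_ethtool_info(result.stdout)


# Illustrative sample; version and bus-info are placeholders.
SAMPLE_OUTPUT = """\
driver: r8169
version: example-kernel-version
firmware-version: rtl8127a-1_0.0.5 05/14/25
bus-info: 0000:03:00.0
"""

if __name__ == "__main__":
    info = parse_ethtool_info(SAMPLE_OUTPUT)
    print(info["driver"], "|", info["firmware-version"])
```

Saving this output for both ends of the link makes it easier to tell later whether a kernel or firmware update changed the behavior.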