Howto: Fix corosync [TOTEM ] Retransmit List errors

wosp · Renowned Member · Apr 18, 2015 · The Netherlands
Hello,

Since we've seen (and fixed) "corosync [TOTEM ] Retransmit List: XXXX" errors in /var/log/cluster/corosync.log on several Proxmox VE clusters, and the information on the internet about the solution is not always clear, I thought it was a good idea to share how to fix these errors. I strongly recommend fixing them, even if there are only a couple and you do not have any problems right now. I've seen a Proxmox VE cluster run for 137+ days without any noticeable issues (apart from some retransmit errors), but one day one node produced many totem retransmits, the other nodes in the cluster stopped seeing it, and it was fenced.

1. First of all, be absolutely sure multicast traffic is working fine. Please see: https://pve.proxmox.com/wiki/Multicast_notes. I've had the best results with an IGMP querier enabled per VLAN on the switch(es), and the IGMP querier disabled on the Linux bridge(s) of your Proxmox VE nodes (/etc/network/interfaces). In an HA setup, be sure to configure the IGMP querier on at least 2 switches: if it's only configured on 1 switch and that switch fails, your whole cluster will fail within a couple of minutes after the switch failure. To disable the querier on the bridge:

Code:
post-up ( echo 0 > /sys/devices/virtual/net/$IFACE/bridge/multicast_querier )

If you use the Proxmox VE built-in firewall, be sure to allow multicast traffic. Once everything is configured, test multicast traffic with omping.
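For example, a multicast test along the lines of what the Proxmox multicast notes suggest (node1/node2/node3 are placeholders for your own node names; run the same command on all nodes at roughly the same time, since each instance needs a live cluster network to talk to):

```
omping -c 10000 -i 0.001 -F -q node1 node2 node3
```

If any node reports multicast loss here while unicast is fine, fix that before touching any corosync settings.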

2. If you are sure multicast traffic is working fine but still get some of these errors AND you have the default system/network MTU of 1500, consider changing your system/network MTU to 9000 (enable jumbo frames on the switch(es) first, then on your Proxmox VE nodes' interfaces in /etc/network/interfaces). If this isn't possible and you need to keep the default MTU of 1500, edit the /etc/pve/cluster.conf file (please read the instructions at http://pve.proxmox.com/wiki/Fencing#General_HowTo_for_editing_the_cluster.conf first and don't forget to increase config_version!) and add the section:

Code:
<totem netmtu="1480"/>

For example:

Code:
<?xml version="1.0"?>
<cluster name="clustername" config_version="2">
  <totem netmtu="1480"/>
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>
 
  <clusternodes>
  <clusternode name="node1" votes="1" nodeid="1"/>
  <clusternode name="node2" votes="1" nodeid="2"/>
  <clusternode name="node3" votes="1" nodeid="3"/>
  </clusternodes>
 
</cluster>

If you have already changed your MTU to 9000, you don't need to set netmtu in your cluster.conf: it will stay at the corosync default of 1500, which is fine in that case.
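For illustration, the jumbo-frames part of /etc/network/interfaces could look roughly like this (a sketch only; eth0, vmbr0 and the addresses are placeholder names, and the switch ports must be configured for jumbo frames first):

```
auto eth0
iface eth0 inet manual
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        mtu 9000
```

Note that both the physical interface and the bridge need the larger MTU, otherwise frames are still fragmented or dropped somewhere on the path.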

3. If you are sure multicast traffic is working fine, have an MTU of 9000 (or changed the netmtu to 1480 in cluster.conf, see above), but still get some of these errors AND some nodes in your cluster are noticeably slower than the others, you may consider changing corosync's window size. The default is 50. Never go higher than 256000 / netmtu. So with a netmtu of 1500, your maximum window size is 170 (256000 / 1500 = 170). This should be safe (however, I didn't test this!) with a system MTU of 9000 as well, because the corosync netmtu stays at its default of 1500. If you also increase the corosync netmtu (for example to 8980), your maximum window size drops to 28 (which is lower than the default value!). I don't recommend that and have not yet seen a configuration where it was needed: just leave netmtu at the default when using a system MTU of 9000, so your window size can stay at the default (or a bit higher) as well. In general, be very conservative with changing the window size; the default of 50 is a safe value in most cases. However, if needed you can change it as follows:
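As a quick sanity check, the 256000 / netmtu rule above can be computed directly (a minimal sketch using plain shell arithmetic):

```shell
#!/bin/sh
# Rule of thumb from above: window_size must not exceed 256000 / netmtu.
# Integer division rounds down, which is the safe direction here.

netmtu=1500
max_window=$((256000 / netmtu))
echo "netmtu=${netmtu} -> max window_size=${max_window}"   # 170

netmtu=8980
max_window=$((256000 / netmtu))
echo "netmtu=${netmtu} -> max window_size=${max_window}"   # 28
```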

Edit the /etc/pve/cluster.conf file (please read the instructions at http://pve.proxmox.com/wiki/Fencing#General_HowTo_for_editing_the_cluster.conf first and don't forget to increase config_version!) and add the section:

Code:
<totem window_size="170"/>

For example:

Code:
<?xml version="1.0"?>
<cluster name="clustername" config_version="2">
  <totem window_size="170"/>
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>
 
  <clusternodes>
  <clusternode name="node1" votes="1" nodeid="1"/>
  <clusternode name="node2" votes="1" nodeid="2"/>
  <clusternode name="node3" votes="1" nodeid="3"/>
  </clusternodes>
 
</cluster>

4. If the steps above didn't help, I suggest you check your network drivers and hardware (is there a node with a very high load, a configuration error on one or more switch ports, one or more bad network cables/cards, etc.?). In most of these cases, however, you will also see multicast traffic being (partially) dropped when testing with omping.

I hope the above information will help someone.
 
Thanks for sharing this information!

I only want to add that this is intended for Proxmox VE 3.4; our new 4.0 release has the same possibilities, but /etc/pve/cluster.conf is replaced by /etc/pve/corosync.conf. You cannot use the same format (e.g. copy-paste), but see
# man corosync.conf

Note that you must also increase config_version, and (re)saving the file brings the changes into effect on the whole cluster instantly (where possible; bindnetaddr changes, for example, would need a whole-cluster restart).
So you have to be really cautious and know what you are doing! Also copy the file first, make the changes, then rename it, like:
Code:
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new
# check several times that everything is correct, and test the changes in a test environment
# tinkering with corosync.conf in a production environment is usually bad practice and should be avoided
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
 
Thomas, thank you for the additional information regarding Proxmox VE 4.0. One more thing about totem retransmit errors: recently I saw a setup with totem retransmit errors that, at first, I could not explain. After some more searching I saw the Proxmox VE hosts used bonding in balance-tlb mode. After changing the bond mode to active-backup, the issues were resolved. Maybe this information is also interesting for someone seeing this kind of errors while using bond modes other than active-backup.
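For illustration, an active-backup bond in /etc/network/interfaces might look like this (a hedged sketch; eth0/eth1/bond0 are placeholder names and the options shown are the common bonding-driver ones, not a verified copy of that setup):

```
auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-mode active-backup
        bond-miimon 100
        bond-primary eth0
```

The point of active-backup here is that only one slave transmits at a time, so multicast frames are not spread across links the way balance-tlb does.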

And one more question for you (or a colleague): if multicast traffic is completely dropped (i.e. when a switch/IGMP querier crashes), how long may multicast traffic be lost before quorum is lost? I'm asking because I use 2 switches, both with the IGMP querier enabled. When both are operational, switch 1 has the lowest IP and is therefore master ('Querier-mode enabled'); switch 2 goes into 'Non-Querier mode' and sees switch 1 as the active querier. When switch 1 is gone, switch 2 can take up to 60 seconds before taking over 'Querier-mode enabled'; unfortunately I cannot speed this up (the minimum value in the switch config is 60 secs). After switch 2 is in 'Querier-mode enabled' it can also take a few seconds before all Proxmox VE hosts have working multicast traffic again (see the new IGMP querier) and all hosts see each other. So I would like it to take at least 90 secs before quorum is lost. Can you tell me what the default value is before quorum is lost, and how I can change it?

Thank you,

Wouter
 

On Cisco switches, the default IGMP group membership timeout is 240s.

For the querier failover time, I'm not sure; maybe it's related to the IGMP querier interval.
 
I don't mean the switch values/settings, I know those. I mean the time that corosync on the Proxmox VE hosts may go without multicast traffic before quorum is lost.
 
After a switch crash, an unplugged network cable, or some other event that drops all (multicast) traffic, corosync recognizes this very fast, in a matter of milliseconds. So you will lose quorum for sure, which results in a read-only mounted /etc/pve (our cluster filesystem), but the running VMs will stay running. You cannot start new VMs/CTs or run general Proxmox VE jobs during those 60 seconds of lost quorum, but after that everything should work again as expected. Which means that even though you are not fully functional during a switch crash, you normally do not lose services (= VMs, CTs).
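To observe the quorum state on a node while this happens, the usual commands are (output details vary per version, and both require a running cluster stack):

```
pvecm status            # Proxmox VE cluster and quorum overview
corosync-quorumtool -s  # quorum state as reported by corosync itself
```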
 
Hi Thomas,

Thank you for your answer. I know corosync will recognize a drop-out of multicast very quickly and log retransmit errors almost instantly. However, this does not affect running VMs at that time, because they keep running, as you said. But there is a maximum amount of time for this, right?

For example, yesterday I reloaded the switch that is the primary IGMP querier. My mistake was that the switches in this setup had a Querier Expiry Interval (the maximum time to wait before the other switch takes over querier mode) of 125 secs (I've already changed it to 60 secs). During the reload (+/- 3 mins) the querier role was taken over after almost 2 mins, so no multicast traffic was possible for almost 2 mins. There was no problem for the first 90 secs (only retransmit errors in the log file). After 90 secs there was a CLM configuration change (each node only sees itself = quorum lost). After almost 2 mins (when the secondary querier was operating), node1 and node3 made a connection again (another CLM configuration change) and decided to fence node2 (node2 took a bit longer to reconnect with node1 and node3). If there had been only retransmit errors, there would have been no problem (because quorum is not lost and thus no fencing occurs). There is only a problem when a CLM configuration change occurs (and not all nodes reconnect with each other at the same time once they are back online/multicast is working again); then nodes will be fenced as soon as there is a majority among the other nodes.

So, to prevent this, we need to make sure the second querier is operating BEFORE a CLM configuration change occurs. Looking at the logs right now, I think I have found the answer myself (the timeout is 90 secs), but can you confirm? And can you tell me if it is possible to change this value to (for example) 120 secs?

============================
NODE1
============================

Dec 06 12:58:02 corosync [TOTEM ] Retransmit List: 114d3df 114d3e0 114d3e1 114d3e2 114d3e3
.....
Dec 06 12:59:32 corosync [CLM ] CLM CONFIGURATION CHANGE

============================
NODE2
============================

Dec 06 12:58:02 corosync [TOTEM ] Retransmit List: 114d3df 114d3e0 114d3e1 114d3e2 114d3e3
.....
Dec 06 12:59:32 corosync [TOTEM ] FAILED TO RECEIVE

============================
NODE3
============================

Dec 06 12:58:05 corosync [TOTEM ] Retransmit List: 114d3ee 114d3ef 114d3f0 114d3f1 114d3f2
.....
Dec 06 12:59:32 corosync [CLM ] CLM CONFIGURATION CHANGE

============================
 
During the reload (+/- 3 mins) the querier role was taken over after almost 2 mins, so no multicast traffic was possible for almost 2 mins.

I thought membership should not change as long as there is no querier? Maybe the second switch was off before, so it does not know the memberships?
 
No, the second switch was not offline before (158 days of uptime). During the reload of the primary switch/querier I was logged in on the secondary switch and saw that it stayed in "Non-Querier mode" for almost 2 mins. I don't think that's strange, because the max. takeover time was 125 secs, so the secondary switch only checks every 125 secs whether it needs to switch to "Querier mode". When the switch is in "Non-Querier" mode it doesn't keep track of memberships; it only sees there is another querier with a lower IP and therefore goes back to sleep and checks again after the timeout (every 125 seconds yesterday, every 60 secs now).

Based on the log files provided, I assume corosync waits 90 secs after multicast traffic is dropped before a CLM configuration change occurs. So there should be no problem anymore now, but I would like confirmation that corosync waits 90 secs before a CLM configuration change occurs when multicast traffic is completely dropped, and I would also like to know how to change these 90 secs to e.g. 120 secs.
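Not a confirmed answer, but a pointer for anyone digging further: in corosync 2 the membership-related timeouts are totem parameters documented in man corosync.conf (e.g. token, consensus and fail_recv_const; the latter appears related to the "FAILED TO RECEIVE" message in the node2 log above). In the PVE 3.x cluster.conf they would be set as attributes on the totem element; the values below are purely illustrative, not a recommendation:

```
<totem netmtu="1480" token="10000" fail_recv_const="2500"/>
```

As always with cluster.conf changes: increase config_version and test in a lab first.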
 
Each node is connected with 1x 10 Gbit (Twinax) to each switch (so 2x 10 Gbit in total per node), configured via bond mode active-backup. The switches are connected to each other with 4x 10 Gbit LACP. The switches are Dell N4032Fs.

//Edit:

Some switch status information:

===============================================================
Primary Switch
===============================================================
sw01#show ip igmp snooping querier detail

Last Querier
VLAN ID Address IGMP Version
------- ---------------- ------------

Global IGMP Snooping querier status
-----------------------------------
IGMP Snooping Querier Mode..................... Enable
Querier Address................................ 0.0.0.0
IGMP Version................................... 2
Querier Query Interval......................... 60
Querier Expiry Interval........................ 60

Vlan 10 : IGMP Snooping querier status
----------------------------------------------
IGMP Snooping Querier Vlan Mode................ Enable
Querier Election Participate Mode.............. Enable
Querier Vlan Address........................... 192.168.110.5
Operational State.............................. Querier
Operational version............................ 2
Operational Max Resp Time...................... 10

===============================================================
Secondary Switch
===============================================================
sw02#show ip igmp snooping querier detail

Last Querier
VLAN ID Address IGMP Version
------- ---------------- ------------
10 192.168.110.5 v2

Global IGMP Snooping querier status
-----------------------------------
IGMP Snooping Querier Mode..................... Enable
Querier Address................................ 0.0.0.0
IGMP Version................................... 2
Querier Query Interval......................... 60
Querier Expiry Interval........................ 60

Vlan 10 : IGMP Snooping querier status
----------------------------------------------
IGMP Snooping Querier Vlan Mode................ Enable
Querier Election Participate Mode.............. Enable
Querier Vlan Address........................... 192.168.110.6
Operational State.............................. Non-Querier
Last Querier Address........................... 192.168.110.5
Operational version............................ 2
Operational Max Resp Time...................... 11
===============================================================
 
Hmm, but do you think I have a problem right now, with the second switch becoming querier within a max. of 60 seconds? As far as I can see I don't, because it seems it's no problem if corosync loses multicast traffic for up to 90 secs. That's the only thing I would like to be sure of, and if so, how these 90 secs can be changed if wanted (which I don't plan to do with the current setup, because I think it's fine now with the 60 secs takeover time).
 
Thank you all for the helpful tips.
I believe multicast is gone in Proxmox 6; however, we have a 15-node cluster running on Proxmox 6 and corosync 3, and we are seeing these "corosync [TOTEM ] Retransmit List: XXXX" errors all over our nodes.
Any idea what it could be?
Each node is connected to the main switches (Cisco Nexus 5500) with 2 x 10Gb in LACP mode.
 
Each node is connected to the main switches (Cisco Nexus 5500) with 2 x 10Gb in LACP mode.

So all of the traffic runs over just that switch pair?
Maybe even Ceph?

While the bandwidth may well be sufficient and not saturated, this is still an issue for corosync.
Corosync doesn't need much bandwidth; you can get a long way with a 100 Mbps network. But it's highly sensitive to latency, so other traffic can disrupt it quickly. I guess you don't have those 10 Gbps links just for "fun" but actually use a good chunk of them, so this is your issue. Use a separate network for corosync where it has guaranteed sending time, and those retransmits should go away.

We know of a big Proxmox VE 6 setup in India with 52 nodes in a single cluster, so it can work with many nodes just fine as long as one ensures corosync can send (network- and CPU-wise).
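A hedged sketch of what a dedicated corosync link can look like in the nodelist of /etc/pve/corosync.conf on PVE 6 (the 10.10.10.x addresses and node names are placeholders; in practice you pick the dedicated network when creating the cluster, or edit ring0_addr per node as described in the PVE docs):

```
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # address on the dedicated corosync-only network
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
}
```

The key point is that the ring0_addr network carries nothing but corosync traffic, so storage or VM traffic cannot add latency to the totem protocol.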
 
And yes, as this is quite an old thread talking about corosync 2, which was replaced with quite a different underlying communication stack in PVE 6, opening a new thread would be better next time :)
 
Thanks for your response.
We have 2 x 100Gb NICs for Ceph with a separate pair of switches.
The thing is, we cannot find any bottleneck in the entire setup: no high CPU load on the switches, no very high traffic on the ports.
So it's really a mystery to me.

And yes, I know it's an old thread, but I thought that since it refers to the same issue, it might be easier to track the issue and continue the conversation here.
 
