Corosync Totem Re-transmission Issues

Hi ALL,

I don't know if it can help, but I have had a lot of problems recently with multicast because of the Linux bridge multicast_snooping feature.
(So if the Proxmox host IP used for corosync is on vmbr0, it's possible that you have the problem too.)

Can you try

"echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping"
?

Another possibility is to put the IPs on the physical interfaces instead of the vmbr bridges.
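
For reference, a sketch of both options, assuming vmbr0 carries the corosync IP and eth0 is the underlying port (the addresses shown are only examples, not anyone's real config):

Code:
# Option 1: disable multicast snooping on the bridge immediately
echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

# ...and make it persistent with a post-up hook in /etc/network/interfaces:
#   iface vmbr0 inet static
#           address 192.168.0.2
#           netmask 255.255.255.0
#           bridge_ports eth0
#           bridge_stp off
#           bridge_fd 0
#           post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

# Option 2: put the cluster IP directly on the physical NIC instead of the bridge:
#   iface eth0 inet static
#           address 192.168.0.2
#           netmask 255.255.255.0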
 

Appreciate the input. I did try moving the DRBD/cluster communication to the actual eth# device instead of vmbr#, with no luck. I also tried your suggestion, again with no luck:

Code:
root@proxmox2:/var/log/cluster# echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
root@proxmox2:/var/log/cluster# asmping 224.0.2.1 10.211.47.1
asmping joined (S,G) = (*,224.0.2.234)
pinging 10.211.47.1 from 10.211.47.2
unicast from 10.211.47.1, seq=1 dist=0 time=0.135 ms
unicast from 10.211.47.1, seq=2 dist=0 time=0.244 ms
unicast from 10.211.47.1, seq=3 dist=0 time=0.207 ms
unicast from 10.211.47.1, seq=4 dist=0 time=0.225 ms
unicast from 10.211.47.1, seq=5 dist=0 time=0.223 ms
unicast from 10.211.47.1, seq=6 dist=0 time=0.239 ms
unicast from 10.211.47.1, seq=7 dist=0 time=0.222 ms
unicast from 10.211.47.1, seq=8 dist=0 time=0.182 ms
unicast from 10.211.47.1, seq=9 dist=0 time=0.090 ms
unicast from 10.211.47.1, seq=10 dist=0 time=0.204 ms
unicast from 10.211.47.1, seq=11 dist=0 time=0.226 ms
unicast from 10.211.47.1, seq=12 dist=0 time=0.176 ms
unicast from 10.211.47.1, seq=13 dist=0 time=0.263 ms
unicast from 10.211.47.1, seq=14 dist=0 time=0.220 ms
 
What is your physical network switch model? Are you sure it supports multicast?
Also, do you use any iptables rules?

That is what makes things really odd. This is a dedicated 10Gb back-end with absolutely no switch involved.

I also flushed all IP rules the other day.
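
In case it helps anyone reading along, the check I did was roughly the following (flushing rules obviously leaves the host wide open, so only do this while testing):

Code:
# List all rules with packet counters; drops of multicast traffic would show up here
iptables -L -n -v

# Temporarily flush everything and default to ACCEPT while testing
iptables -F
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT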
 
We have recently started experiencing this issue on PVE 3.0. Our cluster had been working well for around 6 months, but since upgrading to 3.0 this issue has been affecting us.

I have a very basic setup with 2 nodes. Upon boot the cluster will work for several minutes; I will then see the logs begin to flood with [TOTEM ] Retransmit List messages, and the cluster will eventually fall over with:

Code:
Jul  3 10:46:08 vhbtgmar04 corosync[5942]:   [TOTEM ] FAILED TO RECEIVE
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] CLM CONFIGURATION CHANGE
Jul  3 10:46:10 vhbtgmar04 pmxcfs[6145]: [status] notice: node lost quorum
Jul  3 10:46:10 vhbtgmar04 pmxcfs[6145]: [dcdb] notice: members: 1/6145
Jul  3 10:46:10 vhbtgmar04 kernel: dlm: closing connection to node 2
Jul  3 10:46:10 vhbtgmar04 pmxcfs[6145]: [dcdb] notice: members: 1/6145
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] New Configuration:
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] #011r(0) ip(10.10.11.12) 
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] Members Left:
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] #011r(0) ip(10.10.11.11) 
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] Members Joined:
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CMAN  ] quorum lost, blocking activity
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [QUORUM] Members[1]: 1
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] CLM CONFIGURATION CHANGE
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] New Configuration:
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] #011r(0) ip(10.10.11.12) 
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] Members Left:
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CLM   ] Members Joined:
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.11.12) ; members(old:2 left:1)
Jul  3 10:46:10 vhbtgmar04 corosync[5942]:   [MAIN  ] Completed service synchronization, ready to provide service.

If I restart cman and pve-cluster it will come back to life, but then fail again after the same time period.
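
For anyone else hitting this, the restart I'm referring to is just the standard init scripts on a stock PVE 3.0 install:

Code:
# Restart the cluster stack on the affected node
service cman restart
service pve-cluster restart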

I initially thought the issue was related to my bonding configuration, so I put the PVE control/cluster traffic on its own dedicated NIC (eth0), which made no difference. I then reloaded both nodes fresh from the 3.0 installer, as my cluster had previously been upgraded all the way from 2.1 to 3.0; this also made no difference.
I also tried reducing the totem netmtu, which did not resolve the issue (I verified the MTU was correct with tcpdump).
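
For completeness, this is roughly what the netmtu test looked like; the value and interface name below are illustrative rather than my exact settings, and /etc/pve/cluster.conf needs its config_version bumped whenever it is edited:

Code:
# In /etc/pve/cluster.conf, add a totem element with a lower MTU (example value):
#   <totem netmtu="1340"/>

# Then watch the corosync multicast traffic on the cluster NIC to confirm packet sizes:
tcpdump -ni eth0 host 239.192.214.162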

Cluster status from NODE1:

Code:
Version: 6.2.0
Config Version: 4
Cluster Name: pve-marua
Cluster Id: 54987
Cluster Member: Yes
Cluster Generation: 2092
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: vhbtgmar04
Node ID: 1
Multicast addresses: 239.192.214.162 
Node addresses: 10.10.11.12



Cluster status from NODE2:

Code:
Version: 6.2.0
Config Version: 4
Cluster Name: pve-marua
Cluster Id: 54987
Cluster Member: Yes
Cluster Generation: 2092
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: vhbtgmar05
Node ID: 2
Multicast addresses: 239.192.214.162 
Node addresses: 10.10.11.11

The machines are Xeon 54xx series with Intel e1000e NICs, connected via an Extreme Networks X460.

Any ideas on the cause of this issue? What can we do to try and get it resolved?
 

Can you try

"echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping"

on each host?

(This disables multicast snooping/filtering, which can sometimes block multicast traffic outright.)
 

Hi Spirit, in our case we have the management IPs directly on eth0, so there is no bridge involved. We do have IGMP snooping enabled on the Extreme switch, but that has not changed recently, so I don't think it could suddenly cause this issue.

As a side note, I changed the transport to unicast and the cluster is now stable again.
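
For anyone wanting to try the same thing, the unicast change amounts to adding a transport attribute to the cman element in /etc/pve/cluster.conf; this is a sketch of the usual approach rather than my exact file (bump config_version after editing):

Code:
# /etc/pve/cluster.conf: tell cman/corosync to use unicast (UDPU) instead of multicast
#   <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu"> ... </cman>

# then restart the cluster stack on each node:
service cman restart
service pve-cluster restart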


Have the Proxmox devs recently changed anything to do with multicast/corosync?
 
This seems to have solved my problem: disabling multicast_snooping on vmbr0, as suggested earlier in the thread.


 
We are still having issues with this. Ever since 3.0 came out, PVE clustering has been troublesome.

We have confirmed that we can transmit multicast between the hosts using asmping/ssmping, but we still cannot get quorum between the hosts.

What was changed with regard to clustering between 2.3 and 3.0?
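
For reference, this is roughly how we tested multicast between the nodes (the group address below is just an example; the node IPs match the cluster status above):

Code:
# On the first node (10.10.11.11), run the ssmping responder:
ssmpingd

# From the second node, test any-source multicast towards the first node:
asmping 224.0.2.1 10.10.11.11

# ...and source-specific multicast:
ssmping 10.10.11.11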
 
