We have recently started experiencing this issue on PVE 3.0. Our cluster had been working well for around 6 months, but since upgrading to 3.0 this problem has been hitting us.
I have a very basic setup with 2 nodes. Upon boot the cluster works for several minutes, then the logs begin to flood with [TOTEM ] Retransmit List messages and the cluster eventually falls over with:
Code:
Jul 3 10:46:08 vhbtgmar04 corosync[5942]: [TOTEM ] FAILED TO RECEIVE
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] CLM CONFIGURATION CHANGE
Jul 3 10:46:10 vhbtgmar04 pmxcfs[6145]: [status] notice: node lost quorum
Jul 3 10:46:10 vhbtgmar04 pmxcfs[6145]: [dcdb] notice: members: 1/6145
Jul 3 10:46:10 vhbtgmar04 kernel: dlm: closing connection to node 2
Jul 3 10:46:10 vhbtgmar04 pmxcfs[6145]: [dcdb] notice: members: 1/6145
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] New Configuration:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] #011r(0) ip(10.10.11.12)
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] Members Left:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] #011r(0) ip(10.10.11.11)
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] Members Joined:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CMAN ] quorum lost, blocking activity
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [QUORUM] Members[1]: 1
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] CLM CONFIGURATION CHANGE
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] New Configuration:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] #011r(0) ip(10.10.11.12)
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] Members Left:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] Members Joined:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CPG ] chosen downlist: sender r(0) ip(10.10.11.12) ; members(old:2 left:1)
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [MAIN ] Completed service synchronization, ready to provide service.
If I restart cman and pve-cluster the node comes back to life, but then fails again after roughly the same time period.
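The restart that brings it back is roughly the following, run on the node that has lost quorum (pvecm status is just to confirm quorum has returned):
Code:
# restart the cluster stack on the affected node (PVE 3.0 init scripts)
service cman restart
service pve-cluster restart
# confirm quorum has returned
pvecm status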
I initially thought the issue was related to my bonding configuration, so I put the PVE control/cluster traffic on its own dedicated NIC (eth0), which made no difference. I then reloaded both nodes fresh from the 3.0 installer, as my cluster had previously been upgraded all the way from 2.1 to 3.0; this also made no difference.
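For clarity, the cluster traffic now sits directly on eth0, roughly like this excerpt from /etc/network/interfaces on node 1 (the /24 netmask is an assumption for the sketch; the rest of the traffic stays on the existing bond):
Code:
# /etc/network/interfaces excerpt on vhbtgmar04 (node 1)
# dedicated corosync/cluster interface - netmask assumed, adjust to match
auto eth0
iface eth0 inet static
        address 10.10.11.12
        netmask 255.255.255.0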
I also tried reducing the TOTEM netmtu, which did not resolve the issue (I verified the MTU on the wire was correct with tcpdump).
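The tcpdump check was along these lines, watching the corosync multicast group reported by cman_tool status below (eth0 is the dedicated cluster NIC):
Code:
# watch corosync multicast traffic on the cluster NIC; the UDP length shown in
# each summary line should not exceed the configured netmtu
tcpdump -ni eth0 udp and host 239.192.214.162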
Versions from NODE1:
Code:
Version: 6.2.0
Config Version: 4
Cluster Name: pve-marua
Cluster Id: 54987
Cluster Member: Yes
Cluster Generation: 2092
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: vhbtgmar04
Node ID: 1
Multicast addresses: 239.192.214.162
Node addresses: 10.10.11.12
Versions from NODE2:
Code:
Version: 6.2.0
Config Version: 4
Cluster Name: pve-marua
Cluster Id: 54987
Cluster Member: Yes
Cluster Generation: 2092
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: vhbtgmar05
Node ID: 2
Multicast addresses: 239.192.214.162
Node addresses: 10.10.11.11
The machines are Xeon 54xx series with Intel e1000e NICs, connected via an Extreme Networks x460 switch.
Any ideas on the cause of this issue? What can we do to try and get this resolved?