[SOLVED] corosync lock and TOTEM Retransmit after upgrade 4.4 to 5.4

bash99

New Member
Nov 2, 2016
We have a mixed 4.4/5.3 cluster. It is our production system, so the upgrade is going very slowly.

Recently we have had a lot of problems when upgrading a node to 5.4 or adding a freshly installed 5.4 box to the cluster. Sometimes the whole cluster locks up and corosync uses 100% CPU. Only separating the new box from the cluster restores it.

The logs show entries like:
"pvesr failed"

"corosync[4216]: notice [TOTEM ] Retransmit List 59 5a 101 102 ad ae 83 84 d7 d8"

After testing with omping, we found that multicast fails on the newly upgraded box.
We checked the setup against https://pve.proxmox.com/wiki/Multicast_notes

IGMP snooping is on in the switch, but it is also on in the newly upgraded box.
The IGMP querier is off on the newly upgraded box, whereas according to the Multicast_notes document it should be on.
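
For anyone who wants to check the same two settings on their own node, here is a minimal sketch that reads them from sysfs. The function name `bridge_mcast_status` and the bridge name `vmbr0` are assumptions for illustration; the optional second argument only exists so the function can be pointed at a different sysfs root.

```shell
#!/bin/sh
# Print the IGMP snooping and querier settings of a Linux bridge.
# $1: bridge name (vmbr0 below is only an assumed example)
# $2: sysfs root, defaults to /sys
bridge_mcast_status() {
    bridge="$1"
    sysfs="${2:-/sys}"
    dir="$sysfs/class/net/$bridge/bridge"
    # Only bridge devices have a "bridge" subdirectory in sysfs.
    if [ ! -d "$dir" ]; then
        echo "$bridge: not a bridge (no $dir)" >&2
        return 1
    fi
    for knob in multicast_snooping multicast_querier; do
        printf '%s %s=%s\n' "$bridge" "$knob" "$(cat "$dir/$knob")"
    done
}

# Example call (hypothetical bridge name):
#   bridge_mcast_status vmbr0
```

A value of 1 means the feature is enabled and 0 means it is disabled, so the situation described above would show up as snooping=1 and querier=0.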

So did something change between 4.4 and 5.4?

We added
post-up ( echo 1 > /sys/devices/virtual/net/$IFACE/bridge/multicast_querier )
post-up ( echo 0 > /sys/class/net/$IFACE/bridge/multicast_snooping )
to the bridge definition in the network settings and restarted networking.
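
As a rough sketch of where those two lines go, a bridge stanza in /etc/network/interfaces might look like the following. The bridge name `vmbr0`, the port `eno1` and the address are hypothetical placeholders, not the actual values from this cluster; only the two post-up lines come from the post.

```
# Hypothetical example stanza; adjust names and addresses to your node.
# The post-up lines enable the IGMP querier and disable IGMP snooping on the bridge.
auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10
    netmask 255.255.255.0
    bridge_ports eno1
    bridge_stp off
    bridge_fd 0
    post-up ( echo 1 > /sys/devices/virtual/net/$IFACE/bridge/multicast_querier )
    post-up ( echo 0 > /sys/class/net/$IFACE/bridge/multicast_snooping )
```

ifupdown sets `$IFACE` to the name of the interface being brought up, so the same two lines work unchanged for any bridge.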

Now everything works normally.
 
You should enable the multicast querier on your physical switch instead.
I have seen problems in the past when rebooting the node that was the multicast querier: multicast broke on the switch because the querier on the other nodes did not take over.