I have a three-node cluster that suddenly started losing quorum. Restarting some services (pve-cluster, cman, etc.) helped for only 3-5 minutes, after which the cluster lost quorum again.
After checking multicast traffic with omping [1] I saw that multicast worked fine at high packet rates:
Code:
root@prox:~/scripts# omping -c 10000 -i 0.001 -F -q 192.168.1.1 192.168.1.2 192.168.1.3
192.168.1.2 : waiting for response msg
192.168.1.3 : waiting for response msg
192.168.1.3 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.1.2 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.1.2 : waiting for response msg
192.168.1.2 : server told us to stop
192.168.1.3 : given amount of query messages was sent
192.168.1.2 : unicast, xmt/rcv/%loss = 9534/9534/0%, min/avg/max/std-dev = 0.052/0.132/1.145/0.030
192.168.1.2 : multicast, xmt/rcv/%loss = 9534/9533/0% (seq>=2 0%), min/avg/max/std-dev = 0.066/0.142/1.152/0.032
192.168.1.3 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.030/0.078/1.045/0.027
192.168.1.3 : multicast, xmt/rcv/%loss = 10000/9999/0% (seq>=2 0%), min/avg/max/std-dev = 0.036/0.084/1.054/0.027
But I was losing multicast packets in a longer test (600 packets at one-second intervals):
Code:
root@prox:~# omping -c 600 -i 1 -q 192.168.1.1 192.168.1.2 192.168.1.3
192.168.1.2 : waiting for response msg
192.168.1.3 : waiting for response msg
192.168.1.3 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.1.2 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.1.2 : given amount of query messages was sent
192.168.1.3 : given amount of query messages was sent
192.168.1.2 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.064/0.163/1.017/0.045
192.168.1.2 : multicast, xmt/rcv/%loss = 600/264/56%, min/avg/max/std-dev = 0.098/0.175/0.304/0.033
192.168.1.3 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.051/0.100/0.559/0.031
192.168.1.3 : multicast, xmt/rcv/%loss = 600/264/56%, min/avg/max/std-dev = 0.058/0.113/0.566/0.041
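To compare runs without reading every summary line, the multicast counters can be pulled out with a small filter. This is only a sketch: the function name `omping_mcast_loss` is made up here, and it assumes the summary lines keep exactly the `xmt/rcv/%loss = sent/received/loss%` layout shown above.

```shell
# Hypothetical helper: read omping output on stdin and print, per host,
# how many multicast packets were sent, received and lost.
# Assumes summary lines shaped like:
#   HOST : multicast, xmt/rcv/%loss = 600/264/56%, min/avg/max/std-dev = ...
omping_mcast_loss() {
    awk '$3 == "multicast," && $4 == "xmt/rcv/%loss" {
        # Field 6 holds "sent/received/loss%"; split it on the slashes.
        split($6, f, "/")
        printf "%s sent=%d received=%d lost=%d\n", $1, f[1], f[2], f[1] - f[2]
    }'
}
```

Usage would be something like `omping -c 600 -i 1 -q 192.168.1.1 192.168.1.2 192.168.1.3 | omping_mcast_loss`.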
I tried disabling, enabling, and changing the IGMP snooping configuration on my switch, but nothing changed. omping started losing multicast packets after 3-4 minutes:
Code:
...
prox1.mydomain.com : multicast, seq=253, size=69 bytes, dist=0, time=0.190ms
prox2.mydomain.com : multicast, seq=251, size=69 bytes, dist=0, time=0.172ms
prox1.mydomain.com : unicast, seq=254, size=69 bytes, dist=0, time=0.161ms
prox2.mydomain.com : multicast, seq=252, size=69 bytes, dist=0, time=0.155ms
prox2.mydomain.com : unicast, seq=252, size=69 bytes, dist=0, time=0.153ms
prox1.mydomain.com : multicast, seq=254, size=69 bytes, dist=0, time=0.209ms
prox2.mydomain.com : unicast, seq=253, size=69 bytes, dist=0, time=0.171ms
prox2.mydomain.com : multicast, seq=253, size=69 bytes, dist=0, time=0.179ms
prox1.mydomain.com : unicast, seq=255, size=69 bytes, dist=0, time=0.262ms
prox1.mydomain.com : multicast, seq=255, size=69 bytes, dist=0, time=0.310ms
prox1.mydomain.com : unicast, seq=256, size=69 bytes, dist=0, time=0.130ms
prox2.mydomain.com : unicast, seq=254, size=69 bytes, dist=0, time=0.172ms
prox1.mydomain.com : multicast, seq=256, size=69 bytes, dist=0, time=0.178ms
prox2.mydomain.com : multicast, seq=254, size=69 bytes, dist=0, time=0.221ms
prox1.mydomain.com : unicast, seq=257, size=69 bytes, dist=0, time=0.116ms
prox2.mydomain.com : unicast, seq=255, size=69 bytes, dist=0, time=0.129ms
prox1.mydomain.com : multicast, seq=257, size=69 bytes, dist=0, time=0.164ms
prox2.mydomain.com : multicast, seq=255, size=69 bytes, dist=0, time=0.178ms
prox1.mydomain.com : unicast, seq=258, size=69 bytes, dist=0, time=0.151ms
prox2.mydomain.com : unicast, seq=256, size=69 bytes, dist=0, time=0.147ms
prox1.mydomain.com : multicast, seq=258, size=69 bytes, dist=0, time=0.199ms
prox2.mydomain.com : multicast, seq=256, size=69 bytes, dist=0, time=0.196ms
prox1.mydomain.com : unicast, seq=259, size=69 bytes, dist=0, time=0.119ms
prox2.mydomain.com : unicast, seq=257, size=69 bytes, dist=0, time=0.127ms
prox1.mydomain.com : multicast, seq=259, size=69 bytes, dist=0, time=0.167ms
prox2.mydomain.com : multicast, seq=257, size=69 bytes, dist=0, time=0.176ms ===> LAST MULTICAST PACKET
prox1.mydomain.com : unicast, seq=260, size=69 bytes, dist=0, time=0.158ms
prox2.mydomain.com : unicast, seq=258, size=69 bytes, dist=0, time=0.152ms
prox1.mydomain.com : unicast, seq=261, size=69 bytes, dist=0, time=0.129ms
prox2.mydomain.com : unicast, seq=259, size=69 bytes, dist=0, time=0.155ms
prox2.mydomain.com : unicast, seq=260, size=69 bytes, dist=0, time=0.143ms
prox1.mydomain.com : unicast, seq=262, size=69 bytes, dist=0, time=0.200ms
prox1.mydomain.com : unicast, seq=263, size=69 bytes, dist=0, time=0.121ms
prox2.mydomain.com : unicast, seq=261, size=69 bytes, dist=0, time=0.135ms
prox1.mydomain.com : unicast, seq=264, size=69 bytes, dist=0, time=0.116ms
prox2.mydomain.com : unicast, seq=262, size=69 bytes, dist=0, time=0.126ms
prox1.mydomain.com : unicast, seq=265, size=69 bytes, dist=0, time=0.133ms
prox2.mydomain.com : unicast, seq=263, size=69 bytes, dist=0, time=0.134ms
prox1.mydomain.com : unicast, seq=266, size=69 bytes, dist=0, time=0.127ms
prox2.mydomain.com : unicast, seq=264, size=69 bytes, dist=0, time=0.160ms
prox1.mydomain.com : unicast, seq=267, size=69 bytes, dist=0, time=0.125ms
prox2.mydomain.com : unicast, seq=265, size=69 bytes, dist=0, time=0.126ms
prox1.mydomain.com : unicast, seq=268, size=69 bytes, dist=0, time=0.112ms
prox2.mydomain.com : unicast, seq=266, size=69 bytes, dist=0, time=0.126ms
prox1.mydomain.com : unicast, seq=269, size=69 bytes, dist=0, time=0.137ms
prox2.mydomain.com : unicast, seq=267, size=69 bytes, dist=0, time=0.148ms
prox1.mydomain.com : unicast, seq=270, size=69 bytes, dist=0, time=0.145ms
prox2.mydomain.com : unicast, seq=268, size=69 bytes, dist=0, time=0.151ms
prox1.mydomain.com : unicast, seq=271, size=69 bytes, dist=0, time=0.145ms
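The point where multicast stops can also be found mechanically instead of by eye. The sketch below is an assumption-laden helper (the name `last_seq_per_host` is invented for illustration) that expects the verbose per-packet lines in the format shown above and prints the highest sequence number seen last for each host and traffic type.

```shell
# Hypothetical helper: read verbose omping output on stdin and print the
# last sequence number seen per host and per kind (unicast/multicast).
# Assumes per-packet lines shaped like:
#   HOST : multicast, seq=257, size=69 bytes, dist=0, time=0.176ms
last_seq_per_host() {
    awk '$4 ~ /^seq=/ {
        kind = $3; sub(/,$/, "", kind)          # "multicast," -> "multicast"
        seq = $4; sub(/^seq=/, "", seq); sub(/,$/, "", seq)
        last[$1 " " kind] = seq + 0             # later lines overwrite earlier ones
    }
    END { for (k in last) print k, last[k] }' | LC_ALL=C sort
}
```

If the multicast number for a host stays well behind its unicast number, multicast delivery from that host has stopped while unicast keeps working.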
Finally, I found the solution [2]. Enabling promiscuous mode on the bridge interface (on every node) solves the problem:
Code:
ip link set vmbr0 promisc on
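To confirm on each node that the setting is actually in effect, the interface flags can be checked. A minimal sketch, assuming the kernel exposes the flags as a hex value in `/sys/class/net/<iface>/flags` and that the promiscuous bit is `IFF_PROMISC` (0x100); the function name `is_promisc` is made up for this example.

```shell
# Hypothetical helper: succeed if the given hex flags value has the
# IFF_PROMISC bit (0x100) set.
# $1: flags value as read from /sys/class/net/<iface>/flags, e.g. 0x1303
is_promisc() {
    [ $(( $1 & 0x100 )) -ne 0 ]
}
```

For example: `is_promisc "$(cat /sys/class/net/vmbr0/flags)" && echo "vmbr0 is promiscuous"`. Note that a setting made with `ip link set` does not survive a reboot, so it has to be reapplied at boot in some way.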
Now I no longer lose any multicast packets and my cluster's quorum is stable. So, apparently, the problem wasn't the switch but Proxmox itself...
Now, the question is: why? Is there a better solution?
Thanks!!
1: https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues
2: https://forum.proxmox.com/threads/pvecm-status-activity-blocked.26910/#post-135333