Multicast Issue

yswery

I have been having issues with the clustering function on Proxmox for a while now, and I hope someone can help.

I seem to be getting packet loss for whatever reason, and I have no idea how to better diagnose this on our Proxmox 5.3 system.

As you can see in the omping output below, there is loss (it happens every time).
Note: 10.50.0.10 is the node's IP address (the same node we are testing from).

Code:
root@matterhorn:~# omping 10.50.0.10 -c 20
10.50.0.10 : waiting for response msg
10.50.0.10 : joined (S,G) = (*, 232.43.211.234), pinging
10.50.0.10 :   unicast, seq=1, size=69 bytes, dist=0, time=0.026ms
10.50.0.10 : multicast, seq=1, size=69 bytes, dist=0, time=0.038ms
10.50.0.10 :   unicast, seq=2, size=69 bytes, dist=0, time=0.074ms
10.50.0.10 : multicast, seq=2, size=69 bytes, dist=0, time=0.089ms
...
10.50.0.10 :   unicast, seq=8, size=69 bytes, dist=0, time=0.076ms
10.50.0.10 : multicast, seq=8, size=69 bytes, dist=0, time=0.092ms
10.50.0.10 :   unicast, seq=9, size=69 bytes, dist=0, time=0.067ms
10.50.0.10 : multicast, seq=9, size=69 bytes, dist=0, time=0.080ms
10.50.0.10 : multicast, seq=10, size=69 bytes, dist=0, time=0.095ms
10.50.0.10 : given amount of query messages was sent

10.50.0.10 :   unicast, xmt/rcv/%loss = 20/9/55%, min/avg/max/std-dev = 0.026/0.079/0.135/0.028
10.50.0.10 : multicast, xmt/rcv/%loss = 20/10/50%, min/avg/max/std-dev = 0.038/0.094/0.149/0.027

As you can see, at the 10th packet it just stops working. I cannot seem to find much in my syslogs about this either, and it is a real problem because I cannot use clustering because of it.
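(For reference, this is roughly how I have been watching the logs while omping runs; the grep pattern is just a guess at what to look for.)

Code:
# follow the kernel log in one terminal while omping runs in another
journalctl -kf
# afterwards, search syslog for anything IGMP/multicast related
grep -iE 'igmp|multicast' /var/log/syslog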

System Info:
Code:
root@matterhorn:~# uname -r
4.15.18-8-pve

Code:
root@matterhorn:~# pveversion
pve-manager/5.3-5/97ae681d (running kernel: 4.15.18-8-pve)

Code:
root@matterhorn:~# pct list| wc -l
285
root@matterhorn:~# qm list|wc -l
11

As you can see above, we are also running a lot of containers on this one machine.
 
Hi,

I would say you have a problem in your network; maybe the multicast querier is not working correctly.
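You can check what the bridge on the node currently does, for example like this (assuming vmbr0 is the bridge used for the cluster network):

Code:
# 1 = IGMP snooping enabled on the Linux bridge, 0 = disabled
cat /sys/class/net/vmbr0/bridge/multicast_snooping
# 1 = the bridge itself acts as IGMP querier
cat /sys/class/net/vmbr0/bridge/multicast_querier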
 
@wolfgang
Thanks for the reply. I tried disabling IGMP snooping on the switch and also multicast_snooping on the Proxmox node (what I ran is shown below), but it is still the same issue. Could it by any chance be related to the number of containers running on the machine?

Is there anywhere I can look to better diagnose what is at fault here?
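For reference, this is roughly what I ran on the node (plus the equivalent IGMP snooping setting in the switch UI):

Code:
# disable IGMP snooping on the bridge at runtime
echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping
# made persistent in /etc/network/interfaces under the vmbr0 stanza:
#   post-up echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping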
 
10.50.0.10 :   unicast, xmt/rcv/%loss = 20/9/55%, min/avg/max/std-dev = 0.026/0.079/0.135/0.028
10.50.0.10 : multicast, xmt/rcv/%loss = 20/10/50%, min/avg/max/std-dev = 0.038/0.094/0.149/0.027

You also have packet loss for unicast packets, so it seems the problem is elsewhere: maybe an IP is being used twice, a broken cable, or a broken switch?
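To rule out a duplicate IP you could, for example, arping the address from another host in the same subnet and check whether more than one MAC address answers (assuming iputils arping is installed there; eth0 stands for whatever interface that host uses):

Code:
# run on a different machine in the same subnet; replies from two different MACs = duplicate IP
arping -I eth0 -c 5 10.50.0.10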
 
You also have packet loss for unicast packets, so it seems the problem is elsewhere: maybe an IP is being used twice, a broken cable, or a broken switch?

Yep, that's what I thought at first too, but I looked through the network and the setup and couldn't find anything.

Also, it seems that (like in the omping output I showed above) it *ALWAYS* fails at the 10th packet and for every packet after it. If I cancel the process and start a new omping, it once again starts fresh and fails at the 10th packet. I am starting to think it is a "setting" or "limit" somewhere inside the Proxmox machine (for example a default sysctl value that I am maxing out due to running so many processes/containers?).
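These are the kind of kernel defaults I had in mind; just a sketch of what I would check first, not sure they are the right knobs:

Code:
# IGMP / UDP related limits that might plausibly matter with hundreds of containers
sysctl net.ipv4.igmp_max_memberships
sysctl net.ipv4.igmp_max_msf
sysctl net.core.rmem_default net.core.rmem_max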
 
Also, for extra verification, I added a new IP address to the vmbr0 interface of the Proxmox node, a random IP address that I checked beforehand was not used anywhere, and I got the exact same results:

Code:
# omping  10.50.0.123 -c 20
10.50.0.123 : waiting for response msg
10.50.0.123 : joined (S,G) = (*, 232.43.211.234), pinging
10.50.0.123 :   unicast, seq=1, size=69 bytes, dist=0, time=0.021ms
10.50.0.123 : multicast, seq=1, size=69 bytes, dist=0, time=0.029ms
10.50.0.123 :   unicast, seq=2, size=69 bytes, dist=0, time=0.091ms
10.50.0.123 : multicast, seq=2, size=69 bytes, dist=0, time=0.105ms
10.50.0.123 :   unicast, seq=3, size=69 bytes, dist=0, time=0.073ms
10.50.0.123 : multicast, seq=3, size=69 bytes, dist=0, time=0.087ms
10.50.0.123 :   unicast, seq=4, size=69 bytes, dist=0, time=0.071ms
10.50.0.123 : multicast, seq=4, size=69 bytes, dist=0, time=0.085ms
10.50.0.123 :   unicast, seq=5, size=69 bytes, dist=0, time=0.078ms
10.50.0.123 : multicast, seq=5, size=69 bytes, dist=0, time=0.092ms
10.50.0.123 :   unicast, seq=6, size=69 bytes, dist=0, time=0.078ms
10.50.0.123 : multicast, seq=6, size=69 bytes, dist=0, time=0.091ms
10.50.0.123 :   unicast, seq=7, size=69 bytes, dist=0, time=0.077ms
10.50.0.123 : multicast, seq=7, size=69 bytes, dist=0, time=0.091ms
10.50.0.123 :   unicast, seq=8, size=69 bytes, dist=0, time=0.076ms
10.50.0.123 : multicast, seq=8, size=69 bytes, dist=0, time=0.090ms
10.50.0.123 :   unicast, seq=9, size=69 bytes, dist=0, time=0.072ms
10.50.0.123 : multicast, seq=9, size=69 bytes, dist=0, time=0.085ms
10.50.0.123 : multicast, seq=10, size=69 bytes, dist=0, time=0.092ms
10.50.0.123 : given amount of query messages was sent

10.50.0.123 :   unicast, xmt/rcv/%loss = 20/9/55%, min/avg/max/std-dev = 0.021/0.071/0.091/0.020
10.50.0.123 : multicast, xmt/rcv/%loss = 20/10/50%, min/avg/max/std-dev = 0.029/0.085/0.105/0.020
 
* Hm, you can try to run tcpdump while testing with omping; maybe you'll see where the packets get dropped.
* Maybe the switch logs provide some insight.
 
I did a tcpdump and here is what came out; it seems like it just stops:

Code:
root@matterhorn:~# tcpdump -i vmbr0 -s0 -vv net 224.0.0.0/4 -n
tcpdump: listening on vmbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:09:03.184827 IP (tos 0x0, ttl 64, id 46169, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xc3a3!] UDP, length 69
10:09:03.192691 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    10.50.0.10 > 232.43.211.234: igmp v2 report 232.43.211.234
10:09:04.185001 IP (tos 0x0, ttl 64, id 46299, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xc325!] UDP, length 69
10:09:05.184884 IP (tos 0x0, ttl 64, id 46425, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xc3a0!] UDP, length 69
10:09:06.184837 IP (tos 0x0, ttl 64, id 46427, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xc3a4!] UDP, length 69
10:09:07.185912 IP (tos 0x0, ttl 64, id 46498, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xbf86!] UDP, length 69
10:09:08.186932 IP (tos 0x0, ttl 64, id 46557, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xbb79!] UDP, length 69
10:09:09.188067 IP (tos 0x0, ttl 64, id 46603, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xb713!] UDP, length 69
10:09:09.868720 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    10.50.0.10 > 232.43.211.234: igmp v2 report 232.43.211.234
10:09:10.188815 IP (tos 0x0, ttl 64, id 46743, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xb40e!] UDP, length 69
10:09:11.188818 IP (tos 0x0, ttl 64, id 46796, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xb417!] UDP, length 69
10:09:12.189862 IP (tos 0x0, ttl 64, id 46968, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xb012!] UDP, length 69
10:09:15.500713 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    10.50.0.10 > 232.43.211.234: igmp v2 report 232.43.211.234


On the same node I had a very similar issue where SNMP hangs after the Nth query, only over UDP, and restarting the SNMP process always resets it. See my issue here: https://serverfault.com/questions/940604/snmp-timeout-stuck-mid-way

What I am saying is that it seems VERY similar, as if there is some per-process kernel limit on UDP. Could that be it?
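In case it is relevant, this is where I would look for UDP drops on the host (just a guess at the right counters):

Code:
# per-protocol UDP statistics; receive buffer errors would hint at a socket buffer limit
netstat -su
# or, more targeted:
nstat -az UdpInErrors UdpRcvbufErrors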
 
If I pay for the premium subscription plus support, would that mean the Proxmox team could look into this issue more directly on my server? I am 99% sure the issue is on the local server rather than in the external network. Is this something that would fit under the "X support tickets / year"?