MultiCast Issue

yswery

I have been having issues with the clustering function on Proxmox for a while now, and I hope someone can help.

I seem to be getting packet loss for whatever reason, and I have no idea how to better diagnose this on our Proxmox 5.3 system.

As you can see in this omping output, there is loss (it happens every time).
Note: 10.50.0.10 is the node's IP address (the same node we are testing from).

Code:
root@matterhorn:~# omping 10.50.0.10 -c 20
10.50.0.10 : waiting for response msg
10.50.0.10 : joined (S,G) = (*, 232.43.211.234), pinging
10.50.0.10 :   unicast, seq=1, size=69 bytes, dist=0, time=0.026ms
10.50.0.10 : multicast, seq=1, size=69 bytes, dist=0, time=0.038ms
10.50.0.10 :   unicast, seq=2, size=69 bytes, dist=0, time=0.074ms
10.50.0.10 : multicast, seq=2, size=69 bytes, dist=0, time=0.089ms
...
10.50.0.10 :   unicast, seq=8, size=69 bytes, dist=0, time=0.076ms
10.50.0.10 : multicast, seq=8, size=69 bytes, dist=0, time=0.092ms
10.50.0.10 :   unicast, seq=9, size=69 bytes, dist=0, time=0.067ms
10.50.0.10 : multicast, seq=9, size=69 bytes, dist=0, time=0.080ms
10.50.0.10 : multicast, seq=10, size=69 bytes, dist=0, time=0.095ms
10.50.0.10 : given amount of query messages was sent

10.50.0.10 :   unicast, xmt/rcv/%loss = 20/9/55%, min/avg/max/std-dev = 0.026/0.079/0.135/0.028
10.50.0.10 : multicast, xmt/rcv/%loss = 20/10/50%, min/avg/max/std-dev = 0.038/0.094/0.149/0.027

As you can see, at the 10th packet it just stops working. I cannot seem to find much in my syslogs about this either, and it is a real problem because I cannot use clustering because of this.

System Info:
Code:
root@matterhorn:~# uname -r
4.15.18-8-pve

Code:
root@matterhorn:~# pveversion
pve-manager/5.3-5/97ae681d (running kernel: 4.15.18-8-pve)

Code:
root@matterhorn:~# pct list| wc -l
285
root@matterhorn:~# qm list|wc -l
11

As you can see above, we are running a lot of containers on this one machine too.
 
Hi,

I would say you have a problem in your network; maybe the multicast querier is not working correctly.
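
You can also check the bridge's multicast flags directly on the node via sysfs (assuming your bridge is vmbr0):

Code:
# 1 = the bridge does IGMP snooping, 0 = snooping disabled
cat /sys/class/net/vmbr0/bridge/multicast_snooping
# 1 = the bridge acts as IGMP querier itself, 0 = it relies on an external querier (usually the switch)
cat /sys/class/net/vmbr0/bridge/multicast_querier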
 
@wolfgang
Thanks for the reply. I tried disabling IGMP snooping on the switch and also multicast_snooping on the Proxmox node, but it is still the same issue. Could it be related to the number of containers running on the machine?
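
For reference, the node-side change was roughly along these lines (a sketch only; the physical port name and netmask are placeholders, the post-up line is the relevant bit):

Code:
# /etc/network/interfaces (vmbr0 stanza, sketch)
auto vmbr0
iface vmbr0 inet static
        address 10.50.0.10
        netmask 255.255.255.0
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0
        post-up echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping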

Is there anywhere I can look to better diagnose what is at fault here?
 
10.50.0.10 :   unicast, xmt/rcv/%loss = 20/9/55%, min/avg/max/std-dev = 0.026/0.079/0.135/0.028
10.50.0.10 : multicast, xmt/rcv/%loss = 20/10/50%, min/avg/max/std-dev = 0.038/0.094/0.149/0.027

You also have packet loss for unicast packets - it seems the problem is elsewhere - maybe an IP is being used twice, a broken cable, or a broken switch?
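
One quick way to test the duplicate-IP theory is ARP duplicate address detection with arping (interface name assumed to be vmbr0); if another machine answers for the address, it will show up here:

Code:
# -D = duplicate address detection mode; exits non-zero if someone else replies for the IP
arping -D -c 3 -I vmbr0 10.50.0.10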
 
You also have packet loss for unicast packets - it seems the problem is elsewhere - maybe an IP is being used twice, a broken cable, or a broken switch?

Yep, that is what I thought at first too, but I looked through the network and the whole setup and could not find anything.

Also, just like in the omping output I showed above, it is *ALWAYS* failing from the 10th packet onwards. If I cancel the process and start a new omping, it starts fresh again and once more fails at the 10th packet. I am starting to think it is a "setting" or "limit" somewhere inside the Proxmox machine (for example a default sysctl value that I am maxing out due to running so many processes/containers?).
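
To try to rule the "limit" theory in or out, these are the standard kernel knobs I can think of that scale with lots of containers (just values to read, not a confirmed cause):

Code:
# IGMP group memberships allowed per socket (kernel default is 20)
sysctl net.ipv4.igmp_max_memberships
# global socket receive buffer limits (bytes)
sysctl net.core.rmem_default net.core.rmem_max
# UDP memory pressure thresholds (in pages)
sysctl net.ipv4.udp_mem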
 
Also, for extra verification, I added a new IP address to the vmbr0 interface of the Proxmox node, a random address which I checked beforehand is not used anywhere, and I got the exact same results (a rough sketch of the commands follows the output below):

Code:
# omping  10.50.0.123 -c 20
10.50.0.123 : waiting for response msg
10.50.0.123 : joined (S,G) = (*, 232.43.211.234), pinging
10.50.0.123 :   unicast, seq=1, size=69 bytes, dist=0, time=0.021ms
10.50.0.123 : multicast, seq=1, size=69 bytes, dist=0, time=0.029ms
10.50.0.123 :   unicast, seq=2, size=69 bytes, dist=0, time=0.091ms
10.50.0.123 : multicast, seq=2, size=69 bytes, dist=0, time=0.105ms
10.50.0.123 :   unicast, seq=3, size=69 bytes, dist=0, time=0.073ms
10.50.0.123 : multicast, seq=3, size=69 bytes, dist=0, time=0.087ms
10.50.0.123 :   unicast, seq=4, size=69 bytes, dist=0, time=0.071ms
10.50.0.123 : multicast, seq=4, size=69 bytes, dist=0, time=0.085ms
10.50.0.123 :   unicast, seq=5, size=69 bytes, dist=0, time=0.078ms
10.50.0.123 : multicast, seq=5, size=69 bytes, dist=0, time=0.092ms
10.50.0.123 :   unicast, seq=6, size=69 bytes, dist=0, time=0.078ms
10.50.0.123 : multicast, seq=6, size=69 bytes, dist=0, time=0.091ms
10.50.0.123 :   unicast, seq=7, size=69 bytes, dist=0, time=0.077ms
10.50.0.123 : multicast, seq=7, size=69 bytes, dist=0, time=0.091ms
10.50.0.123 :   unicast, seq=8, size=69 bytes, dist=0, time=0.076ms
10.50.0.123 : multicast, seq=8, size=69 bytes, dist=0, time=0.090ms
10.50.0.123 :   unicast, seq=9, size=69 bytes, dist=0, time=0.072ms
10.50.0.123 : multicast, seq=9, size=69 bytes, dist=0, time=0.085ms
10.50.0.123 : multicast, seq=10, size=69 bytes, dist=0, time=0.092ms
10.50.0.123 : given amount of query messages was sent

10.50.0.123 :   unicast, xmt/rcv/%loss = 20/9/55%, min/avg/max/std-dev = 0.021/0.071/0.091/0.020
10.50.0.123 : multicast, xmt/rcv/%loss = 20/10/50%, min/avg/max/std-dev = 0.029/0.085/0.105/0.020
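
The temporary address was added roughly like this (the /24 prefix is an assumption) and removed again afterwards:

Code:
ip addr add 10.50.0.123/24 dev vmbr0
omping 10.50.0.123 -c 20
ip addr del 10.50.0.123/24 dev vmbr0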
 
* hm - you can try to run tcpdump while testing with omping - maybe you'll see where the packets get dropped
* maybe the switch logs provide some insight
 
I did a tcpdump and here is what came out; it seems like it just stops:

Code:
root@matterhorn:~# tcpdump -i vmbr0 -s0 -vv net 224.0.0.0/4 -n
tcpdump: listening on vmbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:09:03.184827 IP (tos 0x0, ttl 64, id 46169, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xc3a3!] UDP, length 69
10:09:03.192691 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    10.50.0.10 > 232.43.211.234: igmp v2 report 232.43.211.234
10:09:04.185001 IP (tos 0x0, ttl 64, id 46299, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xc325!] UDP, length 69
10:09:05.184884 IP (tos 0x0, ttl 64, id 46425, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xc3a0!] UDP, length 69
10:09:06.184837 IP (tos 0x0, ttl 64, id 46427, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xc3a4!] UDP, length 69
10:09:07.185912 IP (tos 0x0, ttl 64, id 46498, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xbf86!] UDP, length 69
10:09:08.186932 IP (tos 0x0, ttl 64, id 46557, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xbb79!] UDP, length 69
10:09:09.188067 IP (tos 0x0, ttl 64, id 46603, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xb713!] UDP, length 69
10:09:09.868720 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    10.50.0.10 > 232.43.211.234: igmp v2 report 232.43.211.234
10:09:10.188815 IP (tos 0x0, ttl 64, id 46743, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xb40e!] UDP, length 69
10:09:11.188818 IP (tos 0x0, ttl 64, id 46796, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xb417!] UDP, length 69
10:09:12.189862 IP (tos 0x0, ttl 64, id 46968, offset 0, flags [DF], proto UDP (17), length 97)
    10.50.0.10.4321 > 232.43.211.234.4321: [bad udp cksum 0xc6b0 -> 0xb012!] UDP, length 69
10:09:15.500713 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    10.50.0.10 > 232.43.211.234: igmp v2 report 232.43.211.234


On the same node I had a very similar issue where SNMP hangs after the Nth query. It happens only on UDP, and restarting the SNMP process always resets it. See my issue here: https://serverfault.com/questions/940604/snmp-timeout-stuck-mid-way

What I am saying is that this seems VERY similar, as if there is some kernel limit per process on UDP. Could that be it?
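
If it helps, these standard UDP counters should show whether packets are being dropped at the socket/buffer level while omping is running (4321 is omping's default port, 161 is SNMP):

Code:
# protocol-wide UDP error counters (buffer drops, checksum errors, ...)
nstat -az UdpInErrors UdpRcvbufErrors UdpInCsumErrors
# per-socket receive queue / buffer usage for the omping and SNMP sockets
ss -u -a -m | grep -A1 -E ':4321|:161'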
 
If I pay for a premium subscription plus support, would that mean the Proxmox team could look at this issue more directly on my server? I am 99% sure the issue is on the local server rather than in the external network. Is this something that would suit the "X support tickets / year" offering?
 
