Multicast problem solved by enabling promisc mode

Discussion in 'Proxmox VE: Installation and configuration' started by Xabi, Apr 25, 2016.

  1. Xabi

    Xabi New Member

    I have a three-node cluster that suddenly started losing quorum. Restarting some services (pve-cluster, cman, etc.) helped only for 3-5 minutes; after that, quorum was lost again, time after time.

    After checking multicast traffic with omping [1], I saw that multicast worked fine at high packet rates:

    Code:
    root@prox:~/scripts# omping -c 10000 -i 0.001 -F -q 192.168.1.1 192.168.1.2 192.168.1.3
    192.168.1.2 : waiting for response msg
    192.168.1.3 : waiting for response msg
    192.168.1.3 : joined (S,G) = (*, 232.43.211.234), pinging
    192.168.1.2 : joined (S,G) = (*, 232.43.211.234), pinging
    192.168.1.2 : waiting for response msg
    192.168.1.2 : server told us to stop
    192.168.1.3 : given amount of query messages was sent
    
    192.168.1.2 :   unicast, xmt/rcv/%loss = 9534/9534/0%, min/avg/max/std-dev = 0.052/0.132/1.145/0.030
    192.168.1.2 : multicast, xmt/rcv/%loss = 9534/9533/0% (seq>=2 0%), min/avg/max/std-dev = 0.066/0.142/1.152/0.032
    192.168.1.3 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.030/0.078/1.045/0.027
    192.168.1.3 : multicast, xmt/rcv/%loss = 10000/9999/0% (seq>=2 0%), min/avg/max/std-dev = 0.036/0.084/1.054/0.027
    But I was losing multicast packets in a longer test:

    Code:
    root@prox:~# omping -c 600 -i 1 -q 192.168.1.1 192.168.1.2 192.168.1.3
    192.168.1.2 : waiting for response msg
    192.168.1.3 : waiting for response msg
    192.168.1.3 : joined (S,G) = (*, 232.43.211.234), pinging
    192.168.1.2 : joined (S,G) = (*, 232.43.211.234), pinging
    192.168.1.2 : given amount of query messages was sent
    192.168.1.3 : given amount of query messages was sent
    
    192.168.1.2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.064/0.163/1.017/0.045
    192.168.1.2 : multicast, xmt/rcv/%loss = 600/264/56%, min/avg/max/std-dev = 0.098/0.175/0.304/0.033
    192.168.1.3 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.051/0.100/0.559/0.031
    192.168.1.3 : multicast, xmt/rcv/%loss = 600/264/56%, min/avg/max/std-dev = 0.058/0.113/0.566/0.041
    I tried disabling, enabling and changing the IGMP snooping configuration on my switch, but it made no difference: omping still started to lose multicast packets after 3-4 minutes (the bridge's own settings on the node can be inspected too; see the quick check after the output below):

    Code:
        ...
    
        prox1.mydomain.com : multicast, seq=253, size=69 bytes, dist=0, time=0.190ms
        prox2.mydomain.com    : multicast, seq=251, size=69 bytes, dist=0, time=0.172ms
        prox1.mydomain.com :   unicast, seq=254, size=69 bytes, dist=0, time=0.161ms
        prox2.mydomain.com    : multicast, seq=252, size=69 bytes, dist=0, time=0.155ms
        prox2.mydomain.com    :   unicast, seq=252, size=69 bytes, dist=0, time=0.153ms
        prox1.mydomain.com : multicast, seq=254, size=69 bytes, dist=0, time=0.209ms
        prox2.mydomain.com    :   unicast, seq=253, size=69 bytes, dist=0, time=0.171ms
        prox2.mydomain.com    : multicast, seq=253, size=69 bytes, dist=0, time=0.179ms
        prox1.mydomain.com :   unicast, seq=255, size=69 bytes, dist=0, time=0.262ms
        prox1.mydomain.com : multicast, seq=255, size=69 bytes, dist=0, time=0.310ms
        prox1.mydomain.com :   unicast, seq=256, size=69 bytes, dist=0, time=0.130ms
        prox2.mydomain.com    :   unicast, seq=254, size=69 bytes, dist=0, time=0.172ms
        prox1.mydomain.com : multicast, seq=256, size=69 bytes, dist=0, time=0.178ms
        prox2.mydomain.com    : multicast, seq=254, size=69 bytes, dist=0, time=0.221ms
        prox1.mydomain.com :   unicast, seq=257, size=69 bytes, dist=0, time=0.116ms
        prox2.mydomain.com    :   unicast, seq=255, size=69 bytes, dist=0, time=0.129ms
        prox1.mydomain.com : multicast, seq=257, size=69 bytes, dist=0, time=0.164ms
        prox2.mydomain.com    : multicast, seq=255, size=69 bytes, dist=0, time=0.178ms
        prox1.mydomain.com :   unicast, seq=258, size=69 bytes, dist=0, time=0.151ms
        prox2.mydomain.com    :   unicast, seq=256, size=69 bytes, dist=0, time=0.147ms
        prox1.mydomain.com : multicast, seq=258, size=69 bytes, dist=0, time=0.199ms
        prox2.mydomain.com    : multicast, seq=256, size=69 bytes, dist=0, time=0.196ms
        prox1.mydomain.com :   unicast, seq=259, size=69 bytes, dist=0, time=0.119ms
        prox2.mydomain.com    :   unicast, seq=257, size=69 bytes, dist=0, time=0.127ms
        prox1.mydomain.com : multicast, seq=259, size=69 bytes, dist=0, time=0.167ms
        prox2.mydomain.com    : multicast, seq=257, size=69 bytes, dist=0, time=0.176ms    ===> LAST MULTICAST PACKET
        prox1.mydomain.com :   unicast, seq=260, size=69 bytes, dist=0, time=0.158ms
        prox2.mydomain.com    :   unicast, seq=258, size=69 bytes, dist=0, time=0.152ms
        prox1.mydomain.com :   unicast, seq=261, size=69 bytes, dist=0, time=0.129ms
        prox2.mydomain.com    :   unicast, seq=259, size=69 bytes, dist=0, time=0.155ms
        prox2.mydomain.com    :   unicast, seq=260, size=69 bytes, dist=0, time=0.143ms
        prox1.mydomain.com :   unicast, seq=262, size=69 bytes, dist=0, time=0.200ms
        prox1.mydomain.com :   unicast, seq=263, size=69 bytes, dist=0, time=0.121ms
        prox2.mydomain.com    :   unicast, seq=261, size=69 bytes, dist=0, time=0.135ms
        prox1.mydomain.com :   unicast, seq=264, size=69 bytes, dist=0, time=0.116ms
        prox2.mydomain.com    :   unicast, seq=262, size=69 bytes, dist=0, time=0.126ms
        prox1.mydomain.com :   unicast, seq=265, size=69 bytes, dist=0, time=0.133ms
        prox2.mydomain.com    :   unicast, seq=263, size=69 bytes, dist=0, time=0.134ms
        prox1.mydomain.com :   unicast, seq=266, size=69 bytes, dist=0, time=0.127ms
        prox2.mydomain.com    :   unicast, seq=264, size=69 bytes, dist=0, time=0.160ms
        prox1.mydomain.com :   unicast, seq=267, size=69 bytes, dist=0, time=0.125ms
        prox2.mydomain.com    :   unicast, seq=265, size=69 bytes, dist=0, time=0.126ms
        prox1.mydomain.com :   unicast, seq=268, size=69 bytes, dist=0, time=0.112ms
        prox2.mydomain.com    :   unicast, seq=266, size=69 bytes, dist=0, time=0.126ms
        prox1.mydomain.com :   unicast, seq=269, size=69 bytes, dist=0, time=0.137ms
        prox2.mydomain.com    :   unicast, seq=267, size=69 bytes, dist=0, time=0.148ms
        prox1.mydomain.com :   unicast, seq=270, size=69 bytes, dist=0, time=0.145ms
        prox2.mydomain.com    :   unicast, seq=268, size=69 bytes, dist=0, time=0.151ms
        prox1.mydomain.com :   unicast, seq=271, size=69 bytes, dist=0, time=0.145ms
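    For reference, the Linux bridge on each node also has its own IGMP snooping settings in sysfs, which can be checked independently of the switch (a quick read-only look, assuming the bridge is called vmbr0):

    Code:
    # 1 = enabled, 0 = disabled
    cat /sys/class/net/vmbr0/bridge/multicast_snooping
    # the bridge's own IGMP querier (off by default)
    cat /sys/class/net/vmbr0/bridge/multicast_querier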
    Finally, I found the solution [2]. Enabling promiscuous mode on the bridge interface of every node solves the problem:

    Code:
    ip link set vmbr0 promisc on
    Now I don't lose any multicast packets and my cluster's quorum is stable. So, apparently, it wasn't the switch but Proxmox itself...
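
    To make this survive a reboot, one option is to add the same command as a post-up line to the vmbr0 stanza in /etc/network/interfaces on each node. A sketch only (the address and bridge ports are placeholders, not my exact config):

    Code:
    auto vmbr0
    iface vmbr0 inet static
        address 192.168.1.1
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
        # re-enable promiscuous mode whenever the bridge comes up
        post-up ip link set vmbr0 promisc on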

    Now, the question is: why? And is there a better solution?

    Thanks!!



    1: https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues
    2: https://forum.proxmox.com/threads/pvecm-status-activity-blocked.26910/#post-135333
     
  2. fireon

    fireon Well-Known Member
    Proxmox Subscriber

    In the past (PVE 3) we had something similar. I don't really know why... On the advice of PVE support we moved cluster communication to a separate VLAN, and not only a VLAN: we also added an extra physical NIC bond used exclusively for the cluster communication. After that we never had such problems again. Maybe it is just a bit sensitive.

    But that was on version 3.x. On version 4.x we haven't tested without the extra VLAN; every cluster has its own physical interfaces and VLAN for the cluster communication.
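
    A rough sketch of what such a dedicated cluster network can look like in /etc/network/interfaces (the NIC names, bond, VLAN ID and addresses here are only examples, not our real values):

    Code:
    # extra NICs (eth2 + eth3) bonded just for the cluster...
    auto bond1
    iface bond1 inet manual
        bond-slaves eth2 eth3
        bond-miimon 100
        bond-mode 802.3ad

    # ...and a VLAN on top of it carrying only the cluster/corosync traffic
    auto bond1.4000
    iface bond1.4000 inet static
        address 10.10.10.1
        netmask 255.255.255.0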
     
  3. Xabi

    Xabi New Member

    Thanks for your reply,

    My cluster is on PVE 3.4, and the nodes have the following network configuration (a simplified sketch follows the list):
    • A bond (bond0) formed by two Gigabit interfaces (eth0 + eth1), using LACP (802.3ad)
    • On the switch I have an untagged VLAN (e.g. ID 6) that I use not only for cluster communication but also for some of the VMs inside Proxmox
    • I permit all tagged VLANs so I can configure any VM with whatever network configuration it needs
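
    Roughly like this (the addresses are placeholders and the exact options are from memory, not copied from my nodes):

    Code:
    auto bond0
    iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-miimon 100
        bond-mode 802.3ad

    auto vmbr0
    iface vmbr0 inet static
        address 192.168.1.1
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0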
    I would like to migrate my cluster to PVE 4, so maybe I will try using a separate VLAN / separate network interfaces just for the cluster networking.
     