Bridge freeze and odd arping answer

dfateyev

New Member
May 25, 2015
I have several Proxmox nodes connected via OpenVPN with TAP interfaces. On each node the TAP interface is bridged into `vmbr1`, which also holds the veth interfaces of the OpenVZ VMs. Together this forms a distributed L2 network spanning the VMs.

At first glance everything works like a charm, but some time ago, as the number of servers grew, I noticed occasional freezes on the bridge. They happen once or twice a day and last 10-120 seconds. During a freeze, ping and TCP/UDP connections between VMs fail, whether the VMs are on the same node or on different nodes. Meanwhile, the VMs still see each other via ARP (`ip neighbor` reports 'reachable').

I enabled STP on `vmbr1` just to see whether it would fix anything, and noticed that the ping results changed:
Code:
root@ovz1:~# ping 172.16.7.2
PING 172.16.7.2 (172.16.7.2) 56(84) bytes of data.
64 bytes from 172.16.7.2: icmp_req=1 ttl=64 time=0.024 ms
64 bytes from 172.16.7.2: icmp_req=1 ttl=64 time=20.8 ms (DUP!)
64 bytes from 172.16.7.2: icmp_req=1 ttl=64 time=20.8 ms (DUP!)
64 bytes from 172.16.7.2: icmp_req=1 ttl=64 time=41.7 ms (DUP!)
64 bytes from 172.16.7.2: icmp_req=2 ttl=64 time=0.040 ms
64 bytes from 172.16.7.2: icmp_req=2 ttl=64 time=20.9 ms (DUP!)
64 bytes from 172.16.7.2: icmp_req=2 ttl=64 time=20.9 ms (DUP!)
64 bytes from 172.16.7.2: icmp_req=2 ttl=64 time=41.7 ms (DUP!)
Checked with arping:
Code:
root@ovz1:~# arping -I eth0 -c 4 172.16.7.2
ARPING 172.16.7.2 from 172.16.7.1 eth0
Unicast reply from 172.16.7.2 [C6:76:4B:05:3E:33]  0.534ms
Unicast reply from 172.16.7.2 [C6:76:4B:05:3E:33]  21.677ms
Unicast reply from 172.16.7.2 [8E:3A:1A:68:77:24]  40.754ms
Unicast reply from 172.16.7.2 [CA:45:11:D4:38:03]  85.768ms
Unicast reply from 172.16.7.2 [6A:74:CF:8C:08:22]  101.030ms
Unicast reply from 172.16.7.2 [76:1B:99:84:9F:1A]  104.720ms
Unicast reply from 172.16.7.2 [76:1B:99:84:9F:1A]  125.585ms
Unicast reply from 172.16.7.2 [F6:C1:90:7B:73:34]  451.747ms
Unicast reply from 172.16.7.2 [B6:22:8A:69:E9:72]  604.148ms
Unicast reply from 172.16.7.2 [5E:5B:49:85:3C:A3]  716.400ms
Unicast reply from 172.16.7.2 [4E:0F:18:93:5B:11]  811.795ms
Unicast reply from 172.16.7.2 [4E:0F:18:93:5B:11]  21.896ms
Unicast reply from 172.16.7.2 [4E:0F:18:93:5B:11]  21.866ms
Unicast reply from 172.16.7.2 [4E:0F:18:93:5B:11]  21.836ms
Sent 4 probes (1 broadcast(s))
Received 14 response(s)
The first two replies are valid; all the others seem to come from random VMs on other nodes in my network.
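
To make output like the above easier to read, here is a minimal sketch of a helper that tallies replies per MAC address. It is not from the original post; the function name `arping_macs` is made up for illustration, and it only assumes the `Unicast reply from ... [MAC]` line format shown above.

```shell
# Hypothetical helper (not part of the original post): read arping output on
# stdin and print each distinct replying MAC with its reply count, most
# frequent first.
arping_macs() {
    grep -oE '\[([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\]' \
        | tr -d '[]' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Example use on a node (arping needs root):
#   arping -I eth0 -c 4 172.16.7.2 | arping_macs
```

More than one line of output for a single IP address means several hosts answered the same ARP request.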

Of course, when I disconnect the node's VPN connection, everything is back to normal:
Code:
root@ovz1:~# ping 172.16.7.2
PING 172.16.7.2 (172.16.7.2) 56(84) bytes of data.
64 bytes from 172.16.7.2: icmp_req=1 ttl=64 time=0.020 ms
64 bytes from 172.16.7.2: icmp_req=2 ttl=64 time=0.031 ms
64 bytes from 172.16.7.2: icmp_req=3 ttl=64 time=0.039 ms
64 bytes from 172.16.7.2: icmp_req=4 ttl=64 time=0.035 ms

There are no duplicate MAC or IP addresses on the nodes. The VMs that answer the ARP requests all have distinct, unique IP addresses.
What could cause the bridge freezes and the strange arping results?
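
For anyone who wants to repeat the duplicate-MAC check, here is a rough sketch. It assumes you have collected `ip -o link` output from every node and VM into one stream; the function name `dup_macs` and the file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: read `ip -o link` output on stdin and print every
# link/ether address that appears more than once (case-insensitive).
dup_macs() {
    awk '{
        # The MAC is the field that follows the literal "link/ether".
        for (i = 1; i < NF; i++)
            if ($i == "link/ether") print tolower($(i + 1))
    }' | sort | uniq -d
}

# Example use, assuming all-links.txt holds the collected output:
#   dup_macs < all-links.txt
```

One caveat: a Linux bridge by default takes over the lowest MAC address among its ports, so `vmbr1` matching one of its own ports on the same node is expected and is not a real conflict.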

P.S. The nodes run Proxmox versions from 3.2-4 to 3.3-5.
 
Well, I reverted the bridge settings to their defaults so as not to confuse the other Proxmox nodes, and that got rid of the DUP ping results.
But I still get these arping answers:
Code:
root@ovz2:/# arping -I eth0 -c 4 172.16.7.1
ARPING 172.16.7.1 from 172.16.7.3 eth0
Unicast reply from 172.16.7.1 [EE:18:2F:D7:D2:63]  0.540ms
Unicast reply from 172.16.7.1 [B6:22:8A:69:E9:72]  42.296ms
Unicast reply from 172.16.7.1 [4E:0F:18:93:5B:11]  50.368ms
Unicast reply from 172.16.7.1 [8E:3A:1A:68:77:24]  166.388ms
Unicast reply from 172.16.7.1 [6A:74:CF:8C:08:22]  172.183ms
Unicast reply from 172.16.7.1 [CA:45:11:D4:38:03]  207.392ms
Unicast reply from 172.16.7.1 [F6:C1:90:7B:73:34]  271.387ms
Unicast reply from 172.16.7.1 [5E:5B:49:85:3C:A3]  303.805ms
Unicast reply from 172.16.7.1 [76:1B:99:84:9F:1A]  407.756ms
Unicast reply from 172.16.7.1 [76:1B:99:84:9F:1A]  0.534ms
Unicast reply from 172.16.7.1 [76:1B:99:84:9F:1A]  0.545ms
Unicast reply from 172.16.7.1 [76:1B:99:84:9F:1A]  0.534ms
Sent 4 probes (1 broadcast(s))
Received 12 response(s)
Does anybody have an idea why I'm seeing these results?
 
Check your proxy ARP settings.
I really do use proxy ARP:
Code:
net.ipv4.conf.all.proxy_arp = 1
net.ipv4.conf.default.proxy_arp = 1


# Enables source route verification
net.ipv4.conf.all.rp_filter = 0
but it seems to be needed so that VMs located on different nodes can talk to each other.
Anyway, I'll try disabling ARP proxying on each node's `vmbr1`, which is bridged with the VPN TAP interface, and see if that helps.
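
One thing to keep in mind when narrowing this down: the kernel enables proxy ARP on an interface if either `conf/all/proxy_arp` or that interface's own `proxy_arp` is set, so turning it off for `vmbr1` alone has no effect while `all` is still 1. Below is a small sketch for listing where it is enabled; the function name `proxy_arp_on` is made up, and it only assumes the `key = value` format printed by `sysctl -a`.

```shell
# Hypothetical helper: read `sysctl -a` style lines on stdin and print the
# name of every net.ipv4.conf entry whose proxy_arp value is 1.
proxy_arp_on() {
    awk -F'[ =]+' '
        $1 ~ /^net\.ipv4\.conf\..*\.proxy_arp$/ && $2 == 1 {
            sub(/^net\.ipv4\.conf\./, "", $1)   # strip the common prefix
            sub(/\.proxy_arp$/, "", $1)         # strip the setting name
            print $1
        }'
}

# Example use on a node:
#   sysctl -a 2>/dev/null | proxy_arp_on
#
# To actually switch it off for the bridge, both values have to be 0
# (run as root; a sketch, not the exact values from this thread):
#   sysctl -w net.ipv4.conf.all.proxy_arp=0
#   sysctl -w net.ipv4.conf.vmbr1.proxy_arp=0
```

If `all` shows up in the output, every interface on that node answers ARP requests for addresses it can route to, which would be consistent with many hosts replying for one IP.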

UPD: Fixed by setting more precise `proxy_arp` and `rp_filter` values.