Duplicate ARP problem

Stuart Howlette

New Member
Aug 13, 2019
UK
Hi all,

We appear to be hitting an issue with duplicate ARP entries being generated by multiple Proxmox hosts. These are being generated for the management IP of the hosts.

The running version of each component is listed below:
Code:
pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-15-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-3
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-52
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

We appear to be hitting some strange behaviour where two interfaces on the hosts respond to ARP, with different MACs, and interestingly only if the source address of the ARP packet is 0.0.0.0.

Code:
ip a | grep -EiA2 "vmbr0|vport0"
vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 18:66:da:51:b3:eb brd ff:ff:ff:ff:ff:ff

vport0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether aa:83:29:09:fa:bc brd ff:ff:ff:ff:ff:ff
    inet 10.21.0.15/24 brd 10.21.0.255 scope global vport0

In the packet captures, we see ARP replies with source MAC addresses of both 18:66:da:51:b3:eb and aa:83:29:09:fa:bc.
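For anyone wanting to check their own captures for the same thing, here is a rough sketch that lists the distinct source MACs seen in ARP replies for one IP. It assumes `tcpdump -e -n` text output on stdin (with `-e`, the second field is the Ethernet source MAC); the function name and the capture file name in the usage note are made up for illustration.

```shell
#!/bin/sh
# Sketch: list the distinct source MACs seen in ARP replies for one IP.
# Reads `tcpdump -e -n` text output on stdin.
# arp_reply_macs is a hypothetical helper name, not part of any tool.
arp_reply_macs() {
    ip="$1"
    awk -v ip="$ip" '
        # With -e, field 2 is the Ethernet source MAC. ARP replies are
        # printed as "Reply <ip> is-at <mac>", so match on that pair.
        {
            for (i = 1; i < NF; i++) {
                if ($i == "Reply" && $(i + 1) == ip) {
                    print $2
                    break
                }
            }
        }
    ' | sort -u
}
```

Usage would be something like `tcpdump -e -n -r capture.pcap arp | arp_reply_macs 10.21.0.15`, where `capture.pcap` is a placeholder. More than one MAC in the output for a single IP is the symptom described here.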

Previously this hasn't caused any issues. However, we recently upgraded our Juniper switching, which now enables ARP suppression by default. This means that rather than both ARP replies being returned, sometimes only one gets through, and it happens to be the one that starts blackholing traffic.

We can recreate the issue using arping, by turning ARP suppression back on, and sending ARP packets to the IP with a source IP of 0.0.0.0.
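A rough sketch of the reproduction, assuming the iputils version of arping, where `-D` (duplicate address detection mode) sends requests with a sender IP of 0.0.0.0. The interface name `eth0` is a placeholder for whatever the test machine uses, and `probe_cmd` is just an illustrative helper.

```shell
#!/bin/sh
# Sketch only: build (and optionally run) an ARP probe for the management IP.
# Assumes iputils arping; -D sends requests with a sender IP of 0.0.0.0.
# IFACE is a placeholder for the interface on the test machine.
TARGET_IP="${TARGET_IP:-10.21.0.15}"
IFACE="${IFACE:-eth0}"
COUNT="${COUNT:-3}"

# Hypothetical helper: print the arping command line for the settings above.
probe_cmd() {
    printf 'arping -D -I %s -c %s %s\n' "$IFACE" "$COUNT" "$TARGET_IP"
}

# Print the command by default; set RUN=1 (as root) to actually send the probes.
if [ "${RUN:-0}" = "1" ]; then
    $(probe_cmd)
else
    probe_cmd
fi
```

arping itself may stop after the first reply in this mode, so the second reply is easier to spot in a packet capture taken while the probe runs.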

Using 0.0.0.0 as a source IP is a valid use of ARP, and appears to be for duplicate address detection. Unfortunately this very detection is causing duplicate ARP responses, usefully enough!

We see this across 3 separate hosts, and the only way we can stop this problem is to disable ARP suppression on our switches. This isn't a good permanent fix, as the next version of JunOS will remove the ARP suppression feature.

Does anyone have any insight into what could be causing this? I can provide PCAPs (although not yet, as I can't post links yet!) to show the behaviour.

I am more than happy to provide more details if it helps!
 
We may have worked around this using the following sysctl option:

net.ipv4.conf.vmbr0.arp_ignore=2

So far all is good. If this is the case, I would have thought it prudent for this to be part of the default Proxmox install.
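For reference, a minimal sketch of applying the setting to the running kernel and producing the line to persist it. The comments summarise the `arp_ignore` values from the kernel's ip-sysctl documentation; the helper name and the file name mentioned afterwards are illustrative, not anything Proxmox ships.

```shell
#!/bin/sh
# Sketch: apply arp_ignore=2 to a bridge and print the line to persist.
# Meaning of arp_ignore (per the kernel ip-sysctl documentation):
#   0 = reply for any local target IP, configured on any interface (default)
#   1 = reply only if the target IP is configured on the incoming interface
#   2 = as 1, and the sender's IP must be in the same subnet on that interface
BRIDGE="${BRIDGE:-vmbr0}"

# Hypothetical helper: format the sysctl key=value pair for an interface.
arp_ignore_setting() {
    printf 'net.ipv4.conf.%s.arp_ignore=%s\n' "$1" "$2"
}

# Print the setting by default; set APPLY=1 (as root) to change the running kernel.
if [ "${APPLY:-0}" = "1" ]; then
    sysctl -w "$(arp_ignore_setting "$BRIDGE" 2)"
else
    arp_ignore_setting "$BRIDGE" 2
fi
```

To survive a reboot, the printed line could go in a file under /etc/sysctl.d/ (for example a hypothetical 90-arp-ignore.conf). One caveat: per-interface keys only exist once the interface does, so a boot-time sysctl run may happen before vmbr0 is created. A `post-up sysctl -w ...` line in the vmbr0 stanza of /etc/network/interfaces may be another option.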
 
# Main interface
allow-vmbr0 bond0
iface bond0 inet manual
ovs_bonds eno3 eno4
ovs_type OVSBond
ovs_bridge vmbr0
ovs_options lacp=active bond_mode=balance-tcp

auto lo
iface lo inet loopback

# Interface to secondary network
allow-vmbr1 eno1
iface eno1 inet manual
ovs_type OVSPort
ovs_bridge vmbr1

# Mirror to port capture server
allow-vmbr0 eno2
iface eno2 inet manual
ovs_type OVSPort
ovs_bridge vmbr0

iface eno3 inet manual

iface eno4 inet manual

# Management interface
allow-vmbr0 vport0
iface vport0 inet static
address 10.21.0.15
netmask 255.255.255.0
gateway 10.21.0.210
ovs_type OVSIntPort
ovs_bridge vmbr0

# Secondary network
allow-vmbr1 vport1
iface vport1 inet static
address 172.22.1.15
netmask 255.255.255.0
ovs_type OVSIntPort
ovs_bridge vmbr1
ovs_options tag=100

auto vmbr0
iface vmbr0 inet manual
ovs_type OVSBridge
ovs_ports bond0 vport0 eno2

auto vmbr1
iface vmbr1 inet manual
ovs_type OVSBridge
ovs_ports eno1 vport1

We're using openvswitch on these.
 
The mirror is done using OVS Mirrors.

Code:
ovs-vsctl -- set Bridge vmbr0 mirrors=@m \
 -- --id=@vm-01 get Port tap102i0 \
 -- --id=@vm-02 get Port tap158i0 \
 -- --id=@dest get Port eno2 \
 -- --id=@m create Mirror name=mirrorport select-src-port=@vm-01,@vm-02 output-port=@dest

ovs-vsctl list Bridge vmbr0 | grep -i mirrors
mirrors             : [f8c0b5f4-0649-4331-ae5c-cd0eabf622f8]

In looking around at issues with Linux bridging, I found that this isn't just tied to either Proxmox or Openvswitch. This happens with standard Linux bridging too, hence the application of the arp_ignore.
 
A similar issue, but without Openvswitch or Proxmox
https://serverfault.com/questions/206316/linux-bridge-responding-to-arp-on-wrong-interface

Now that I can post links...

image.png

image-1.png
 
As you don't use a VLAN on vport0, have you tried removing it and setting up the IP on vmbr0 directly?

We could, but that restricts us if we want to move to VLANs on vport0 in future. In the event that VLANs are passing through vmbr0, I would hope that it doesn't respond to ARP for the VLANs at least.

More than anything, I'm wondering if people have seen similar behaviour. We have our workaround, which is working well, which is why I'm suggesting it should probably be a default behaviour/parameter when using bridges on Proxmox without IPs on them.
 
What I'm not sure about is whether it'll respond to ARP if the interface has a VLAN.
(Maybe this only happens when the interface has no VLAN and an IP is set up on the interface.)
 
That is kind of my assumption, however I also assumed an interface without an IP would never respond to ARP! :D
 
