Hi all,
We appear to be hitting an issue with duplicate ARP replies being generated by multiple Proxmox hosts. They are generated for the management IP of each host.
The running version of each component is listed below:
Code:
pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-15-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-3
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-52
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
The strange behaviour is that two interfaces on each host respond to ARP requests for the management IP, each with a different MAC, and interestingly only when the source IP address in the ARP request is 0.0.0.0.
Code:
ip a | grep -EiA2 "vmbr0|vport0"
vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 18:66:da:51:b3:eb brd ff:ff:ff:ff:ff:ff
vport0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether aa:83:29:09:fa:bc brd ff:ff:ff:ff:ff:ff
    inet 10.21.0.15/24 brd 10.21.0.255 scope global vport0
In the packet captures, we see two ARP replies for the same IP: one with a source MAC address of 18:66:da:51:b3:eb (vmbr0) and one with aa:83:29:09:fa:bc (vport0), even though only vport0 carries the address.
Previously this didn't cause any issues. However, we recently upgraded our Juniper switching, which now enables ARP suppression by default. This means that rather than both ARP replies being returned, sometimes only one gets through, and it so happens to be the one which starts blackholing traffic.
We can recreate the issue by turning ARP suppression back on and using arping to send ARP requests for the IP with a source IP of 0.0.0.0.
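For anyone who wants to try the same test, here is a rough sketch of how it can be run. It assumes the iputils version of arping (where -D sends probes with a source IP of 0.0.0.0; other arping implementations use different flags), a test machine on the same VLAN, and an interface called eth0, which is only a placeholder. The helper distinct_arp_macs is a hypothetical function written for this post and is not part of any tool. It reads tcpdump's text output and lists every MAC that claimed the IP.

```shell
# Hypothetical helper: read tcpdump ARP output on stdin and print each
# distinct MAC address that replied "is-at" for the given IP.
distinct_arp_macs() {
    awk -v ip="$1" '
        {
            # Look for the token sequence: Reply <ip> is-at <mac>,
            for (i = 1; i + 3 <= NF; i++) {
                if ($i == "Reply" && $(i + 1) == ip && $(i + 2) == "is-at") {
                    mac = tolower($(i + 3))
                    sub(/,$/, "", mac)   # drop the trailing comma tcpdump prints
                    print mac
                }
            }
        }' | sort -u
}

# Example run (needs root; eth0 is an assumed interface name):
#   terminal 1:  tcpdump -l -n -e -i eth0 arp > arp.txt
#   terminal 2:  arping -D -I eth0 -c 3 10.21.0.15
#   afterwards:  distinct_arp_macs 10.21.0.15 < arp.txt
```

If more than one MAC is printed for 10.21.0.15, the host answered the probe from more than one interface. Capturing with tcpdump is deliberate here, since arping in -D mode may stop at the first reply it sees.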
Using 0.0.0.0 as a source IP is a valid usage of ARP, and appears to be for duplicate address detection (the ARP probe described in RFC 5227). Ironically, this very detection is what causes the duplicate ARP responses!
We see this across 3 separate hosts, and the only way we can stop the problem is to disable ARP suppression on our switches. This isn't a good permanent fix, as the next version of JunOS will remove the ARP suppression feature.
Does anyone have any insight into what could be causing this? I can provide PCAPs (although not yet, as I can't post links yet!) to show the behaviour.
I am more than happy to provide more details if it helps!