Invalid ARP responses cause network problems

lkiesow · Nov 17, 2022

While I have solved the issue, I hope this is helpful to others who run into the same problem. Also, it would be great if someone could verify that this is a sensible approach.

Recently, we kept losing the network connection to our PVE servers at semi-regular intervals. Virtual machines on the affected server that use a network bridge to the external network could still be reached, though. Assuming a network problem, I talked to one of our network engineers, who, after a short investigation, pointed out that the server did not answer ARP requests with only the MAC address it should respond with, but with several different MAC addresses. This meant that sometimes the correct MAC was linked to the server's IP address and sometimes… not. Which one wins probably comes down to timing and luck.

Looking at this with tcpdump, you can clearly see that not only the network bridge but also the virtual machine interfaces answer:

Code:

❯ tcpdump -ennqti any arp
eno2       B   ifindex   3 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
fwpr3000p0 Out ifindex 119 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
vmbr0      B   ifindex  48 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
           ↓ the pve bridge answers
vmbr0      Out ifindex  48 90:1b:0e:10:60:db Reply 131.173.??.?? is-at 90:1b:0e:10:60:db, length 28
eno2       Out ifindex   3 90:1b:0e:10:60:db Reply 131.173.??.?? is-at 90:1b:0e:10:60:db, length 28
fwln3000i0 B   ifindex 120 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
tap3000i0  Out ifindex 117 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
fwbr3000i0 B   ifindex 118 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
           ↓ the virtual machine answers as well
fwbr3000i0 Out ifindex 118 26:2b:40:3a:8e:88 Reply 131.173.??.?? is-at 26:2b:40:3a:8e:88, length 28
fwln3000i0 Out ifindex 120 26:2b:40:3a:8e:88 Reply 131.173.??.?? is-at 26:2b:40:3a:8e:88, length 28
fwpr3000p0 P   ifindex 119 26:2b:40:3a:8e:88 Reply 131.173.??.?? is-at 26:2b:40:3a:8e:88, length 28
eno2       Out ifindex   3 26:2b:40:3a:8e:88 Reply 131.173.??.?? is-at 26:2b:40:3a:8e:88, length 28
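Spotting this by eye in a busy capture can be tedious, so here is a minimal sketch of a filter that lists every MAC claiming a given IP in the reply lines. The function name arp_reply_macs is made up for illustration, and it assumes the line layout shown in the capture above ("Reply <ip> is-at <mac>,").

```shell
# Hypothetical helper: read `tcpdump -ennqti any arp` output on stdin and
# print, per IP, the number of distinct MACs that claimed it and the MACs.
arp_reply_macs() {
  awk '
    {
      for (i = 1; i <= NF - 3; i++) {
        if ($i == "Reply" && $(i + 2) == "is-at") {
          ip = $(i + 1)
          mac = $(i + 3)
          sub(/,$/, "", mac)              # drop the trailing comma
          if (!((ip, mac) in seen)) {     # count each IP/MAC pair once
            seen[ip, mac] = 1
            if (ip in macs) {
              macs[ip] = macs[ip] " " mac
            } else {
              macs[ip] = mac
            }
            count[ip]++
          }
        }
      }
    }
    END {
      for (ip in macs) {
        print ip, count[ip], macs[ip]
      }
    }
  '
}
```

Something like `tcpdump -c 200 -ennqti any arp | arp_reply_macs` should then print one line per IP. Any IP with a count above 1 is being answered with more than one MAC.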

Talking to a few other Proxmox users, it seems like they have seen similar issues. Some worked around this by mapping the Proxmox web interface to one NIC and the virtual machines to another one. But that's not great and wasn't an option for one of my servers.

Looking for options, I found that adjusting net.ipv4.conf.all.arp_ignore helps. I have not seen any more problems after setting it to 2, which means:

2 - reply only if the target IP address is local address configured on the incoming interface and both with the sender's IP address are part from same subnet on this interface

To test this, just run this on your PVE server:

Code:

echo 2 > /proc/sys/net/ipv4/conf/all/arp_ignore
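Before and after changing the value, it can help to check what each interface actually ends up with, since the kernel uses the maximum of conf/all and conf/<interface> for arp_ignore. The following is only a sketch: the function name effective_arp_ignore and its optional directory argument are my own additions, not part of any tool.

```shell
# Hypothetical helper: print "<interface> <effective arp_ignore>" per interface.
# The effective value is the maximum of conf/all and conf/<interface>.
# The optional argument overrides the base directory (assumption, for dry runs).
effective_arp_ignore() {
  base="${1:-/proc/sys/net/ipv4/conf}"
  all=$(cat "$base/all/arp_ignore")
  for dir in "$base"/*/; do
    iface=$(basename "$dir")
    case "$iface" in
      all|default) continue ;;   # not real interfaces
    esac
    val=$(cat "$dir/arp_ignore")
    if [ "$all" -gt "$val" ]; then
      val=$all
    fi
    echo "$iface $val"
  done
}
```

Calling `effective_arp_ignore` without arguments reads the live values from /proc. After the echo above, every interface should report at least 2.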

The result is that only the network bridge will answer now. Here is a tcpdump showing how the servers now handle ARP requests:

Code:

eno2  B   ifindex 5 ac:78:d1:b4:e5:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
tap201i0 Out ifindex 25 ac:78:d1:b4:e5:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
vmbr1 B   ifindex 7 ac:78:d1:b4:e5:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
vmbr1 Out ifindex 7 ec:f4:bb:c2:4c:12 Reply 131.173.??.?? is-at ec:f4:bb:c2:4c:12, length 28
eno2  Out ifindex 5 ec:f4:bb:c2:4c:12 Reply 131.173.??.?? is-at ec:f4:bb:c2:4c:12, length 28
eno1  P   ifindex 4 ec:f4:bb:c2:4c:12 Reply 131.173.??.?? is-at ec:f4:bb:c2:4c:12, length 46

What is curious about this is that we only really noticed this issue over the last month or so. Maybe something has changed in PVE, or we were just lucky before.
Would it make sense for PVE to run with these settings by default?
Do you foresee any side effects I missed?

herzkerl · Dec 28, 2022

I’m having the same issue. https://forum.proxmox.com/threads/ceph-ip-laut-arp-tabelle-auch-bei-vmbr0.120105/

lkiesow · Dec 29, 2022

@herzkerl, did my outlined solution help? If so, you can make this permanent by creating a file like
/etc/sysctl.d/99-arp_ignore.conf with the content:

Code:

net.ipv4.conf.all.arp_ignore=2
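If you prefer to script this, a rough sketch could look like the following. The function name write_arp_ignore_conf and its optional directory argument are hypothetical, the latter only there so the function can be tried out without touching /etc.

```shell
# Hypothetical helper: write the drop-in file with the setting from above.
# The optional argument selects a directory other than /etc/sysctl.d.
write_arp_ignore_conf() {
  dir="${1:-/etc/sysctl.d}"
  printf 'net.ipv4.conf.all.arp_ignore=2\n' > "$dir/99-arp_ignore.conf"
}
```

Run as root, `write_arp_ignore_conf && sysctl --system` should write the file and reload all sysctl configuration files without a reboot. `sysctl net.ipv4.conf.all.arp_ignore` then shows the active value.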

I have this set on all my PVE servers. Maybe it should be set in PVE by default?

jordantrizz · May 19, 2023

I was banging my head against the wall thinking I had the same issue, but it turned out I had a bad iptables NAT rule: https://forum.proxmox.com/threads/change-port-from-8006-to-443.41710

Just posting this if anyone else runs into this.

ptanim · Oct 11, 2023

I think this is what fixed the issue that I was having for MONTHS!!!

For anyone else experiencing the same issue:

I have a VLAN with two IP addresses going into a VM: public and LAN.
I was experiencing dropouts on my public connection every 5 minutes or so, each lasting about 20 seconds (around the time of an ARP timeout). I finally found that if I flushed the ARP cache during the downtime, my connection would come straight back up.

Changing the arp_ignore setting on my PVE server has given me a solid connection with around 0% packet loss, compared to 10-20% packet loss before!!
