While I have solved the issue, I hope this is helpful to others who run into the same problem. Also, it would be great if someone could verify that this is a sensible approach.
Recently, we kept losing the network connection to our PVE servers at semi-regular intervals. Virtual machines on those servers that use a network bridge to the external network could still be reached, though. Assuming a network problem, I talked to one of our network engineers, who, after a short investigation, pointed out that the server did not only answer ARP requests with the MAC address it should respond with, but with several different MAC addresses. This meant that the server's IP address was sometimes linked to the correct MAC and sometimes… not. Which one wins probably comes down to timing and luck.
Looking into this using tcpdump, you can clearly see not only the network bridge but also the virtual machine interfaces answering:
Code:
❯ tcpdump -ennqti any arp
eno2 B ifindex 3 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
fwpr3000p0 Out ifindex 119 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
vmbr0 B ifindex 48 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
↓ the pve bridge answers
vmbr0 Out ifindex 48 90:1b:0e:10:60:db Reply 131.173.??.?? is-at 90:1b:0e:10:60:db, length 28
eno2 Out ifindex 3 90:1b:0e:10:60:db Reply 131.173.??.?? is-at 90:1b:0e:10:60:db, length 28
fwln3000i0 B ifindex 120 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
tap3000i0 Out ifindex 117 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
fwbr3000i0 B ifindex 118 ac:78:d1:b4:72:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
↓ the virtual machine answers as well
fwbr3000i0 Out ifindex 118 26:2b:40:3a:8e:88 Reply 131.173.??.?? is-at 26:2b:40:3a:8e:88, length 28
fwln3000i0 Out ifindex 120 26:2b:40:3a:8e:88 Reply 131.173.??.?? is-at 26:2b:40:3a:8e:88, length 28
fwpr3000p0 P ifindex 119 26:2b:40:3a:8e:88 Reply 131.173.??.?? is-at 26:2b:40:3a:8e:88, length 28
eno2 Out ifindex 3 26:2b:40:3a:8e:88 Reply 131.173.??.?? is-at 26:2b:40:3a:8e:88, length 28
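Spotting the duplicate answers by eye gets tedious on a busy network. The following is a minimal sketch, not part of the original investigation, that summarizes a capture: it reads tcpdump output in the format shown above and prints every IP address that was claimed by more than one MAC address. The function name `arp_conflicts` is made up for illustration, and the field positions assume the exact `tcpdump -ennqti any arp` layout from the capture above.

```shell
# Hypothetical helper: reads `tcpdump -ennqti any arp` output on stdin and
# prints each IP address that appears in ARP replies with more than one MAC.
# Assumes the field layout shown above:
#   <iface> <dir> ifindex <n> <src-mac> Reply <ip> is-at <mac>, length <n>
arp_conflicts() {
  awk '
    $6 == "Reply" && $8 == "is-at" {
      ip = $7
      mac = $9
      sub(/,$/, "", mac)            # strip the trailing comma after the MAC
      if (!((ip, mac) in seen)) {   # count each IP/MAC pair only once
        seen[ip, mac] = 1
        count[ip]++
        macs[ip] = (ip in macs) ? macs[ip] " " mac : mac
      }
    }
    END {
      for (ip in count)
        if (count[ip] > 1)
          print ip ": " macs[ip]
    }
  '
}
```

Because the summary is only printed at the end of the input, it needs a finite capture, for example `tcpdump -ennqti any -c 200 arp | arp_conflicts`. No output means that no IP address was answered for by more than one MAC during the capture.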
Talking to a few other Proxmox users, it seems like they have seen similar issues. Some worked around this by mapping the Proxmox web interface to one NIC and the virtual machines to another one. But that's not great and wasn't an option for one of my servers.
Looking for options, I found that adjusting net.ipv4.conf.all.arp_ignore helps. I have not seen any more problems after setting it to 2, which means:
2 - reply only if the target IP address is local address configured on the incoming interface and both with the sender's IP address are part from same subnet on this interface
To test this, just run this on your PVE server:
Code:
echo 2 > /proc/sys/net/ipv4/conf/all/arp_ignore
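Note that a value written to /proc this way is lost on reboot. One way to keep it, sketched here under the assumption that the host reads drop-in files from /etc/sysctl.d at boot (as Debian-based systems normally do), is to write the setting to such a file. The function name `persist_arp_ignore` and the file name `90-arp-ignore.conf` are arbitrary choices for illustration, not something PVE prescribes.

```shell
# Hypothetical helper: writes a sysctl drop-in that sets arp_ignore to 2.
# The target directory defaults to /etc/sysctl.d; the file name
# 90-arp-ignore.conf is an arbitrary name picked for this sketch.
persist_arp_ignore() {
  dir="${1:-/etc/sysctl.d}"
  printf '%s\n' 'net.ipv4.conf.all.arp_ignore = 2' > "$dir/90-arp-ignore.conf"
}
```

Run as root, `persist_arp_ignore && sysctl --system` writes the file and reloads all sysctl configuration immediately. Afterwards, `sysctl net.ipv4.conf.all.arp_ignore` should report the value 2.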
The result is that only the network bridge answers now. Here is a tcpdump showing how the servers now handle ARP requests:
Code:
eno2 B ifindex 5 ac:78:d1:b4:e5:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
tap201i0 Out ifindex 25 ac:78:d1:b4:e5:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
vmbr1 B ifindex 7 ac:78:d1:b4:e5:90 Request who-has 131.173.??.?? tell 0.0.0.0, length 46
vmbr1 Out ifindex 7 ec:f4:bb:c2:4c:12 Reply 131.173.??.?? is-at ec:f4:bb:c2:4c:12, length 28
eno2 Out ifindex 5 ec:f4:bb:c2:4c:12 Reply 131.173.??.?? is-at ec:f4:bb:c2:4c:12, length 28
eno1 P ifindex 4 ec:f4:bb:c2:4c:12 Reply 131.173.??.?? is-at ec:f4:bb:c2:4c:12, length 46
What is curious about this is that we only really noticed this issue over the last month or so. Maybe something has changed in PVE, or we were just lucky before.
Would it make sense for PVE to run with these settings by default?
Do you foresee any side effects I missed?