VMs with random MAC start getting spotty network over IPv6 at some point

ernw-it

Member
Dec 13, 2022
We're running a 3-node cluster on PVE 8. We routinely reboot nodes to apply security or regular updates, and most of the VMs have a fairly high uptime (15-180 days) because we migrate them away before a node reboot. Most of them run IPv6-only, some dual stack. All VM network devices are created with the firewall off and use VirtIO drivers (Linux guests). Most of our VMs still have a random MAC address set (which was the previous default), not yet one from the Proxmox-assigned prefix. The hypervisor firewall is also off. The hardware is HP ProLiant Gen8.

After some time we noticed strange behaviour of one individual VM:

Connections into and out of the VM become spotty: odd packet loss results in a lot of TCP retransmissions and in connections being limited to a few Mbit/s. This appeared only with IPv6 and was most prominent when the VM was sending large amounts of data, e.g. during an upload. During testing, we noticed that IPv4 was not affected, or at least not as much. IPv6 was very unstable at around 15-30 Mbit/s for sending, with hundreds to thousands of retransmissions. Receiving initially yielded 3 Gbit/s for the first second and dropped to about 1 Gbit/s for the remaining 9 seconds. The hardware node is connected to the network via a 2x 10 Gbit/s bond, and all testing was performed within our network.

We also tested and made sure that the hypervisor node hosting the slow VM itself worked fine via iperf3 TCP and UDP (around 8 Gbit/s sending and receiving). We also tested the individual network VLAN from other places and through other devices (VMs and hardware) to rule out strange networking outside of the hypervisor.
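For anyone wanting to reproduce this kind of measurement, here is a rough sketch of an iperf3 test matrix. The exact invocations are not recorded in this post, so the flag combinations, the 1 Gbit/s UDP target rate and the server name `iperf.example.net` are assumptions, as are the helper names `run` and `throughput_tests`.

```shell
#!/bin/sh
# Hypothetical iperf3 server inside the same network; replace with a real host.
SERVER="${SERVER:-iperf.example.net}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run from inside the affected VM (or on the hypervisor for comparison).
throughput_tests() {
    run iperf3 -6 -c "$SERVER" -t 10            # IPv6 TCP, client sends
    run iperf3 -6 -c "$SERVER" -t 10 -R         # IPv6 TCP, client receives (reverse mode)
    run iperf3 -4 -c "$SERVER" -t 10            # IPv4 TCP, client sends, for comparison
    run iperf3 -6 -c "$SERVER" -t 10 -u -b 1G   # IPv6 UDP at an assumed 1 Gbit/s target rate
}
```

Calling `DRY_RUN=1 throughput_tests` only prints the commands; without `DRY_RUN` they are executed against a host running `iperf3 -s`. Comparing the IPv6 and IPv4 runs, and the sending and receiving directions, is what exposes the asymmetry described above.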

We were able to fix the problem by deleting the MAC address in the web interface, which causes Proxmox to generate one from the Proxmox-assigned prefix https://macaddress.io/statistics/company/32814, and then rebooting the VM. Speeds immediately recovered to 8.5 Gbit/s sending and 7.5 Gbit/s receiving, back to normal.

We then changed the mac back to the random mac and rebooted once again: no more problems, everything still at normal.

Does this ring a bell, or has anyone seen a similar situation? Is this related to changing the MAC? Or to changing the MAC from random to the Proxmox prefix? Or to rebooting the VM afterwards? Does time play a role in the problem appearing?
 
I've now found five more VMs on this cluster with the exact same problem, so I'm able to do more testing:

- other VMs on the cluster and nodes are not affected
- moving the VM from one node to another does not change the problem
- migrating it back to the original node doesn't change anything
- executing `ip l set down ens18; ip l set up ens18` in the VM causes the network to completely stop, `ip a` reports the IPv4 address normally configured and the interface as up
- then executing `ifdown ens18; ifup ens18` in the VM immediately fixed the problem. Four more VMs to debug now :)
 
We solved the problem. It was network related: it appears when VRRP gateways are used with EVPN multihoming in a VLAN, in combination with VTEPs that have no VRRP configured in the same VLAN. It persisted even after all VTEPs were reconfigured to use VRRP again, and it left the affected VMs with a wrong neighbor MAC entry for `fe80::1`:
Code:
fe80::1 dev ens18 lladdr $mac router REACHABLE
(where $mac was the MAC of one offending VTEP without VRRP), which should read
Code:
fe80::1 dev ens18 lladdr 00:00:5e:00:01:01 router REACHABLE
The fix is to remove the neighbor entry (I've seen no traffic impact):
Code:
ip neigh del fe80::1 dev ens18
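
With several VMs to check, the comparison and the delete can be scripted. This is only a minimal sketch: the gateway address, interface name and expected MAC are taken from the entries above and will differ in other setups, and the function names (`neigh_mac`, `is_stale`, `fix_gateway_neigh`) are made up for illustration.

```shell
#!/bin/sh
# Assumed values from this post; adjust to your own setup.
GATEWAY="fe80::1"
IFACE="ens18"
EXPECTED_MAC="00:00:5e:00:01:01"

# Extract the lladdr field from one line of `ip neigh show` output.
# Prints nothing if the line has no lladdr (e.g. a FAILED entry).
neigh_mac() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i < NF; i++)
            if ($i == "lladdr") { print $(i + 1); exit }
    }'
}

# Succeeds if the neighbor line carries a MAC other than the expected one.
is_stale() {
    mac=$(neigh_mac "$1")
    [ -n "$mac" ] && [ "$mac" != "$EXPECTED_MAC" ]
}

# Look up the gateway entry and delete it only if it is wrong (needs root).
fix_gateway_neigh() {
    line=$(ip -6 neigh show "$GATEWAY" dev "$IFACE")
    if is_stale "$line"; then
        echo "stale entry: $line"
        ip neigh del "$GATEWAY" dev "$IFACE"
    fi
}
```

Running `fix_gateway_neigh` as root inside an affected VM removes the entry only when the MAC differs; the kernel then re-resolves the gateway on the next packet. Entries that are already correct, or that have no link-layer address at all, are left alone.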

Thank you for your attention and good day :)
 