[SOLVED] VXLAN SDN route/MAC address mismatch after migration

glueckself

Hi all!

I've discovered an issue with VXLAN and routing, and I'm not sure if it's because my setup is wrong/weird or if it's a bug (possibly in the kernel).

My setup is:
  • Three nodes in a cluster; each node has a /32 address on loopback lo: vmhost0 10.200.255.10/32, vmhost1 10.200.255.11/32 and vmhost2 10.200.255.12/32.
  • Their ethernet ports are connected to an L3 switch with a shared /29: sw-rack03 10.200.255.225/29, vmhost0 10.200.255.226/29 (and so on).
  • The nodes run FRR ospfd, as does the switch. The route table on e.g. vmhost0 thus looks like:
    Code:
    [...]
    10.200.255.3 nhid 197 via 10.200.255.225 dev enp5s0.2224 proto ospf metric 20 # sw-rack03 lo loopback
    10.200.255.4 nhid 203 via 10.200.255.229 dev enp5s0.2224 proto ospf metric 20 # MikroTik Test loopback
    10.200.255.11 nhid 199 via 10.200.255.227 dev enp5s0.2224 proto ospf metric 20 # vmhost1 loopback
    10.200.255.12 nhid 201 via 10.200.255.228 dev enp5s0.2224 proto ospf metric 20 # vmhost2 loopback
    10.200.255.224/29 dev enp5s0.2224 proto kernel scope link src 10.200.255.226
    [...]
  • I've added vxlan-local-tunnelip 10.200.255.10 to the VXLAN tunnel interface; the remote VTEPs are the other nodes' loopback IPs (see the interface sketch after this list).
  • The plan is to later add a second L3 switch and network to the nodes, so that the loopback IP becomes available via multiple links, devices and routes.
  • I have a VM on vmhost0 (named test1) and a VM that I migrate between vmhost1 and vmhost2 (named test3).
  • In general, this works, I can ping the loopback IPs at all times, the VXLAN-based bridge works and the VMs can ping each other.
  • I'm forced to run simple VXLAN zones because MikroTik doesn't support EVPN-VXLAN.
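For reference, the VXLAN/bridge stanza in /etc/network/interfaces.d/sdn looks roughly like this (vnet name, MTU and exact attribute spelling are illustrative here, not copied verbatim from my files):
Code:
auto vxlan_vnet200
iface vxlan_vnet200
    # VNI 200, tunnel sourced from this node's loopback /32
    vxlan-id 200
    vxlan-local-tunnelip 10.200.255.10
    # remote VTEPs = the other nodes' loopback IPs
    vxlan-remoteip 10.200.255.11
    vxlan-remoteip 10.200.255.12
    mtu 1450

auto vnet200
iface vnet200
    bridge-ports vxlan_vnet200
    bridge-stp off
    bridge-fd 0
    mtu 1450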
However, if I start pinging test3 from test1 and then live-migrate it from vmhost1 to vmhost2, the VMs cannot ping each other anymore, and it only starts working again after some time without any traffic between them. I do see ARP requests and replies between the VMs when tcpdump'ing inside them, but test3 stops receiving the ICMP echo requests.

On a tcpdump on vmhost0 I've discovered that vmhost0 sends the packet to the correct new node IP, but to the wrong MAC:
Code:
# outer VXLAN packet
08:53:47.444715 a8:a1:59:2a:39:77 > a8:a1:59:1a:85:bb, ethertype 802.1Q (0x8100), length 156: vlan 2224, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 61195, offset 0, flags [none], proto UDP (17), length 138)
  10.200.255.10.42858 > 10.200.255.12.4789: VXLAN, flags [I] (0x08), vni 200
# inner packet
    c2:6d:71:ba:bd:09 > bc:24:11:21:c4:d4, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 59902, offset 0, flags [DF], proto ICMP (1), length 84)
      10.200.201.1 > 10.200.201.2: ICMP echo request, id 53466, seq 1660, length 64
Looking at the ARP cache, the :bb MAC belongs to the IP on vmhost1's ethernet interface (the next hop towards .11), while the destination IP of the packet has correctly changed to .12 (vmhost2):
Code:
10.200.255.228           ether   a8:a1:59:41:6a:61   C                     enp5s0.2224
10.200.255.227           ether   a8:a1:59:1a:85:bb   C                     enp5s0.2224

As said before, the pings between the vmhosts work at all times, and the pings between the VMs start working again if the traffic stops for some time (I haven't measured exactly how long). I've also discovered that the test-VM ping starts working after flushing the route cache with ip route flush cache. It also works if I enable ip_forwarding on vmhost1.
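For completeness, the two workarounds spelled out as commands (the sysctl form is just one way of toggling forwarding):
Code:
# drop the cached routes/next-hop entries
ip route flush cache
# alternatively, on vmhost1: enable IP forwarding (this also makes the ping work again)
sysctl -w net.ipv4.ip_forward=1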

I've attached my FRR config, please let me know if you need the other files to reproduce or any other logs/tests.

Thanks!

EDIT: here is the same issue from 4 years ago: https://www.reddit.com/r/networking/comments/ck8zin/issues_with_vxlan_live_migration_in_linux/
 
Hi,
I don't see the attached FRR config? (and /etc/pve/sdn/*.cfg)

Personally, I would use bgp-evpn with exit-nodes plus a BGP peer from the exit-node to the MikroTik. This is a well-tested setup.

About your setup, I'm not sure I understand: are you trying to do routing from the Proxmox nodes? (The VXLAN zone is really for an L2 network where you have a central gateway outside Proxmox, for example on your MikroTik router, running a flat VXLAN network.)

What are the IP addresses of VMs test1 and test3, and what is their gateway?

Because the VXLAN overlay should be transparent for the host (at the L2 level), unless you are trying to do routing (using the bridge/vnet as the gateway for the IP).
With bgp-evpn, for example, FRR listens to the bridge MAC table and IP neighbours, and auto-updates the routes correctly.
 
Hi!

I don't see the attached FRR config? (and /etc/pve/sdn/*.cfg)

Ah, now that I read the preview text, I see that attachments require specific file extensions. Sorry I missed that the first time.

I think that the vnets.cfg and zones.cfg are only used to generate /etc/network/interfaces.d/sdn (and possibly other files) and not used by the underlying Linux at all, so after initially creating the zone/vnet config with the web interface, I started modifying the /etc/network files directly. I attached them also.

Personally, I would use bgp-evpn with exit-nodes plus a BGP peer from the exit-node to the MikroTik. This is a well-tested setup.

I don't have any room for a separate host to be my exit node, and the other hardware I have available doesn't support VXLAN (or BGP-EVPN) at all. This is my homelab and my first experiment with overlay networking; please let me know if I've misunderstood you.

About your setup, I'm not sure I understand: are you trying to do routing from the Proxmox nodes? (The VXLAN zone is really for an L2 network where you have a central gateway outside Proxmox, for example on your MikroTik router, running a flat VXLAN network.)

No, I want to use VXLAN as a pure L2 overlay. There should be no routing inside the overlay. The underlay is routed to provide multiple paths to the loopback /32. I've attached a very barebones diagram (layout.png).
Currently there is only the green underlay network, but the idea is that after the orange network gets added, there will be two routes to the /32 IPs of the loopback interfaces (thus each loopback IP will be redundantly reachable from the other nodes/VTEPs).

The overlay is available on all three nodes on the "vxlan" bridge (bad naming...), and the two VMs have an interface in that bridge. test1 has 10.200.201.1/24, test3 has 10.200.201.2/24, there are no gateways inside the overlay network (and no need, I'm trying to ping 10.200.201.2 from 10.200.201.1).

The ping inside the overlay works until I migrate the test3 VM. Afterwards, I see the encapsulated packets exiting vmhost0, but the underlay frame's destination MAC doesn't match the next-hop MAC for the underlay destination IP (i.e. dst 10.200.255.12 should be routed via 10.200.255.228 (a8:a1:59:41:6a:61), but the packet's frame has dst-mac a8:a1:59:1a:85:bb).
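One way to see the mismatch from vmhost0 (output shortened and illustrative, not a verbatim capture):
Code:
# next hop the kernel picks for the new VTEP
ip route get 10.200.255.12
#   10.200.255.12 via 10.200.255.228 dev enp5s0.2224 src 10.200.255.226 ...

# MAC of that next hop: a8:a1:59:41:6a:61, not the :bb MAC seen on the wire
ip neigh show 10.200.255.228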

Thanks again :)

EDIT: formatting.
 

Attachments

  • vnets.txt (66 bytes)
  • zones.txt (99 bytes)
  • frr.txt (411 bytes)
  • interfaces_sdn.txt (349 bytes)
  • interfaces.txt (1.6 KB)
  • layout.png (54.8 KB)
Hi all!

Update: it seems to be a bug in the kernel. I asked the OP of the Reddit thread I posted above (u/wingerd33) if they had a solution, and I'll quote them:

Yeah, I took a look through the kernel code. In vxlan.c, it turns out each CPU core has its own FDB cache that's used to look up where the encapsulated packet is sent.
It's a bug. When a packet is received from a new address, containing a frame with a known MAC, all caches should be updated or invalidated. Instead, only the CPU core that processes the update refreshes its cache. The others stay stale, so if an outgoing frame gets handled by another core, it gets sent to the wrong (old) VTEP.
However, what caught my eye was
if (tos == 0) do_cache_lookup;

So the workaround is: when you create the vxlan interface, set tos to something non-zero and it won't use the caches at all. Been running this in production for a few years now without issues.
ip l add type vxlan ... tos 32

However I did all this before proxmox had the sdn stuff built in. So, my vxlan interfaces are created by ifupdown hook scripts that are triggered by the bridge interfaces coming up. So, I was able to customize it.
For you, you may have to write a little patch for the proxmox code to implement this workaround.

Your issue may be different though. I don't have time to fully wrap my head around it right now, but I glanced at your post and you mention the packets go to the new host correctly, but with the wrong MAC.
IIRC, in my issue the packets were still going to the old host.

I can confirm that setting the ToS works and I can migrate without issues. It can be done on existing interfaces with ip link set $interface type vxlan tos 2

EDIT: my "permanent" workaround will be to set vxlan-tos in /etc/network/interfaces.d/sdn.
 
Hi, thanks for the extensive report.
I'll check this tomorrow when I'm back from holiday; I don't remember having problems with VXLAN L2 and live migration. Maybe it is a kernel regression.



About bgp-evpn: you don't need a dedicated, separate exit-node. You can use one or two of the nodes where VMs are running. The exit-nodes forward traffic from the EVPN network to your real network (so you can do classic BGP from the exit-node to your MikroTik).
 
Hmm, also, I see that you use a vlan-aware vnet (and it seems that you are using a VLAN tag too, inside the VXLAN tunnel?).

I'm curious to see if it works without vlan-aware.
(Stock ifupdown2 doesn't have support for VLANs inside VXLAN, but we have patched it in Proxmox because some users requested it.)
The classic usage is rather one VXLAN per network, without VLANs inside.

Also, I'm seeing kernel patches about VXLAN / VLAN ToS:
https://lore.kernel.org/netdev/20220721202718.10092-1-matthias.may@westermo.com/t/


Also, what is your kernel version?
And the NIC model? (I'm using Mellanox NICs with VXLAN offload, so maybe I don't see this problem because of the offloading.)
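For example, something like this would show it (NIC name taken from your route table above; adjust as needed):
Code:
uname -r                      # kernel version
ethtool -i enp5s0             # NIC driver / firmware
ethtool -k enp5s0 | grep tnl  # UDP tunnel (VXLAN) offload features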
 
