Incorrect MTU calculation when CT sending packets over bridged VXLAN port

tim_eves

New Member
Apr 10, 2023
We have an overlay network (configured as described below) which worked fine on PVE 6.4, but after migrating the containers to a PVE 7.2 node we have noticed some odd behavior: packets above a certain size are discarded by the vxlan interface, but only when sent from a container; VMs continue to work fine. I'll describe our config first:

Each PVE node is configured with its host network address on vmbr0 as usual (192.0.2.X/24). A second bridge, vmbr1, is created for guests only on the overlay network (203.0.113.0/24), and a configured VXLAN interface is added to it; that interface has the addresses of its PVE node peers prepopulated into its fdb. Guest containers and VMs are then added to vmbr1. We turn on forwarding and also add a masquerade rule to vmbr0 for traffic coming from 203.0.113.0/24 so guests can access the internet. This has worked well and has let us share a private guest network address space and migrate machines between nodes in the cluster at will. The vmbr1 bridge has an MTU of 1500, as do all the ports on it. On the PVE 6.4 kernel this setup would, via PMTUD, instruct guests to send inter-node overlay traffic with an MTU of 1450, allowing the whole thing to work.
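Roughly, the per-node setup amounts to something like the following (a simplified sketch; the VNI, device names and addresses here are placeholders rather than our real values):
Bash:
# Create the VXLAN port and attach it to the guest bridge (placeholder VNI and addresses)
ip link add vxlan42 type vxlan id 42 dstport 4789 local 192.0.2.1 nolearning
ip link set vxlan42 master vmbr1 up

# Prepopulate the fdb with the other PVE nodes, one entry per peer
bridge fdb append 00:00:00:00:00:00 dev vxlan42 dst 192.0.2.2
bridge fdb append 00:00:00:00:00:00 dev vxlan42 dst 192.0.2.3

# Forwarding plus masquerading out of vmbr0 so guests can reach the internet
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -s 203.0.113.0/24 -o vmbr0 -j MASQUERADE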

After upgrading, packets traversing the bridge to the vxlan port still generate ICMP "unreachable - need to frag (mtu 1450)" packets, but the CT guest responds by shrinking the packet only by the TCP/IP header overhead rather than to the advertised MTU. E.g. the outgoing packet from a guest entering the bridge is 1552 octets incl. TCP/IP headers (1500 octets of payload); the vxlan interface responds with a need-to-fragment ICMP packet requesting 1450 octets; the guest then retries with a 1500 octet (incl. TCP/IP) packet instead of the 1450 requested. This is then repeated until something times out, with the other end never seeing its data and appearing to freeze. When this is tried with a VM on the same bridge instead, it works fine.
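For what it's worth, this is roughly how we watched that exchange on the node hosting the CT (the interface name and port are just the ones from our test case below):
Bash:
# Show the ICMP need-to-frag replies and the retried TCP segments on the guest bridge
tcpdump -nv -i vmbr1 'icmp or tcp port 8008'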

I have tried setting the vxlan interface's df parameter to "unset" (though the man page says that's the default anyway), to no avail. I suspect this is either due to a change in the vxlan or bridging code, or may be a bug/change in the veth driver.
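In case it matters, that change amounts to something like the following (placeholder names and VNI again; shown here by recreating the port, since I don't believe df can be changed on a live device):
Bash:
# Recreate the vxlan port with df explicitly unset, then re-attach it to the bridge
ip link del vxlan42
ip link add vxlan42 type vxlan id 42 dstport 4789 local 192.0.2.1 df unset nolearning
ip link set vxlan42 master vmbr1 up
# (the static fdb entries for the peer nodes then need re-adding)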

We've run into this on several hosts with various pieces of hardware, so I don't suspect host hardware/driver issues (also, the VM case works on the same hardware where the CT case doesn't).
Our reproducible test case involves creating a CT on a PVE 7.x node (e.g. 203.0.113.20), configured with the above overlay network, and then running the following in that guest:
Bash:
nc -v -q0 -l -p 8008 < mtu_packet
(Where mtu_packet is a file containing 1500 bytes of zeros or random data)
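For reference, that file can be created with something like:
Bash:
# 1500 bytes of zeros; /dev/urandom works just as well
dd if=/dev/zero of=mtu_packet bs=1500 count=1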
On a 2nd machine configured with access to the same overlay network I run:
Bash:
nc 203.0.113.20 8008 | wc -c
If it works it will return 1500; otherwise netcat locks up.

(BTW I have picked addresses from the IETF TEST-NET-1 and TEST-NET-3 address spaces for illustrative purposes; we don't actually use those addresses anywhere.)

Any help in further diagnosing this, solutions, or even alternative ways to achieve the same goals would be welcome.
 