Sending MPLS Frames Inter-Host causes Proxmox to discard the IP reply

Aug 9, 2024
We have a fairly complicated network environment within our Proxmox cluster, where we run 6x VyOS VMs running IS-IS and SR-MPLS. We started noticing that when a VM needed to send a frame requiring an MPLS label to a VM on a different host, the IP reply would be dropped at the originating host before being delivered to the VM. We ran multiple captures, both inside the VM itself and on the Proxmox host: the reply would hit the host but never the VM. If the source packet was plain IP, this never happened.
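For anyone reproducing this, the comparison was simply two captures side by side, one on the host bridge and one inside the guest. The interface names below (vmbr1 on the host, eth1 in the guest) are placeholders for your own topology:

      # On the Proxmox host: MPLS frames egress and the plain IP replies return here
      tcpdump -ni vmbr1 'mpls or icmp'

      # Inside the VyOS VM: the reply never showed up here
      sudo tcpdump -ni eth1 'mpls or icmp'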

Testing from the outside in (from a hardware-based router) to any MPLS peer was always successful, but the inverse failed whenever the next-hop peer was on a different physical node from the one sending the reply.

The ultimate fix was simply to enable the nftables backend in Proxmox. I hope this saves someone several days of troubleshooting; I spent a good while on this.

Okay, here is a high-level synopsis of the troubleshooting journey and the ultimate resolution:

Initial Problem:

  • Setup: A multi-node Proxmox VE cluster hosting VyOS 1.5 VMs configured for SR-MPLS using IS-IS. VMs (RRs, AGGs, IBRs) were distributed across different hosts for redundancy. The physical network uses Dell OS10 switches.
  • Symptoms: Intermittent but persistent failures when pinging between the loopback interfaces of VyOS VMs located on different Proxmox hosts (example below). Pings worked fine if the source and destination (and the intermediate AGG hop) were on the same host. Plain IP pings between transit interfaces across hosts also worked correctly, as did pings to the VyOS loopbacks from external physical routers (like BNG1). The failure specifically affected traffic requiring MPLS label imposition, initiated by a VyOS VM, going inter-host.
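For concreteness, a failing test looked like the following, run from a VyOS VM toward a loopback on another host. The destination is the 10.127.0.1/32 loopback from the route example further down; the source loopback 10.126.255.1 is a hypothetical stand-in for your own, and the source-address option follows VyOS op-mode ping syntax (worth double-checking on your build):

      ping 10.127.0.1 source-address 10.126.255.1

The echo request left the VM with an MPLS label; the echo reply made it back to the host but never to the VM.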
Troubleshooting Highlights & Key Findings:

  1. MPLS Egress Confirmed: Initial suspicion fell on VyOS failing to send the MPLS packets, but packet captures directly on the source VyOS VM's virtual NIC (e.g., eth1 on RR1) confirmed it was correctly encapsulating ICMP echo requests with the appropriate MPLS label learned via IS-IS SR.
    • Example Failing Route (on RR1):
      I>* 10.127.0.1/32 [115/220] via 10.126.0.4, eth1, label 16001, weight 1
        *                         via 10.126.0.40, eth2, label 16001, weight 1
  2. Reply Arriving at Host: Captures showed the plain IP ICMP echo reply did arrive back at the source Proxmox host's network bridge.
  3. Packet Drop Before VM: Captures showed the reply was not delivered from the host bridge to the source VM's tap interface.
  4. Other Variables Ruled Out: Tests confirmed the issue persisted regardless of host NIC hardware features (SR-IOV/NPAR), VM vNIC type (virtio vs E1000), and NIC offload settings within VyOS.
  5. Firewall Identified as Blocker: Temporarily stopping the pve-firewall service (and its backend iptables/nftables rules) on the source host allowed the pings to succeed (see the command sketch after this list).
  6. Root Cause Identified: The Proxmox host's stateful firewall (pve-firewall using its default iptables backend + conntrack) failed to track the connection state for flows initiated by guest VMs using MPLS encapsulation egressing the host bridge. This caused the firewall to incorrectly drop the valid incoming plain IP replies as untracked/invalid packets at the bridge level.
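A rough sketch of the isolation steps behind points 3, 5, and 6, all run on the source Proxmox host. The tap interface name tap101i1 is hypothetical (VMID 101, netdev 1), and conntrack requires the conntrack-tools package:

      # Point 3: capture on the VM's tap interface; the ICMP reply never appeared here
      tcpdump -ni tap101i1 icmp

      # Point 5: temporarily stop the host firewall; with it stopped, inter-host
      # MPLS pings succeeded. Remember to start it again afterwards.
      pve-firewall stop
      pve-firewall status
      pve-firewall start

      # Point 6: inspect conntrack state; per the diagnosis above, the MPLS-encapsulated
      # egress never created an entry, so the plain IP reply arrived untracked
      conntrack -L -p icmp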
The Ultimate Fix:

The solution was to change the backend used by the Proxmox firewall (pve-firewall) on the hosts from the legacy iptables to the modern nftables.
  • Action: Enabled the nftables backend via the Proxmox GUI (Host -> Firewall -> Options -> check "NFTables") or by setting nftables: 1 in /etc/pve/nodes/<node>/host.fw (snippet after this list).
  • Result: With the nftables backend active, the host firewall correctly tracked the state for the MPLS-initiated flows (or handled the untracked replies more gracefully), allowing the ICMP replies to be delivered successfully to the originating VyOS VM. All inter-host MPLS pings started working reliably.
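For reference, the relevant host.fw stanza is just the one option; this is a sketch of the minimal change, not our full firewall config, and <node> is your host's name:

      # /etc/pve/nodes/<node>/host.fw
      [OPTIONS]
      nftables: 1

      # Once active, the generated ruleset can be inspected from the host shell:
      nft list ruleset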
In essence, the problem was a limitation or bug in the interaction between the host's iptables/conntrack implementation and bridged MPLS traffic from guests, which was resolved by switching to the more modern nftables firewall backend.