Troubleshooting Massive Packet Loss with Proxmox Virtual Bridges and WAN Traffic

Knogle

Sep 11, 2023
Hi friends, I hope you're doing well!

I'm encountering a specific issue in my network and could use some advice.

I experience random bursts of high packet loss in the network, particularly with my internet connection. Here’s the sequence of events:

  1. Initially, I noticed these issues with my first WAN connection.
  2. I then added a second WAN connection, known to be rock-stable.
  3. Unfortunately, the same issues occurred with the new connection.

Network Setup

My network consists of the following:

  • 2 Proxmox Hosts: Each connected with:
    • 2x 1Gbps LACP links to their respective access switches (no VPC/MLAG).
    • 2x 10Gbps LACP links to the core switch.
  • 2 Access Switches:
    • sin01-edge-psw01:
      • Connected to the core switch (Nexus 3000) via a 4x 1Gbps LACP bond.
      • WAN edge routers are connected here.
      • VLAN 3 to Proxmox millenium-fbe49
    • sin01-edge-psw02:
      • Connected to the core switch via a 2x 1Gbps LACP bond.
      • Fedora host is connected here.
      • VLAN 3 to Proxmox millenium-fbe50
  • 1 Core Switch (Nexus 3000):
    • Central point of connection for access switches and Proxmox Hosts.
    • VLAN 7 to Proxmox millenium-fbe49 and millenium-fbe50
For simplicity, let’s focus on:

  • 2 VLANs
  • 1 WAN connection
Topology:

[Two topology diagrams were attached here; the images were not preserved.]

Observed Behavior

  1. Without OPNsense VM:
    • Pinging the internet from the Fedora host (connected to sin01-edge-psw02) without going through the OPNsense VM shows no packet loss. This suggests the switching fabric itself is working correctly.
  2. With OPNsense VM:
    • Sending traffic through the OPNsense VM introduces excessive packet loss.
    • MTR shows ~20% packet loss between the 192.168.3.0/24 network (VLAN 3) and the OPNsense VM interface, as well as between OPNsense and the WAN, also in the inbound direction.
    • In programs like Discord, people can hear me well, but I can't hear them at all, which points to loss on inbound traffic (almost certainly these drops).
    • Key observation: Excessive packet drops are shown on the Proxmox virtual bridges.

Bridge Statistics

vmbr0 Interface:

RX: 2,572,036,239 packets (637,300,259 dropped)
TX: 78,666,453 packets (0 dropped)

vmbr1 Interface:

RX: 284,869,426 packets (10,593 dropped)
TX: 118,726,145 packets (0 dropped)
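
To put those counters in perspective, here is a small sketch that turns them into drop percentages. The numbers are copied from the statistics above; the function name and the choice to divide by the RX packet counter are my own assumptions (whether the kernel counts dropped frames inside the RX total can vary, so treat the result as a rough ratio rather than an exact loss rate).

```python
# RX counters copied from the bridge statistics in this post.
BRIDGE_COUNTERS = {
    "vmbr0": {"rx_packets": 2_572_036_239, "rx_dropped": 637_300_259},
    "vmbr1": {"rx_packets": 284_869_426, "rx_dropped": 10_593},
}


def rx_drop_percent(rx_packets: int, rx_dropped: int) -> float:
    """Return dropped packets as a percentage of the RX packet counter."""
    if rx_packets == 0:
        return 0.0
    return 100.0 * rx_dropped / rx_packets


# Assumption: the ratio is taken against the RX packet counter as reported.
DROP_PERCENT = {
    name: rx_drop_percent(**counters)
    for name, counters in BRIDGE_COUNTERS.items()
}

for name, percent in DROP_PERCENT.items():
    print(f"{name}: {percent:.4f}% of RX packets dropped")
```

By this measure vmbr0 has dropped roughly a quarter of what it received, while vmbr1 is well below a hundredth of a percent.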

Testing Traffic

  1. Low WAN Traffic:
    • Running a speed test over the WAN causes significant drops (~25,000 drops/sec on vmbr0).
  2. High LAN Traffic:
    • Running iperf3 within the 192.168.3.0/24 subnet shows only ~20 drops/sec—no significant issues.
  3. Changing Topology:
    • Moving the Proxmox-Fedora link entirely to the core switch (10Gbps fiber) reduced packet loss:
      • Less overall loss (~1%), but WAN-related traffic still caused heavy drops on the virtual bridge.
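
The drops/sec figures above can be reproduced by sampling the bridge's drop counter twice and dividing by the interval. This is a minimal sketch, assuming a Linux host that exposes the counter at `/sys/class/net/<iface>/statistics/rx_dropped`; the helper names are hypothetical and this is not necessarily how the numbers in this post were collected.

```python
import time
from pathlib import Path


def read_rx_dropped(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the cumulative RX drop counter for one interface from sysfs."""
    counter = Path(sysfs_root, iface, "statistics", "rx_dropped")
    return int(counter.read_text().strip())


def drops_per_second(before: int, after: int, interval_s: float) -> float:
    """Convert two cumulative counter readings into a per-second rate."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (after - before) / interval_s


def sample_drop_rate(iface: str, interval_s: float = 1.0) -> float:
    """Sample the counter twice, interval_s apart, and return drops/sec."""
    before = read_rx_dropped(iface)
    time.sleep(interval_s)
    after = read_rx_dropped(iface)
    return drops_per_second(before, after, interval_s)
```

Calling `sample_drop_rate("vmbr0")` in a loop on the Proxmox host while a speed test or iperf3 run is active makes it easy to compare the WAN and LAN cases side by side.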

Key Findings

  1. WAN Traffic Issue: Even low-rate WAN traffic causes massive drops on vmbr0.
  2. LAN Traffic Stable: High LAN traffic does not produce excessive drops.
  3. Virtualization Dependency: Drops occur only when traffic passes through a VM (e.g., OPNsense, OpenWrt).
  4. Host Consistency: Moving VMs between Proxmox hosts didn’t solve the issue (both hosts are identical hardware).
  5. Topology Changes: Eliminating copper connections between Proxmox and access switches reduces packet loss but doesn’t fully solve the problem.
I’m stumped! As a network engineer, I suspect an issue related to:

  • Virtual bridge performance or misconfiguration on Proxmox.
  • Possible driver, hardware offloading, or interrupt handling problems.
  • Any other potential issue?
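
To check the offloading suspicion, one option is to dump the offload settings of the physical NICs and compare them between interfaces. The sketch below assumes `ethtool` is installed and that `ethtool -k <iface>` prints one `feature-name: on|off` pair per line; the function names and the interface name `eno1` in the usage note are hypothetical placeholders, not taken from my setup.

```python
import subprocess


def parse_offload_features(ethtool_output: str) -> dict[str, bool]:
    """Parse `ethtool -k` style text into {feature_name: enabled}."""
    features: dict[str, bool] = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        words = value.split()
        # Skip the "Features for <iface>:" header and any non on/off lines.
        if not sep or not words or words[0] not in ("on", "off"):
            continue
        features[name.strip()] = words[0] == "on"
    return features


def read_offload_features(iface: str) -> dict[str, bool]:
    """Run `ethtool -k <iface>` and return the parsed feature map."""
    result = subprocess.run(
        ["ethtool", "-k", iface],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_offload_features(result.stdout)
```

For example, `read_offload_features("eno1")` returns a dictionary that can be diffed against the same call for another NIC to spot settings that differ.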
Any advice on how to troubleshoot further or potential fixes would be greatly appreciated!