BUG: VM with tag=1 on VLAN-aware bridge — traffic exits bond0 untagged, VLAN 1 broken when native VLAN ≠ 1

luis15pt

Oct 21, 2024



Hi all,


Spent ~13 hours troubleshooting this with Claude.ai after a power outage took down my homelab. Documenting here as it appears to be a known limitation in Network.pm that isn't well documented.


The Problem


VM configured with tag=1 on a VLAN-aware bridge. Upstream Cisco Nexus trunk has native VLAN 8 and VLAN 1 as tagged/allowed. VM traffic exits bond0 completely untagged (confirmed via tcpdump — no 802.1Q header), so the Cisco puts it on native VLAN 8 instead of VLAN 1.


Inbound VLAN 1 traffic from other devices arrives at the VM correctly (tagged frames are delivered). But the VM's replies go out untagged — asymmetric VLAN behaviour.


Root Cause


In /usr/share/perl5/PVE/Network.pm:


Perl:
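# An unset (or zero) tag falls back to VLAN 1.
$tag = 1 if !$tag;
# The port gets $tag as its PVID, and frames in that VLAN leave the port untagged.
run_command(['/sbin/bridge', 'vlan', 'add', 'dev', $iface, 'vid', $tag, 'pvid', 'untagged']);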

This makes tag=1 identical to no tag — both set VLAN 1 as PVID Egress Untagged on the tap interface. Since bond0 also has VLAN 1 as PVID Egress Untagged (Linux bridge default), VLAN 1 frames are always stripped on egress. Any other tag (2-4094) works fine.
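To make the interaction of the two flags concrete, here is a toy model in shell. It is not kernel code and not how Network.pm works internally; the "vid:flags" table encoding, the function names and the two bond0 tables are assumptions made for illustration (P = PVID, U = Egress Untagged, "-" = no flags).

```bash
#!/usr/bin/env bash
# Toy model of the VLAN-aware bridge ingress/egress decision (illustrative only).
# A port's VLAN table is a space-separated list of "vid:flags" entries.

# ingress_vid TABLE [FRAME_TAG]
# Untagged frames are classified into the port's PVID; tagged frames keep their VID.
ingress_vid() {
    local table="$1" tag="${2:-}" entry
    if [ -n "$tag" ]; then echo "$tag"; return 0; fi
    for entry in $table; do
        case "${entry#*:}" in
            *P*) echo "${entry%%:*}"; return 0 ;;
        esac
    done
    return 1   # no PVID on the port: untagged frames are dropped
}

# egress_form TABLE VID
# Prints how a frame of VLAN VID leaves the port: untagged, tagged or dropped.
egress_form() {
    local table="$1" vid="$2" entry
    for entry in $table; do
        if [ "${entry%%:*}" = "$vid" ]; then
            case "${entry#*:}" in
                *U*) echo untagged ;;
                *)   echo tagged ;;
            esac
            return 0
        fi
    done
    echo dropped   # VLAN not allowed on this port
}

tap="1:PU"                # what tag=1 (or no tag) produces on the tap port
bond_default="1:PU 8:-"   # bond0 with the default PVID 1
bond_pvid8="1:- 8:PU"     # hypothetical: bond0 after moving the PVID to 8

vid=$(ingress_vid "$tap")
echo "default bond0: VLAN $vid leaves $(egress_form "$bond_default" "$vid")"
echo "pvid 8 bond0:  VLAN $vid leaves $(egress_form "$bond_pvid8" "$vid")"
```

With the default table the model reports that VLAN 1 leaves untagged, matching the tcpdump capture; with the hypothetical PVID 8 table it leaves tagged.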


Environment


  • PVE 8.4.0, pve-manager 8.4.16, kernel 6.8.12-18-pve
  • qemu-server 8.4.5, ifupdown2 3.2.0-1+pmx11
  • VLAN-aware bridge on LACP bond, bridge-vids 2-4094
  • Cisco NX-OS 10.2(4), trunk with native VLAN 8, allowed VLANs 1,8

Evidence


Bash:
# Both tap and bond0 strip VLAN 1:
bridge vlan show dev tap108i0  →  1 PVID Egress Untagged
bridge vlan show dev bond0     →  1 PVID Egress Untagged

# VM egress — no 802.1Q tag:
tcpdump -i bond0 -e -nn ether src bc:24:11:a2:2e:ce
→ ethertype ARP (0x0806)    # untagged
→ ethertype IPv4 (0x0800)   # untagged

# Other devices on VLAN 1 — tagged correctly:
tcpdump -i bond0 -e -nn vlan 1
→ ethertype 802.1Q (0x8100), vlan 1   # tagged ✓
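The flag check above can be repeated per port with a small helper. This is a sketch that assumes the plain (non-JSON, non-compressed) text layout of `bridge vlan show dev <port>`, with one VLAN per row and the port name only on the first row; `vlan_flags` is a made-up name, and the layout may differ between iproute2 versions.

```bash
#!/usr/bin/env bash
# vlan_flags VID
# Reads `bridge vlan show dev <port>` text on stdin and prints the flags listed
# for VID ("-" if the VLAN has none). Exits non-zero if the VID is not listed.
vlan_flags() {
    awk -v vid="$1" '
        NR == 1 && $1 == "port" { next }      # skip the header row
        {
            # The first row of a port starts with the port name; drop it.
            if ($1 !~ /^[0-9]+$/) { $1 = ""; $0 = $0 }
            if ($1 == vid) {
                $1 = ""; sub(/^ +/, "")       # keep only the flag words
                print ($0 == "" ? "-" : $0)
                found = 1
            }
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage on the host:
#   bridge vlan show dev bond0 | vlan_flags 1
```

If the helper prints "PVID Egress Untagged" for VLAN 1 on bond0, frames of that VLAN will leave the bond without an 802.1Q header.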

What I Tried (all failed)


  • tag=1 — traffic exits untagged
  • No tag + eth0.1 VLAN subinterface inside VM — bond0 still strips VLAN 1
  • tag=8,trunks=1 — tap correct but bond0 PVID still strips VLAN 1 on egress
  • bridge-pvid 8 on vmbr0 — lost host management access
  • bridge vlan del dev bond0 vid 1 — lost host management access
  • vlan dot1q tag native on Cisco — broke all other native VLAN 8 trunks

Current State


VM temporarily moved to 192.168.8.x (VLAN 8). Not ideal — this is Home Assistant and needs L2 adjacency with 192.168.10.x devices for mDNS/discovery.
The proper fix appears to be bridge-pvid 8 + moving host management IP to vmbr0.8, but this requires IPMI/console access and changes across all cluster nodes.


Question


Is tag=1 on a VLAN-aware bridge a known unsupported configuration? Should Network.pm differentiate between explicit tag=1 (user wants VLAN 1 tagged) and the default case? Or is there a simpler workaround I'm missing?


This VM was working with the same config before the power outage. How it previously worked remains a mystery.


Sorry for using AI, but it translated my thoughts and experiments into a post that is simple enough to read.
 
Hi there,

You've correctly identified the root cause. This is a Linux bridge fundamental: VLAN 1 is the default PVID (native VLAN) on every bridge port, so with the default settings VLAN 1 egress is untagged regardless of what Proxmox tells the bridge. Network.pm treats tag=1 and no tag identically because at the kernel level they produce the same bridge VLAN table entry. VLAN 1 cannot leave tagged as long as it stays the PVID Egress Untagged entry on the port.

The proper fix is exactly what you identified: change bridge-pvid to 8 on vmbr0, which makes VLAN 8 the native/untagged VLAN and allows VLAN 1 to behave as a regular tagged VLAN.

The safe way to do it without IPMI: first add an IP on vmbr0.8 (the bridge VLAN subinterface for VLAN 8), confirm you can SSH to that IP, then move your main IP there and change bridge-pvid. You can script this to execute atomically so you don't lose SSH mid-change. Something like: run ip addr add 192.168.8.x/24 dev vmbr0.8, SSH in from another session to verify, then proceed with the PVID change.

Alternatively, if your Cisco trunk allows it, you could move the native VLAN on the Cisco side to something not used by any VM, which would also free VLAN 1 to be treated as tagged.

Either way, your diagnosis is spot on, and this catches a lot of people who move from unmanaged or access-port setups to proper VLAN-aware bridges.
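One way to script the change so that a failed step does not leave the host unreachable is an apply/verify/rollback wrapper. This is only a sketch of that pattern: `guarded_change`, the backup path and `MGMT_PEER` are made-up names, and the commands in the usage comment are placeholders to adapt, not a tested migration for this cluster.

```bash
#!/usr/bin/env bash
# guarded_change APPLY VERIFY ROLLBACK
# Runs APPLY, then VERIFY. If either fails, runs ROLLBACK and returns non-zero.
# Each argument is a command string executed with `sh -c`.
guarded_change() {
    local apply="$1" verify="$2" rollback="$3"
    if sh -c "$apply" && sh -c "$verify"; then
        echo "change kept"
        return 0
    fi
    echo "verification failed, rolling back" >&2
    sh -c "$rollback"
    return 1
}

# Hypothetical usage, after editing /etc/network/interfaces and keeping a copy
# of the old file. MGMT_PEER is a placeholder for a host that must stay reachable.
#
#   cp /etc/network/interfaces /etc/network/interfaces.bak   # before editing
#   guarded_change 'ifreload -a' \
#       "ping -c 3 -W 2 $MGMT_PEER" \
#       'cp /etc/network/interfaces.bak /etc/network/interfaces && ifreload -a'
```

Because the check runs from the host itself, the script does not depend on the SSH session surviving the change. Starting it under tmux or nohup keeps it running if the session drops.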