BUG: VM with tag=1 on VLAN-aware bridge — traffic exits bond0 untagged, VLAN 1 broken when native VLAN ≠ 1
Hi all,
Spent ~13 hours troubleshooting this with Claude.ai after a power outage took down my homelab. Documenting here as it appears to be a known limitation in Network.pm that isn't well documented.
The Problem
VM configured with tag=1 on a VLAN-aware bridge. Upstream Cisco Nexus trunk has native VLAN 8 and VLAN 1 as tagged/allowed. VM traffic exits bond0 completely untagged (confirmed via tcpdump — no 802.1Q header), so the Cisco puts it on native VLAN 8 instead of VLAN 1.
Inbound VLAN 1 traffic from other devices arrives at the VM correctly (tagged frames are delivered). But the VM's replies go out untagged — asymmetric VLAN behaviour.
Root Cause
In /usr/share/perl5/PVE/Network.pm:
Perl:
$tag = 1 if !$tag;
run_command(['/sbin/bridge', 'vlan', 'add', 'dev', $iface, 'vid', $tag, 'pvid', 'untagged']);
This makes tag=1 identical to no tag — both set VLAN 1 as PVID Egress Untagged on the tap interface. Since bond0 also has VLAN 1 as PVID Egress Untagged (Linux bridge default), VLAN 1 frames are always stripped on egress. Any other tag (2-4094) works fine.
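To make the frame path concrete, here is a toy model in plain shell. It is only a sketch, not bridge or PVE code: the helper names ingress and egress are made up for illustration, and the port settings are the ones described above (tap PVID 1, bond0 VLAN 1 Egress Untagged, switch native VLAN 8).

```shell
#!/bin/sh
# Toy model of 802.1Q handling on a VLAN-aware bridge (illustration only).

# ingress PVID [TAG]: VLAN a frame is assigned when it enters a port.
# An untagged frame (no TAG) takes the port PVID; a tagged frame keeps its VID.
ingress() {
    if [ -n "$2" ]; then echo "$2"; else echo "$1"; fi
}

# egress VID UNTAGGED_LIST: tag the frame carries on the wire when leaving a port.
# Prints an empty line if VID is in the port's space-separated egress-untagged list.
egress() {
    case " $2 " in
        *" $1 "*) echo "" ;;
        *) echo "$1" ;;
    esac
}

# The VM sends an untagged frame into the tap (PVID 1, as set for tag=1).
vid=$(ingress 1)
# bond0 has VLAN 1 flagged Egress Untagged, so the tag is stripped.
wire_tag=$(egress "$vid" "1")
# The switch trunk has native VLAN 8: an untagged frame lands in VLAN 8.
switch_vid=$(ingress 8 "$wire_tag")

echo "bridge VLAN: $vid, tag on wire: ${wire_tag:-none}, switch VLAN: $switch_vid"
# prints: bridge VLAN: 1, tag on wire: none, switch VLAN: 8
```

Running the same three steps with tag 10 (ingress 10, then egress 10 "1") keeps the tag on the wire, which matches the observation that every tag other than 1 works.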
Environment
- PVE 8.4.0, pve-manager 8.4.16, kernel 6.8.12-18-pve
- qemu-server 8.4.5, ifupdown2 3.2.0-1+pmx11
- VLAN-aware bridge on LACP bond, bridge-vids 2-4094
- Cisco NX-OS 10.2(4), trunk with native VLAN 8, allowed VLANs 1,8
Evidence
Bash:
# Both tap and bond0 strip VLAN 1:
bridge vlan show dev tap108i0
#   1 PVID Egress Untagged
bridge vlan show dev bond0
#   1 PVID Egress Untagged
# VM egress: no 802.1Q tag
tcpdump -i bond0 -e -nn ether src bc:24:11:a2:2e:ce
#   ethertype ARP (0x0806)     (untagged)
#   ethertype IPv4 (0x0800)    (untagged)
# Other devices on VLAN 1: tagged correctly
tcpdump -i bond0 -e -nn vlan 1
#   ethertype 802.1Q (0x8100), vlan 1    (tagged ✓)
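For anyone who wants to repeat the port check on several interfaces, a small helper along these lines might save some squinting. It is a sketch under assumptions: vlan_egress is a hypothetical name, it expects the usual plain-text layout of `bridge vlan show dev PORT` (a header line, the port name only on the first VLAN row, ranges such as 2-4094 when -c is used), it assumes port names are not purely numeric, and the sample table is typed in by hand rather than captured from a host.

```shell
#!/bin/sh
# vlan_egress VID: read `bridge vlan show dev PORT` text on stdin and print
# "untagged", "tagged", or "absent" for the given VLAN ID.
vlan_egress() {
    awk -v vid="$1" '
        $1 == "port" || NF == 0 { next }    # skip header and blank lines
        {
            # Rows without a port name start directly with the VLAN id or range.
            col = ($1 ~ /^[0-9]+(-[0-9]+)?$/) ? 1 : 2
            n = split($col, r, "-")
            lo = r[1] + 0
            hi = (n > 1) ? r[2] + 0 : lo
            if (vid + 0 >= lo && vid + 0 <= hi) {
                print (($0 ~ /Egress Untagged/) ? "untagged" : "tagged")
                found = 1
                exit
            }
        }
        END { if (!found) print "absent" }
    '
}

# Hand-written sample resembling the bond0 table described in this post.
sample=$(cat <<'EOF'
port              vlan-id
bond0             1 PVID Egress Untagged
                  2-4094
EOF
)

printf '%s\n' "$sample" | vlan_egress 1     # untagged
printf '%s\n' "$sample" | vlan_egress 10    # tagged
```

On a live host the input would come from something like `bridge vlan show dev bond0 | vlan_egress 1` instead of the sample text.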
What I Tried (all failed)
- tag=1 — traffic exits untagged
- No tag + eth0.1 VLAN subinterface inside VM — bond0 still strips VLAN 1
- tag=8,trunks=1 — tap correct but bond0 PVID still strips VLAN 1 on egress
- bridge-pvid 8 on vmbr0 — lost host management access
- bridge vlan del dev bond0 vid 1 — lost host management access
- vlan dot1q tag native on Cisco — broke all other native VLAN 8 trunks
Current State
VM temporarily moved to 192.168.8.x (VLAN 8). Not ideal — this is Home Assistant and needs L2 adjacency with 192.168.10.x devices for mDNS/discovery.
The proper fix appears to be bridge-pvid 8 + moving host management IP to vmbr0.8, but this requires IPMI/console access and changes across all cluster nodes.
Question
Is tag=1 on a VLAN-aware bridge a known unsupported configuration? Should Network.pm differentiate between explicit tag=1 (user wants VLAN 1 tagged) and the default case? Or is there a simpler workaround I'm missing?
This VM was working with the same config before the power outage. How it previously worked remains an unsolved mystery.
Sorry for using AI, but it turned my thoughts and experiments into a post that is simple enough to read.