VLAN-aware configuration kills TCP handshake when host doesn't have IP in a given VLAN

kiler129

Active Member
Oct 20, 2020
I have a very peculiar issue that I have never seen before, and it took me forever to even narrow it down to the PVE host. My server has a single interface that carries multiple VLANs, and VMs are attached to particular VLANs. My network config (limited here to two VLANs) is quite simple:

Code:
auto ensfp0
iface ensfp0 inet manual

auto vmbr1
iface vmbr1 inet manual
    bridge-ports ensfp0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-500

auto vmbr1.20
iface vmbr1.20 inet static
    address 10.0.20.2/24
    gateway 10.0.20.1

auto vmbr1.40
iface vmbr1.40 inet static
    address 10.0.40.2/24

Then in my VM config I set:
Code:
net0: virtio=12:34:56:78:9A:BC,bridge=vmbr1,tag=40


The above configuration works. However, as soon as I remove the PVE host from vmbr1.40, things get "interesting":
  • The VM can ping hosts in its own network as well as in other networks (traffic goes properly through the gateway, and WAN access works)
  • The VM accepts inbound connections from its own network as well as from other networks
  • The TCP handshake fails if the connection comes in from the WAN through the gateway that DST-NATs it, which is incredibly strange

Given the symptoms, I went back and forth over the networking side (switches, routers, firewall) and found nothing. On a hunch I added an address to vmbr1.40 and it magically started working. What's 100x worse is that removing the IP again didn't break it, and I am pulling my hair out. Since I'm testing this on a separate host, I was able to fully reboot it, and it is still working without an IP. Thanks to a maintenance window over the weekend I was also able to reboot most of the switches and the main gateway; it is still working.

Does anyone see anything strange in this configuration? I really don't like configurations that "magically" start working. I also see that Proxmox creates dozens of "tap#" interfaces, which suggests that something isn't quite right with the VLAN handling on my host.
 
I have a very similar setup with one interface and multiple VLANs, and I have been tearing my hair out for the last few days! The number of times I have recreated a VM, had everything working, and then poof: I can't reach any of the local IPs from any other VLAN. Connections drop within SMB, so shares die randomly for no apparent reason. This behaviour is isolated to LXCs and VMs; the host works flawlessly. I've nuked my entire network chasing this error and rebuilt everything over and over.
 
I think this may have something to do with vmbr1, vmbr1.20 and vmbr1.40 all being defined; maybe there is something off with the same MAC showing up on different VLANs (tagged and untagged)...

I do that differently. My take on that configuration would be like so:

Code:
auto ensfp0
iface ensfp0 inet manual

auto vmbr1
iface vmbr1 inet manual
    bridge-ports ensfp0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-500
    bridge-mcsnoop 0

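# VLAN interfaces on top of the VLAN-aware bridge;
# vlan-raw-device must match the bridge name defined above (vmbr1)
auto lan20
iface lan20 inet static
    address 10.0.20.2/24
    gateway 10.0.20.1
    vlan-id 20
    vlan-raw-device vmbr1

auto lan40
iface lan40 inet static
    address 10.0.40.2/24
    vlan-id 40
    vlan-raw-device vmbr1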
 
Mine:
auto lo
iface lo inet loopback

iface enp2s0 inet manual

auto vmbr0
iface vmbr0 inet manual
    bridge-ports enp2s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

# VLAN 1 - Management Network
auto vmbr0.1
iface vmbr0.1 inet manual

# VLAN 10 - Trusted Network
auto vmbr0.10
iface vmbr0.10 inet manual

# VLAN 20 - IOT Network
auto vmbr0.20
iface vmbr0.20 inet static
    address 192.168.20.100/24
    gateway 192.168.20.1

# VLAN 30 - Guest Network
auto vmbr0.30
iface vmbr0.30 inet manual

# VLAN 40 - Windows Server Network
auto vmbr0.40
iface vmbr0.40 inet manual

# VLAN 50 - VMware Management Network
auto vmbr0.50
iface vmbr0.50 inet manual

# VLAN 60 - VMware High Availability Network
auto vmbr0.60
iface vmbr0.60 inet manual

source /etc/network/interfaces.d/*
 
Why do you even define all those VLANs on Proxmox? Any of them that has no IP configured has virtually no effect. To keep maintenance and potential problems low, I would not use any of them besides the one that has an IP, and I would change vmbr0.20 to:
Code:
auto iot
iface iot inet static
  address 192.168.20.100/24
  gateway 192.168.20.1
  vlan-id 20
  vlan-raw-device vmbr0

Everything else is just a matter of using vmbr0 together with a VLAN tag number in the VM/LXC network definitions, and of the router knowing how to route between all of those VLANs. Proxmox just needs to provide a VLAN-aware bridge and have the clients connect to a specific VLAN.
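
To illustrate, the guest side could look roughly like this. Treat it as a sketch only: the VM line reuses the format from the first post, while the container line, its MAC address and the tag numbers are made up for the example. The # lines are annotations for this post and are not meant to be pasted into the guest config files.

Code:
# VM on VLAN 40, in /etc/pve/qemu-server/<vmid>.conf
net0: virtio=12:34:56:78:9A:BC,bridge=vmbr0,tag=40

# Container on VLAN 30, in /etc/pve/lxc/<vmid>.conf (hypothetical values)
net0: name=eth0,bridge=vmbr0,hwaddr=12:34:56:78:9A:BD,ip=dhcp,tag=30,type=veth

The host needs no vmbr0.40 or vmbr0.30 stanza for these to work; the tag on the guest NIC is what places it in the VLAN.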

P.S.: Your VLAN 1 will probably not work if you do not also allow it like so in vmbr0:

bridge-vids 1-4094

Be aware there is a lot of confusion between different switch manufacturers about what VLAN 1 means: Is it a VLAN with tag 1 or does it mean untagged traffic? That is why it is discouraged to use VLAN 1.
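
Put together, the trimmed file might end up looking something like this. It is a sketch based on the config posted above, assuming the host only needs an address in VLAN 20 and that VLAN 1 is not used (otherwise bridge-vids would have to start at 1, as noted):

Code:
auto lo
iface lo inet loopback

iface enp2s0 inet manual

auto vmbr0
iface vmbr0 inet manual
    bridge-ports enp2s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

# Assumption: VLAN 20 (IOT) is the only VLAN where the host itself needs an address
auto iot
iface iot inet static
    address 192.168.20.100/24
    gateway 192.168.20.1
    vlan-id 20
    vlan-raw-device vmbr0

source /etc/network/interfaces.d/*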
 
Thank you for the observation. I had just switched to this config after seeing the initial poster's configuration. Prior to reading this post I was using the setup you recommend, and before all these issues started it had been working great. The only way I can get past these issues is by adding multiple NICs, each with a VLAN tag, for every network I need communication between, but obviously at that point I may as well not have VLANs and firewall rules configured.