VLANs Stopped Working in 6.0-11 Update

I recently updated my 4-node Proxmox cluster to version 6.0-11, and after the update, machines that are on a VLAN are no longer able to access the internet, connect to any other machine, or even get an IP address. VMs and containers not on a VLAN continue to work just fine. No other changes were made to the environment besides the Proxmox update. Each of my 4 nodes has 4 NICs, and all share an identical network configuration. They are all Dell PowerEdge R610 servers.


Does anyone have any ideas as to how I can resolve this? Currently about 700 VMs/containers are unable to connect to the internet!
 
That is the first thing I tried, sorry, I forgot to mention that. The systems are currently running kernel 5.0.21-3.

Hmm, and it is still broken? If so, I'd now rather suspect the environment (network switches or similar?)..

Otherwise, you can check `less /var/log/apt/history.log` to see exactly which packages were included in the last upgrade. Maybe we can pin it down to a specific one.
 
Hmm, and it is still broken? If so, I'd now rather suspect the environment (network switches or similar?)..

I have double- and triple-checked the environment already. The firewall has not been modified, and I have disabled automatic updates there. The switches are also unchanged. Additionally, there are other devices on the same VLANs that some of the VMs are trying to reach, and those devices work without issue. Since the only thing that changed was the packages on Proxmox, and other devices still function on the same VLANs, I can only assume it is something with the Proxmox machines.


Otherwise, you can check `less /var/log/apt/history.log` to see exactly which packages were included in the last upgrade. Maybe we can pin it down to a specific one.

Here is the output from the log you mentioned. I made it a little more human-readable, but these are all the packages that were updated.
Code:
Start-Date: 2019-11-05  09:09:49
Commandline: apt upgrade
Install: pve-kernel-5.0.21-4-pve:amd64 (5.0.21-8, automatic)
Upgrade:
  - pve-kernel-5.0:amd64 (6.0-9, 6.0-10),
  - libpve-access-control:amd64 (6.0-2, 6.0-3),
  - pve-firmware:amd64 (3.0-2, 3.0-4),
  - pve-qemu-kvm:amd64 (4.0.1-3, 4.0.1-4),
  - pve-docs:amd64 (6.0-7, 6.0-8),
  - zfs-initramfs:amd64 (0.8.2-pve1, 0.8.2-pve2),
  - pve-container:amd64 (3.0-7, 3.0-10),
  - libspice-server1:amd64 (0.14.2-4~pve6, 0.14.2-4~pve6+1),
  - zfsutils-linux:amd64 (0.8.2-pve1, 0.8.2-pve2),
  - pve-manager:amd64 (6.0-9, 6.0-11),
  - spl:amd64 (0.8.2-pve1, 0.8.2-pve2),
  - libzfs2linux:amd64 (0.8.2-pve1, 0.8.2-pve2),
  - libpve-guest-common-perl:amd64 (3.0-1, 3.0-2),
  - libpve-common-perl:amd64 (6.0-5, 6.0-6),
  - lxc-pve:amd64 (3.1.0-65, 3.2.1-1),
  - qemu-server:amd64 (6.0-9, 6.0-13),
  - pve-kernel-helper:amd64 (6.0-9, 6.0-11),
  - libzpool2linux:amd64 (0.8.2-pve1, 0.8.2-pve2),
  - libnvpair1linux:amd64 (0.8.2-pve1, 0.8.2-pve2),
  - libuutil1linux:amd64 (0.8.2-pve1, 0.8.2-pve2)
End-Date: 2019-11-05  09:10:40
 
Okay, so in an attempt to get some of my services back online, I moved some VMs and containers off of the VLANs, and they were still not able to connect. With this discovery I did some experimenting: the guests are not able to connect if they are going over a bonded network. In this case I have bonded eno3 and eno4 together into bond0 and then created a bridge on top of bond0 called Guest Network.

Any guests using the Guest Network bridge are unable to connect to the internet, while if I connect them through either the vmbr0 or vmbr1 interface it works. Those two interfaces are not bonded and are directly linked to the network.

[Screenshot: node pve2 in the Proxmox web UI]

So in summary it would appear the issue is not with VLANs but with bonded interfaces.
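
For reference, the relevant part of such a setup would look roughly like this in /etc/network/interfaces (a sketch only; the bridge name vmbr3 for "Guest Network", the bond mode, and the VLAN-awareness settings are assumptions, not taken from the screenshot):
Code:
auto eno3
iface eno3 inet manual

auto eno4
iface eno4 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno3 eno4
        bond-miimon 100
        bond-mode 802.3ad        # assumption -- could also be active-backup, balance-rr, ...

# "Guest Network" bridge on top of the bond (vmbr3 is an assumed name)
auto vmbr3
iface vmbr3 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094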
 
Hmm - tricky one ...

Things you could try to rule out further components:
* remove 'bond0'
* create 'vmbr2' with 'eno3' as port (see the sketch below)
* boot a guest with a VLAN-tagged interface - does it work?
* same with 'eno4' instead of 'eno3'

if one of the tries does not work - it might indicate a problem with one of the bond-ports (eno3, eno4) - or their corresponding switchports
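
A temporary test bridge on a single bond member might look roughly like this (a sketch; the bridge number and the VLAN-awareness settings should mirror whatever the "Guest Network" bridge uses and are assumptions here):
Code:
# temporary test bridge in /etc/network/interfaces, using one former bond member
auto eno3
iface eno3 inet manual

auto vmbr2
iface vmbr2 inet manual
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

# apply with 'ifreload -a' if ifupdown2 is installed, otherwise reboot the node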

if both bond members work fine individually:
* check `dmesg` after a fresh boot - are there any messages indicating any bond/network related problems? (see the example commands after this list)
* check the switch logs and state for that bond
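
For the host-side checks, something like the following can help (the bridge name is assumed):
Code:
# the bonding driver's view of bond0 and its members (link state, LACP partner, ...)
cat /proc/net/bonding/bond0

# detailed link info for the bond and the bridge on top of it
ip -d link show bond0
ip -d link show vmbr3          # assumed name of the "Guest Network" bridge

# kernel messages related to bonding or the NICs after a fresh boot
dmesg | grep -i -e bond -e eno3 -e eno4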

as a last (but quite often very effective) resort I would start tcpdump on various points to see where the packets (or the responses) get lost:
* `tcpdump -envi vmbr2`
* `tcpdump -envi bond0`
* `tcpdump -envi eno3`
* `tcpdump -envi eno4`
* `tcpdump -envi tap<VMID>i<IFACEID>` (the tap-device on the host for the VM with id VMID and interface number IFACEID; see the example below)
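
For example, to trace a ping from a guest with (hypothetical) VMID 100, you could watch at each point where the ICMP packets stop showing up:
Code:
# find the guest's NIC entries (net0 -> tap100i0, net1 -> tap100i1, ...); 100 is an example VMID
qm config 100 | grep ^net

# then, while pinging a known-good address from inside the guest:
tcpdump -envi tap100i0 icmp
tcpdump -envi bond0 icmp
tcpdump -envi eno3 icmp      # and eno4 -- the bond may hash the flow onto either member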

That should give you a clearer picture.

I hope this helps!
 
