Very weird VLAN issue (not 100% sure where to pinpoint it...)

iprigger

All,

I am having a very weird issue with one VLAN.

The infrastructure consists of two 10G switches with MLAG, two uplink switches (each uplinked via 2x 10G LACP to the MLAG switches), a NetApp (interface connected via LACP to the MLAG switches), and seven servers (active/backup to the MLAG switches).

Now... I have a VM in VLAN 1001 connecting to an NFS share on an interface in that same VLAN.

After ~2-3 minutes the traffic appears to be disrupted (i.e. it hangs). Funnily enough, I don't see anything unusual in the tcpdump.
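
For reference, I captured on both sides from the PVE host roughly like this (the tap and bond names are just examples from my setup; check yours with ip link):

Code:
# guest side: capture on the VM's tap device (name depends on the VMID)
tcpdump -ni tap100i0 -w vm-side.pcap host <nfs-ip>
# wire side: capture the tagged frames on the bond
tcpdump -ni bond0 -e -w wire-side.pcap vlan 1001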

Now, the weirdness gets really good:

If I add another interface to the VM that is in VLAN 10 (it could be on the same vmbridge or any other, it doesn't matter)... everything works flawlessly.

I only have these issues in VLAN 1001; all other VLANs work. In addition, I have tested with a dedicated machine connected to the same switch, and I don't see any issues on VLAN 1001 there.
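
In case someone wants to verify the same things: with a VLAN-aware Linux bridge you can check VLAN membership and MAC learning on the PVE host roughly like this (vmbr0 is just my example bridge name, <netapp-mac> is a placeholder):

Code:
# is VLAN 1001 actually allowed on all relevant bridge ports?
bridge vlan show
# on which port did the bridge learn the NetApp's MAC?
bridge fdb show br vmbr0 | grep -i <netapp-mac>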

I know this is hard to pinpoint... but has anybody had a similar issue with Proxmox v7?

Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-5
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.4.128-1-pve: 5.4.128-1
pve-kernel-4.15: 5.4-8
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 14.2.21-1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Kind Regards & thanks a lot!
Tobias
 
On a hunch: could this be a duplicate MAC address in that VLAN?
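
Untested sketch, but from any Linux box in VLAN 1001 something like this should flag a duplicate (interface name and MAC are only placeholders):

Code:
# duplicate address detection: any reply means another station claims that IP
arping -D -I vmbr0.1001 <vm-ip>
# also worth watching whether the VM's MAC flaps between bridge ports
bridge fdb show | grep -i <vm-mac>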
 
How does it look on the NetApp or switch side?
Would it be possible to share the tcpdump from both sides?

Does it happen when you transfer files to the NFS share, or also in an idle state?
Hi

From the NetApp side -> difficult (i.e. not supported).

From the VM side -> no information at all...

It just gets slower and slower until it locks up.

I have checked the switches again, and all VLANs are set up identically... spanning tree seems to be OK.

I am going to check again today, as I migrated some LXC containers to KVM VMs and the LXC network throughput was very slow (35 MB/s) - since they are in the same VLAN, it *could* be a hint.
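
To put a number on it I'll probably run a quick iperf3 test inside that VLAN, something like this (iperf3 needed on both ends):

Code:
iperf3 -s                    # on one machine in VLAN 1001
iperf3 -c <server-ip> -t 30  # on the other end; compare with other VLANs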

Tobias
 
Have you tried shutting down all links but one on every MLAG/LACP?
This looks like classic MAC address flapping when traffic is moved from one link to another by the load-balancing algorithm.

If everything works with a single link, you can re-enable the links one by one to find which "channel" is creating this behaviour.
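
On the Proxmox host side, forcing a single link could look roughly like this (interface and bond names are only examples, not taken from this setup):

Code:
# take one physical slave out of the bond...
ip link set enp65s0f0 down
# ...and confirm which slave is active now
cat /proc/net/bonding/bond0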
 
Hi,

Yes, I did try this - no change in behaviour. As I wrote above: if I leave one interface in VLAN 1, VLAN 10 or any other VLAN (without any traffic at all on that interface), it works.

I haven't tried this yet, but I will just create the interface without an IP... and see what happens.

BTW: the virtual interfaces are on the exact same bridge.
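
For the record, adding that extra vNIC is a one-liner; I just won't configure an address inside the guest (VMID 100 and vmbr0 are placeholders):

Code:
# second virtio NIC on VLAN 10, no IP configured in the guest
qm set 100 --net1 virtio,bridge=vmbr0,tag=10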

Tobias
 
No, the VM having the issue is a KVM VM. But I noticed some strange behaviour with LXC systems and migrated them.

Tobias
OK, I'm asking because I also had problems with LXC (though not on this PVE version) with much the same symptoms you're describing: the TCP window became full after a few seconds, even with a simple apt update, and then traffic stalled.
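
If it happens again, you could watch the window live while the transfer degrades; untested sketch, with <nfs-ip> as a placeholder:

Code:
# -t TCP, -i internal TCP info (cwnd, window, retransmits), -n numeric
watch -n1 'ss -tin dst <nfs-ip>'
# a receive window pinned at 0 would match the "fills up, then stalls" pattern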
 
