Very weird VLAN issue (not 100% sure where to pinpoint the issue...)

iprigger · Sep 13, 2021

All,

I am having a very weird issue with one VLAN.

The Infrastructure consists of two 10G Switches with MLAG, two Uplink Switches (uplink using 2x 10G LACP to the MLAG Switches), a NetApp (Interface connected using LACP to the MLAG Switches) and seven Servers (active/backup to the MLAG Switches).

Now... I have a VM in VLAN 1001 connecting to a NFS Share on an interface on that same VLAN.

After ~2-3min it looks like the traffic is disrupted (i.e.: hangs). Funny enough... I don't see any weird things in the tcpdump.

Now, the weirdness gets really good:

If I add another interface to the VM (could be on the same vmbridge or any other, doesn't matter) that is in VLAN10... everything works flawless.

I only have those issues in VLAN1001 - all other VLANs work. In addition, I have tested using a dedicated machine connected to the same switch - and I don't see any issues there on vlan1001.

I know this is hard to pinpoint... but does anybody have a similar issue with proxmox v7?

Code:

proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-5
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.4.128-1-pve: 5.4.128-1
pve-kernel-4.15: 5.4-8
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 14.2.21-1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Kind Regards & thanks a lot!
Tobias

6uellerbpanda · Sep 16, 2021

anything in the logs ?
traffic disrupted from the vm to the mounted nfs share or inside the vm itself or both ?

Stoiko Ivanov · Sep 16, 2021

iprigger said:
After ~2-3min it looks like the traffic is disrupted (i.e.: hangs). Funny enough... I don't see any weird things in the tcpdump.

Now, the weirdness gets really good:

If I add another interface to the VM (could be on the same vmbridge or any other, doesn't matter) that is in VLAN10... everything works flawless.

I only have those issues in VLAN1001 - all other VLANs work. In addition, I have tested using a dedicated machine connected to the same switch - and I don't see any issues there on vlan1001.

on a hunch - this could be a duplicate mac-address in that VLAN?

iprigger · Sep 19, 2021

6uellerbpanda said:
anything in the logs ?
traffic disrupted from the vm to the mounted nfs share or inside the vm itself or both ?

only from the VM to the NFS Share.... and unfortunately no logs.

tcpdump doesn't show anything bad either - it just seems like the traffic is stopping / getting delayed.

iprigger · Sep 19, 2021

Stoiko Ivanov said:
on a hunch - this could be a duplicate mac-address in that VLAN?

Checked that, not the case. and btw... the added interface in vlan10 does not do anything - nfs server inside vlan1001 is used...

As I said: it's weird.

6uellerbpanda · Sep 20, 2021

how does it look on netapp or switch side ?
would it be possible to share the tcpdump from both sides ?

is it happening when you transfer files to the nfs share or also in an idle state ?

iprigger · Sep 20, 2021

6uellerbpanda said:
how does it look on netapp or switch side ?
would it be possible to share the tcpdump from both sides ?

is it happening when you transfer files to the nfs share or also in an idle state ?

Hi

From the netapp side -> difficult (i.e. not supported).

From the VM side -> no information at all....

it just gets slower and slower until it locks up.

I have checked the switches again - and all VLANs are set up identically... Spanning-Tree seems to be OK.

I am going to check again today as I migrated some LXC to KVM VMs and the LCX Network Throughput has been kind of very slow (35MB/sec) - as they are in the same VLAN it *could* be a hint.

Tobias

6uellerbpanda · Sep 20, 2021

iprigger said:
I am going to check again today as I migrated some LXC to KVM VMs and the LCX Network Throughput has been kind of very slow (35MB/sec) - as they are in the same VLAN it *could* be a hint.

ahhh...so this VM is a lxc and not kvm you mean ?

iprigger · Sep 20, 2021

6uellerbpanda said:
ahhh...so this VM is a lxc and not kvm you mean ?

No, the VM having the Issue is a KVM VM. But I noticed some strange behaviour with LXC systems and migrated them.

Tobias

djzoidberg · Sep 20, 2021

Have you tried to shut down just a link on every MLAG/LACP?
It seems a common mac address flapping when the traffic is moved by the load balancing algorithm from a link to another.

If everything works with a single link, you can enable again link by link to find which is the "channel" that is creating this behavior.

iprigger · Sep 20, 2021

djzoidberg said:
Have you tried to shut down just a link on every MLAG/LACP?
It seems a common mac address flapping when the traffic is moved by the load balancing algorithm from a link to another.

If everything works with a single link, you can enable again link by link to find which is the "channel" that is creating this behavior.

Hi,

Yes, I did try this - no change in behaviour. As I wrote above: If I let one interface in vlan1, vlan10 or any other vlan (without having any sort of traffic from that interface) it works.

I haven't tried this but will just create the interface without an IP.... and see what happens.

BTW: the virtual interfaces are on the exact same bridge.

Tobias

6uellerbpanda · Sep 20, 2021

iprigger said:
No, the VM having the Issue is a KVM VM. But I noticed some strange behaviour with LXC systems and migrated them.

Tobias

ok I'm asking 'cause I also had problems with lxc, though not with this pve version, which also had somehow the same symptoms you're describing and the problem was the tcp window became full after some seconds even with a simple apt update and then traffic stalled.

Search

Search

Very weird VLAN issue (not 100% sure where to pinpoint the issue...)

iprigger

Renowned Member

6uellerbpanda

Well-Known Member

Stoiko Ivanov

Proxmox Staff Member

iprigger

Renowned Member

iprigger

Renowned Member

6uellerbpanda

Well-Known Member

iprigger

Renowned Member

6uellerbpanda

Well-Known Member

iprigger

Renowned Member

djzoidberg

Member

iprigger

Renowned Member

6uellerbpanda

Well-Known Member