Sporadic Networking drops on NEW LXCs only

zemsten

New Member
Apr 16, 2022
Alright, been struggling with this for a while.

I have an established environment with two VMs and six or so CTs spanning 4 VLANs. The mgmt interface is standalone and works fine. I have a 4-port Intel NIC with all 4 ports in a bond and a bridge on top of that. The bridge has no IP and is VLAN aware.

All the workloads I currently have set up work great. The problem comes when I create a new CT workload. It is OS independent. No matter what VLAN tag I assign, I'll have networking for a while, then the guest just falls off the network. I'll still have outbound connectivity if I pop a console into the guest, but I can't reach the guest from any other host, including my firewall, where all the routing happens. In fact, the ARP entry will even disappear after some time, as expected when ARP entries expire. Randomly it will come back, then drop again. No errors are logged on the guest or the host. The weird part is that it's only a problem for new workloads.

I have the 4 ports bonded with 802.3ad, LACP rate fast, and layer2+3 hashing. On the other end is a MikroTik CSS326, which appears to be just fine with the LAG.

PCAPs from the firewall and the guest OS don't show me anything. There aren't any strange errors or a flood of TCP retransmissions or anything. It's just there, then it's not. Most of the time when it comes back up, my SSH sessions will persist and I'll be off to the races for another couple of minutes before it drops again.
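For reference, here's the kind of check I've been running when it drops — whether the bridge even has an FDB entry for the guest's MAC. The snapshot, MAC address, and veth name below are made-up examples; on the host I pipe in the real output of `bridge fdb show br vmbr0` instead:

```shell
# Made-up snapshot of `bridge fdb show br vmbr0` output for illustration;
# on a live host, capture the real output instead of this here-doc.
fdb=$(cat <<'EOF'
aa:bb:cc:dd:ee:ff dev veth101i0 vlan 20 master vmbr0
52:54:00:12:34:56 dev bond0 vlan 20 master vmbr0
EOF
)

# Hypothetical guest MAC; substitute the real one from the CT's net0 line.
guest_mac="aa:bb:cc:dd:ee:ff"

# If the entry is missing, or has moved off the veth onto bond0, the
# bridge has unlearned/mislearned the guest and inbound frames get lost.
if printf '%s\n' "$fdb" | grep -q "^$guest_mac "; then
    echo "FDB entry present"
else
    echo "FDB entry missing"
fi
```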

I'm at wit's end. I appreciate any help!

Code:
proxmox-ve: 7.1-1 (running kernel: 5.15.30-1-pve)
pve-manager: 7.1-12 (running version: 7.1-12/b3c09de3)
pve-kernel-5.15: 7.1-14
pve-kernel-helper: 7.1-14
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.30-1-pve: 5.15.30-1
pve-kernel-5.15.27-1-pve: 5.15.27-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-7
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-5
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-7
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-6
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.2.0-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 

zemsten

New Member
Apr 16, 2022

I should've also included my /etc/network/interfaces for reference. Here it is, with my IPs redacted.


Code:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet static
    address RFC1918/24
    gateway RFC1918-gateway-IP

iface enp9s0f0 inet manual

iface enp9s0f1 inet manual

iface enp9s0f2 inet manual

iface enp9s0f3 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves enp9s0f0 enp9s0f1 enp9s0f2 enp9s0f3
    bond-miimon 100
    bond-mode 802.3ad
    bond-lacp-rate 1
    bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
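
In case it's useful, here's a quick sanity check I do on the bond itself — every slave should show MII up. The snapshot below is a fabricated example; on the host I read /proc/net/bonding/bond0 directly (the real file also has a bond-level "MII Status" line above the slaves, so the counts there run one higher):

```shell
# Fabricated snapshot of the per-slave portion of /proc/net/bonding/bond0;
# on a live host, use: bond_status=$(cat /proc/net/bonding/bond0)
bond_status=$(cat <<'EOF'
Slave Interface: enp9s0f0
MII Status: up
Slave Interface: enp9s0f1
MII Status: up
Slave Interface: enp9s0f2
MII Status: up
Slave Interface: enp9s0f3
MII Status: up
EOF
)

# All four slaves should report "MII Status: up" on a healthy 802.3ad LAG.
up=$(printf '%s\n' "$bond_status" | grep -c '^MII Status: up')
total=$(printf '%s\n' "$bond_status" | grep -c '^Slave Interface:')
echo "$up of $total slaves up"
```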
 

bobmc

Well-Known Member
May 17, 2018
This is one of those problems where you will be chasing your tail for ages. The problem is most likely at the switch end; have you asked on the MikroTik forums? The behaviour you describe suggests the link is not considered active by the switch until you initiate traffic from the host.

I have seen some posts suggesting that LACP pairs are less trouble than larger groups, so you might consider running two pairs rather than a quad? Just a thought...
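
If you try the two-pair idea, roughly something like this in /etc/network/interfaces — reusing your port names; bond1 and vmbr1 are just example names, and you'd need to split the switch side into two matching LACP groups as well:

```
auto bond0
iface bond0 inet manual
    bond-slaves enp9s0f0 enp9s0f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-lacp-rate 1
    bond-xmit-hash-policy layer2+3

auto bond1
iface bond1 inet manual
    bond-slaves enp9s0f2 enp9s0f3
    bond-miimon 100
    bond-mode 802.3ad
    bond-lacp-rate 1
    bond-xmit-hash-policy layer2+3

auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```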
 

zemsten

New Member
Apr 16, 2022
I have not asked the MikroTik folks yet. I appreciate the ideas. The one big fact that makes me think it's not the switch is that I have other LXCs on the same VLAN, running from the same host, with no problems whatsoever.

Separately, I originally had two NICs in the LACP bond for several years and only recently upgraded to the 4-port Intel card (I thought maybe the Realtek drivers were my issue). Same behavior there.

Because the old workloads are fine and only new ones are affected, it feels to me like some hidden default configuration option in PVE has changed since I created the old workloads. I'm just out of ideas now.
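
To chase that changed-default theory, I've been meaning to diff an old (working) CT's config against a new one. A rough sketch — the VMIDs, MACs, and the firewall=1 difference below are all invented stand-ins; on the host the two files would come from `pct config <vmid>`:

```shell
# Invented stand-ins for two container configs; on a real host, capture
# them with e.g.: pct config 101 > "$old"  and  pct config 110 > "$new"
old=$(mktemp)
new=$(mktemp)
cat > "$old" <<'EOF'
net0: name=eth0,bridge=vmbr0,hwaddr=AA:BB:CC:00:00:01,tag=20,type=veth
EOF
cat > "$new" <<'EOF'
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=AA:BB:CC:00:00:02,tag=20,type=veth
EOF

# Any option present only on the new CT (like the firewall=1 in this
# fabricated example) would be a lead worth chasing.
diff_out=$(diff "$old" "$new") || true
printf '%s\n' "$diff_out"
rm -f "$old" "$new"
```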
 

zemsten

New Member
Apr 16, 2022
Well, I was really hoping for more visibility here. This is a recurring issue. I've tried going back to an older kernel too, and nothing changes.

This could be on me as far as configuration goes, but I haven't changed much on the host itself. It seems like a PVE issue to me.
 
