Update from a recent version to 6.2-12 causing unstable network for VMs

Kafoof

Hi,

We have a 4-node cluster running Ceph that has been consistently upgraded for over a year.
However, after the nodes were restarted for the latest round of updates last night, we found that our VMs have intermittent packet loss.
In the kernel logs we are seeing STP port state events and packets received with their own source address. I should also note that each host has lldpd and ifupdown2 installed by default.
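
For reference, the relevant messages can be filtered out of the kernel log with something like this (the bridge and bond names are the ones from our configuration and will differ per setup):
Code:
# bridge/bond related kernel messages since boot
journalctl -k | grep -E 'vmbr|bond|own address as source'
# current state of each bridge port
bridge link show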

I have provided some relevant log snippets below:

Package Versions:
Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.8.21-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1
APT History:
Code:
Start-Date: 2020-09-22  09:46:47
Commandline: apt-get dist-upgrade
Install: proxmox-archive-keyring:amd64 (1.0, automatic)
Upgrade: pve-qemu-kvm:amd64 (5.1.0-1, 5.1.0-2), proxmox-backup-client:amd64 (0.8.15-1, 0.8.16-1), proxmox-ve:amd64 (6.2-1, 6.2-2)
End-Date: 2020-09-22  09:46:52

Start-Date: 2020-09-30  20:28:15 (This is the update that caused the issues)
Commandline: apt-get dist-upgrade
Install: pve-kernel-5.4.65-1-pve:amd64 (5.4.65-1, automatic)
Upgrade: pve-kernel-5.4:amd64 (6.2-6, 6.2-7), linux-libc-dev:amd64 (4.19.132-1, 4.19.146-1), pve-docs:amd64 (6.2-5, 6.2-6), pve-firewall:amd64 (4.1-2, 4.1-3), pve-container:amd64 (3.2-1, 3.2-2), proxmox-backup-client:amd64 (0.8.16-1, 0.8.21-1), libx11-6:amd64 (2:1.6.7-1, 2:1.6.7-1+deb10u1), ifupdown2:amd64 (3.0.0-1+pve2, 3.0.0-1+pve3), pve-manager:amd64 (6.2-11, 6.2-12), libx11-data:amd64 (2:1.6.7-1, 2:1.6.7-1+deb10u1), pve-kernel-helper:amd64 (6.2-6, 6.2-7), libx11-xcb1:amd64 (2:1.6.7-1, 2:1.6.7-1+deb10u1), base-files:amd64 (10.3+deb10u5, 10.3+deb10u6)

Logs from each hypervisor:
Code:
## Host 1 of 4
Oct  1 18:10:17 pm01 kernel: [ 2144.050932] vmbr2: port 2(tap105i0) entered disabled state
Oct  1 18:10:18 pm01 kernel: [ 2145.096068] device tap105i0 entered promiscuous mode
Oct  1 18:10:18 pm01 kernel: [ 2145.114930] vmbr2: port 2(tap105i0) entered blocking state
Oct  1 18:10:18 pm01 kernel: [ 2145.114934] vmbr2: port 2(tap105i0) entered disabled state
Oct  1 18:10:18 pm01 kernel: [ 2145.115167] vmbr2: port 2(tap105i0) entered blocking state
Oct  1 18:10:18 pm01 kernel: [ 2145.115170] vmbr2: port 2(tap105i0) entered forwarding state
(Virtual machines are running here)

## Host 2 of 4
Oct  1 17:48:32 pm02 kernel: [  565.249730] vmbr1: received packet on bond1.21 with own address as source address (addr:24:6e:96:13:9c:98, vlan:0)
Oct  1 17:48:32 pm02 kernel: [  565.249776] vmbr4: received packet on bond1.24 with own address as source address (addr:24:6e:96:13:9c:98, vlan:0)
Oct  1 18:09:33 pm02 kernel: [ 1826.819177] vmbr3: received packet on bond1.23 with own address as source address (addr:24:6e:96:13:9c:98, vlan:0)
Oct  1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)

## Host 3 of 4
Oct  1 17:53:06 pm03 kernel: [  483.507321] vmbr4: received packet on bond1.24 with own address as source address (addr:24:6e:96:11:64:c8, vlan:0)

## Host 4 of 4
Oct  1 18:05:09 pm04 kernel: [  287.186288] vmbr4: received packet on bond1.24 with own address as source address (addr:24:6e:96:13:b6:d8, vlan:0)

And an example of the network interfaces file (no changes were made to it):
Code:
### NETWORK INTERFACE FILE
auto eno3
iface eno3 inet manual
#Intel i350 LOM - 1gbe - Port 1

auto eno4
iface eno4 inet manual
#Intel i350 LOM - 1gbe - Port 2

auto eno1
iface eno1 inet manual
#Intel x520 LOM - 10gbe - Port 1

auto eno2
iface eno2 inet manual
#Intel x520 LOM - 10gbe - Port 2

auto enp4s0f0
iface enp4s0f0 inet manual
#Intel x520 PCIe - 10gbe - Port 1

auto enp4s0f1
iface enp4s0f1 inet manual
#Intel x520 PCIe - 10gbe - Port 2

auto enp132s0
iface enp132s0 inet manual
#Mellanox ConnectX-3 Pro - 40gbe - Port 1

auto enp132s0d1
iface enp132s0d1 inet manual
#Mellanox ConnectX-3 Pro - 40gbe - Port 2

auto bond0
iface bond0 inet manual
        bond-slaves eno3 eno4
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        bond-lacp-rate 1
        bond-min-links 1
#Bond for inband management

auto bond1
iface bond1 inet manual
        bond-slaves eno1 eno2 enp4s0f0 enp4s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        bond-lacp-rate 1
        bond-min-links 1
#Bond for VM data networks

auto bond2
iface bond2 inet manual
        bond-slaves enp132s0 enp132s0d1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        mtu 9000
        bond-lacp-rate 1
        bond-min-links 1
#Bond for storage network

auto vmbr0
iface vmbr0 inet static
        address 192.2.29.106/24
        gateway 192.2.29.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
#129_inband_management

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1.21
        bridge-stp off
        bridge-fd 0
#21_dmz_sub_zone1

auto vmbr2
iface vmbr2 inet manual
        bridge-ports bond1.22
        bridge-stp off
        bridge-fd 0
#22_dmz_sub_zone2 (web front ends)

auto vmbr3
iface vmbr3 inet manual
        bridge-ports bond1.23
        bridge-stp off
        bridge-fd 0
#23_dmz_sub_zone3 (back end services)

auto vmbr4
iface vmbr4 inet manual
        bridge-ports bond1.24
        bridge-stp off
        bridge-fd 0
#24_dmz_sub_zone4

auto vmbr5
iface vmbr5 inet manual
        bridge-ports bond1.25
        bridge-stp off
        bridge-fd 0
#25_dmz_sub_zone5

auto vmbr6
iface vmbr6 inet static
        address 192.168.205.2/24
        bridge-ports bond2.205
        bridge-stp off
        bridge-fd 0
        mtu 9000
#205_dmz_cluster_link

auto vmbr7
iface vmbr7 inet static
        address 192.168.206.2/24
        bridge-ports bond2.206
        bridge-stp off
        bridge-fd 0
        mtu 9000
#206_dmz_ceph_storage

auto vmbr8
iface vmbr8 inet static
        address 192.168.207.2/24
        bridge-ports bond2.207
        bridge-stp off
        bridge-fd 0
        mtu 9000
#207_dmz_fs_storage
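
As a sanity check that ifupdown2 has actually applied the configuration above after a reboot, something along these lines can be used (purely a verification step, not a fix):
Code:
# compare the running state against /etc/network/interfaces
ifquery --check -a
# re-apply the configuration without a reboot if anything is out of sync
ifreload -a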

I generally upgrade this cluster once every 2 weeks, as it has proved to be very reliable. We never had network issues with this cluster in the past until we applied this recent update.

Any advice would be highly appreciated. Thank you in advance!
 
I should also note that on the switch side I can see some interfaces intermittently falling out of LACP.
Code:
TAT41-SWL04#show etherchannel
Port Channel Port-Channel1:
  Active Ports: Ethernet2 PeerEthernet1 Ethernet1
  Configured, but inactive ports:
       Port             Reason unconfigured
    ------------------- -------------------
       PeerEthernet2

Port Channel Port-Channel2:
  Active Ports: PeerEthernet4 PeerEthernet3 Ethernet3 Ethernet4
Port Channel Port-Channel3:
  Active Ports: PeerEthernet6 PeerEthernet5 Ethernet6 Ethernet5
Port Channel Port-Channel4:
  Active Ports: Ethernet7 Ethernet8
  Configured, but inactive ports:
       Port             Reason unconfigured
    ------------------- -------------------
       PeerEthernet7
       PeerEthernet8

Port Channel Port-Channel999:
  Active Ports:  Ethernet47 Ethernet49 Ethernet48 Ethernet50 Ethernet46
                 Ethernet45 Ethernet51 Ethernet52
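
To compare against the switch view, the host-side LACP state of the VM data bond can be checked along these lines (bond1 as defined in the interfaces file above):
Code:
# full per-slave LACP and aggregator details
cat /proc/net/bonding/bond1
# condensed view of link and aggregator status
grep -E 'Slave Interface|MII Status|Aggregator ID|Partner Mac' /proc/net/bonding/bond1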
 
Small update from my last post: rolling back ifupdown2 to the previous version brought no improvement.
Code:
apt-get install ifupdown2=3.0.0-1+pve2
I get about 10 packets through before the packet loss sets in.
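
If anyone wants to test the same rollback, holding the package keeps the next dist-upgrade from pulling the newer version back in:
Code:
# pin ifupdown2 at the downgraded version while testing
apt-mark hold ifupdown2
# confirm which version is installed and what apt would pick
apt-cache policy ifupdown2
# release the hold once testing is finished
apt-mark unhold ifupdown2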
 
Update for the single host that is hosting all the VMs:

Code:
# roll back packages on this single host
apt install pve-firewall:amd64=4.1-2
apt install pve-manager:amd64=6.2-11
apt install pve-kernel-helper:amd64=6.2-6

After downgrading I rebooted the host.

Pings to a routed IP will work temporarily, but then experience packet loss for a period of time before recovering.
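
To make the loss windows easy to see, plain ping with timestamps works well; the gateway from the interfaces file is used here only as an example target:
Code:
# continuous ping with a per-reply timestamp to a routed address
ping -D 192.2.29.254
# or summarize loss over 100 probes
ping -q -c 100 -i 0.2 192.2.29.254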

I am also seeing this from snmpd:

Code:
Oct  1 20:36:54 pm01 snmpd[1957]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 20:36:54 pm01 snmpd[1957]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 20:36:54 pm01 snmpd[1957]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 20:36:54 pm01 snmpd[1957]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 20:36:54 pm01 snmpd[1957]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 20:36:54 pm01 snmpd[1957]: error on subcontainer 'ia_addr' insert (-1)
Oct  1 20:36:54 pm01 snmpd[1957]: error on subcontainer 'ia_addr' insert (-1)
 
Hi,

It looks like we have a similar problem in our cluster. I have not yet had the time to troubleshoot it in depth, but we're experiencing very unstable connectivity between the VMs, too. With IPv6 we found that NDP was not working, and we were able to fix that by disabling multicast snooping and enabling the multicast querier on the bridge. We're using ifupdown2, too.
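
For reference, and with the caveat that the option names should be double-checked against your ifupdown2 version, that change roughly looks like the example below (vmbr1 and its ports are only placeholders); it can also be toggled at runtime through sysfs:
Code:
# example bridge stanza with multicast snooping off and the querier enabled
auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1.21
        bridge-stp off
        bridge-fd 0
        bridge-mcsnoop 0
        bridge-mcquerier 1

# runtime equivalent, without reloading the interface
echo 0 > /sys/class/net/vmbr1/bridge/multicast_snooping
echo 1 > /sys/class/net/vmbr1/bridge/multicast_querier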

Initially I thought the issue was caused by our switches, but since we've seen loss between VMs on the same physical host, we're pretty sure it's related to the Proxmox update. Regarding the snmpd issue mentioned in the post before, I can see the same errors in our environment, but they were already in the log files prior to the update.

Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.8.21-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1

Regards
Jan
 
Hi MrXermon, thanks for the response. Our environment is IPv4 only. Good to know about the snmpd errors as well; thank you for the clarification.

I should also note that when I migrate a VM, I now see this in the output after upgrading:
Code:
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
 
I downgraded all hosts but am still experiencing the same effect.
On the switch side I can see the MLAG states flapping:
Code:
                                                                                      local/remote
   mlag       desc                                  state       local       remote          status
---------- ---------------------------- -------------------- ----------- ------------ ------------
      1       "Trunk to PM01"          active-full         Po1          Po1           up/up
      2       "Trunk to PM02"       active-partial         Po2          Po2         down/up
      3       "Trunk to PM03"       active-partial         Po3          Po3         down/up
      4       "Trunk to PM04"       active-partial         Po4          Po4         down/up
SWL03#show mlag interfaces
                                                                                      local/remote
   mlag       desc                                  state       local       remote          status
---------- ---------------------------- -------------------- ----------- ------------ ------------
      1       "Trunk to PM01"          active-full         Po1          Po1           up/up
      2       "Trunk to PM02"       active-partial         Po2          Po2         down/up
      3       "Trunk to PM03"          active-full         Po3          Po3           up/up
      4       "Trunk to PM04"       active-partial         Po4          Po4         down/up
 
Interesting. I'm currently reinstalling one of the hosts with the old image. Let's see if that solves the issue on that specific host. If it does, we'll probably reinstall all hosts from the old image as a short-term fix. Anyhow, it would be very interesting to troubleshoot this issue more deeply. I know another Proxmox user who has the same issue after the update as well. He is using ifupdown2, too.
 
We are not yet on 6.2-12, but I have heard from many others that they have had the same problems since 6.2-12.
I think this is a bigger problem and should be analyzed by the PVE team.
I hope we get an official statement as soon as possible.
 
I think the upgrades might have triggered some unstable state on the Arista switch side. MLAG and LACP appear to have stabilized now.
After rebooting each switch in sequence, it seems traffic has been restored.
I have now upgraded all 4 nodes back to 6.2-12 and everything appears to be functioning normally.
I will monitor over the weekend and report back if I see this behavior again.
 
Interesting to hear. I rebooted our switches about a week ago and it did not have any impact. We're using Dell switches with LACP in a VLT setup to create the redundant connections. The interesting thing is that I could even measure loss between VMs on the same host.
Still, it would be very interesting to get someone from Proxmox involved in the troubleshooting.
 
Have you tried rolling back the kernel to an older version?

About ifupdown2, I don't see any change in the last version. (I'm running the latest version, with LACP bonding on 2 Mellanox switches, without any problem.) Anyway, ifupdown2 only does the setup at startup; you can verify the bond status with "cat /proc/net/bonding/bondX".

Maybe it's a kernel bug in the NIC driver?
What is your NIC model?
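
The NIC model and the driver/firmware in use can be pulled with something like this (interface names are just examples):
Code:
# list the physical NICs
lspci -nn | grep -i ethernet
# driver, version and firmware for a given interface
ethtool -i eno1
ethtool -i enp4s0f0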
 
