Hi,
We have a 4-node cluster running Ceph that has been consistently upgraded for over a year.
However, after the nodes were restarted for the latest round of updates last night, we found that our VMs have intermittent packet loss.
In the kernel logs we are seeing STP port-state events and packets received with the host's own MAC address as the source. I should also note that each host has lldpd and ifupdown2 installed by default.
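For reference, these are the sort of commands that can show the per-port bridge/STP state on each node (a generic sketch using vmbr1 from the config below as an example, not actual output from the failing hosts):
Code:
ip -d link show vmbr1      # bridge details, including STP setting and forward delay
bridge link show           # per-port state (forwarding/blocking) for all bridges
bridge fdb show br vmbr1   # MAC addresses learned on the bridge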
I have provided some relevant log snippets below:
Package Versions:
Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.8.21-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1
APT History:
Code:
Start-Date: 2020-09-22 09:46:47
Commandline: apt-get dist-upgrade
Install: proxmox-archive-keyring:amd64 (1.0, automatic)
Upgrade: pve-qemu-kvm:amd64 (5.1.0-1, 5.1.0-2), proxmox-backup-client:amd64 (0.8.15-1, 0.8.16-1), proxmox-ve:amd64 (6.2-1, 6.2-2)
End-Date: 2020-09-22 09:46:52
Start-Date: 2020-09-30 20:28:15 (This is the update that caused the issues)
Commandline: apt-get dist-upgrade
Install: pve-kernel-5.4.65-1-pve:amd64 (5.4.65-1, automatic)
Upgrade: pve-kernel-5.4:amd64 (6.2-6, 6.2-7), linux-libc-dev:amd64 (4.19.132-1, 4.19.146-1), pve-docs:amd64 (6.2-5, 6.2-6), pve-firewall:amd64 (4.1-2, 4.1-3), pve-container:amd64 (3.2-1, 3.2-2), proxmox-backup-client:amd64 (0.8.16-1, 0.8.21-1), libx11-6:amd64 (2:1.6.7-1, 2:1.6.7-1+deb10u1), ifupdown2:amd64 (3.0.0-1+pve2, 3.0.0-1+pve3), pve-manager:amd64 (6.2-11, 6.2-12), libx11-data:amd64 (2:1.6.7-1, 2:1.6.7-1+deb10u1), pve-kernel-helper:amd64 (6.2-6, 6.2-7), libx11-xcb1:amd64 (2:1.6.7-1, 2:1.6.7-1+deb10u1), base-files:amd64 (10.3+deb10u5, 10.3+deb10u6)
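If the ifupdown2 or kernel package from that second update turns out to be the trigger, I assume they could be rolled back one at a time to isolate it, roughly like this (a sketch; it assumes the older ifupdown2 version is still available from the repository or the local apt cache):
Code:
# Hypothetical isolation step: pin ifupdown2 back to the pre-update version listed above
apt-get install ifupdown2=3.0.0-1+pve2
# Alternatively, reboot and select the previous kernel (5.4.60-1-pve) from the GRUB "Advanced options" menu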
Logs from each hypervisor:
Code:
## Host 1 of 4
Oct 1 18:10:17 pm01 kernel: [ 2144.050932] vmbr2: port 2(tap105i0) entered disabled state
Oct 1 18:10:18 pm01 kernel: [ 2145.096068] device tap105i0 entered promiscuous mode
Oct 1 18:10:18 pm01 kernel: [ 2145.114930] vmbr2: port 2(tap105i0) entered blocking state
Oct 1 18:10:18 pm01 kernel: [ 2145.114934] vmbr2: port 2(tap105i0) entered disabled state
Oct 1 18:10:18 pm01 kernel: [ 2145.115167] vmbr2: port 2(tap105i0) entered blocking state
Oct 1 18:10:18 pm01 kernel: [ 2145.115170] vmbr2: port 2(tap105i0) entered forwarding state
(Virtual machines are running here)
## Host 2 of 4
Oct 1 17:48:32 pm02 kernel: [ 565.249730] vmbr1: received packet on bond1.21 with own address as source address (addr:24:6e:96:13:9c:98, vlan:0)
Oct 1 17:48:32 pm02 kernel: [ 565.249776] vmbr4: received packet on bond1.24 with own address as source address (addr:24:6e:96:13:9c:98, vlan:0)
Oct 1 18:09:33 pm02 kernel: [ 1826.819177] vmbr3: received packet on bond1.23 with own address as source address (addr:24:6e:96:13:9c:98, vlan:0)
Oct 1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct 1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct 1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct 1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct 1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct 1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct 1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
Oct 1 18:12:25 pm02 snmpd[1923]: error on subcontainer 'ia_addr' insert (-1)
## Host 3 of 4
Oct 1 17:53:06 pm03 kernel: [ 483.507321] vmbr4: received packet on bond1.24 with own address as source address (addr:24:6e:96:11:64:c8, vlan:0)
## Host 4 of 4
Oct 1 18:05:09 pm04 kernel: [ 287.186288] vmbr4: received packet on bond1.24 with own address as source address (addr:24:6e:96:13:b6:d8, vlan:0)
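The "own address as source" messages make me wonder about the bonds, so for completeness these are the standard ways to check the 802.3ad/LACP state on each node (a sketch; bond1 is the VM data bond from the config below):
Code:
cat /proc/net/bonding/bond1   # per-slave MII status plus 802.3ad aggregator and partner info
ip -d link show bond1         # bond mode, xmit hash policy and carrier state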
And an example of the network interfaces file (no changes were made to it):
Code:
### NETWORK INTERFACE FILE
auto eno3
iface eno3 inet manual
#Intel i350 LOM - 1gbe - Port 1
auto eno4
iface eno4 inet manual
#Intel i350 LOM - 1gbe - Port 2
auto eno1
iface eno1 inet manual
#Intel x520 LOM - 10gbe - Port 1
auto eno2
iface eno2 inet manual
#Intel x520 LOM - 10gbe - Port 2
auto enp4s0f0
iface enp4s0f0 inet manual
#Intel x520 PCIe - 10gbe - Port 1
auto enp4s0f1
iface enp4s0f1 inet manual
#Intel x520 PCIe - 10gbe - Port 2
auto enp132s0
iface enp132s0 inet manual
#Mellanox ConnectX-3 Pro - 40gbe - Port 1
auto enp132s0d1
iface enp132s0d1 inet manual
#Mellanox ConnectX-3 Pro - 40gbe - Port 2
auto bond0
iface bond0 inet manual
bond-slaves eno3 eno4
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer2+3
bond-lacp-rate 1
bond-min-links 1
#Bond for inband management
auto bond1
iface bond1 inet manual
bond-slaves eno1 eno2 enp4s0f0 enp4s0f1
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer2+3
bond-lacp-rate 1
bond-min-links 1
#Bond for VM data networks
auto bond2
iface bond2 inet manual
bond-slaves enp132s0 enp132s0d1
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer2+3
mtu 9000
bond-lacp-rate 1
bond-min-links 1
#Bond for storage network
auto vmbr0
iface vmbr0 inet static
address 192.2.29.106/24
gateway 192.2.29.254
bridge-ports bond0
bridge-stp off
bridge-fd 0
#129_inband_management
auto vmbr1
iface vmbr1 inet manual
bridge-ports bond1.21
bridge-stp off
bridge-fd 0
#21_dmz_sub_zone1
auto vmbr2
iface vmbr2 inet manual
bridge-ports bond1.22
bridge-stp off
bridge-fd 0
#22_dmz_sub_zone2 (web front ends)
auto vmbr3
iface vmbr3 inet manual
bridge-ports bond1.23
bridge-stp off
bridge-fd 0
#23_dmz_sub_zone3 (back end services)
auto vmbr4
iface vmbr4 inet manual
bridge-ports bond1.24
bridge-stp off
bridge-fd 0
#24_dmz_sub_zone4
auto vmbr5
iface vmbr5 inet manual
bridge-ports bond1.25
bridge-stp off
bridge-fd 0
#25_dmz_sub_zone5
auto vmbr6
iface vmbr6 inet static
address 192.168.205.2/24
bridge-ports bond2.205
bridge-stp off
bridge-fd 0
mtu 9000
#205_dmz_cluster_link
auto vmbr7
iface vmbr7 inet static
address 192.168.206.2/24
bridge-ports bond2.206
bridge-stp off
bridge-fd 0
mtu 9000
#206_dmz_ceph_storage
auto vmbr8
iface vmbr8 inet static
address 192.168.207.2/24
bridge-ports bond2.207
bridge-stp off
bridge-fd 0
mtu 9000
#207_dmz_fs_storage
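Since ifupdown2 itself was one of the upgraded packages, I assume the running state can be checked against this file with its own tooling, something like (a sketch, run per node):
Code:
ifquery -a -c    # compare the running interface state against /etc/network/interfaces
ifreload -a      # re-apply the configuration without rebooting the node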
I generally upgrade this cluster once every two weeks, as it has proved to be very reliable. We never had any network issues with this cluster in the past until we applied this most recent update.
Any advice would be highly appreciated. Thank you in advance!