PVE 7.2.7: suddenly one bond does not come up after reboot

Hi folks
We are running a six-member cluster with several network connections through enterprise-grade switching:
  • GbE (copper) for HA
  • 2x2-port SFP28 to two switches as bonds (LACP 802.3ad)
    • one bond for Data/Mgmt
    • one bond for Storage
This setup worked well until about two or three weeks ago. Since then, after a restart of a node (maintenance reboot), bond1 suddenly does not come up automatically anymore and has to be brought up manually. This is not a big issue, but it needs attention during maintenance.
Any hint as to why this occurs?
The setup is identical on all nodes (HPE hardware, Broadcom BCM57414 network cards):
  • card 1 / port 1 --> bond0 through switch 1
  • card 1 / port 2 --> bond1 through switch 1
  • card 2 / port 1 --> bond0 through switch 2
  • card 2 / port 2 --> bond1 through switch 2
 
Any hint as to why this occurs?
For now, not really - but check the journal since the reboot occurred (`journalctl -b`); usually this gives a hint at what's not working.

If you don't find the issue:
* post the journal (anonymize only what you must)
* post your /etc/network/interfaces

I hope this helps!
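For example, to narrow the journal down to the messages that matter here (a minimal sketch; the bond and driver names are taken from this thread):
Code:
# kernel messages from the current boot that mention the bond or the Broadcom driver
journalctl -b -k | grep -Ei 'bond1|bnxt_en'

# log of the ifupdown networking service for this boot
journalctl -b -u networking.service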
 
Perhaps the network device naming of one NIC has changed? (I've seen one such case during an update recently.)
But in that case you should get an error on manual start too…
Do you have an auto entry for all devices?

Udo
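Both points can be checked quickly from the shell, along these lines (a sketch):
Code:
# current interface names and link state
ip -br link

# did udev rename any NIC during this boot?
journalctl -b -k | grep -i renamed

# which stanzas have an 'auto' entry?
grep -E '^auto' /etc/network/interfaces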
 
Will need to reboot the nodes to see if it happens again and to go through `journalctl -b`; at the moment I don't see any useful info in the logs of the two machines that have been rebooted.
Just this, very early in the boot:
Code:
Aug 02 17:04:07 MY-HOSTNAME kernel: bnxt_en 0000:85:00.0 (unnamed net_device) (uninitialized): Device requests max timeout of 100 seconds, may trigger hung task watchdog
Aug 02 17:04:07 MY-HOSTNAME kernel: bnxt_en 0000:85:00.0 eth0: Broadcom BCM57414 NetXtreme-E 10Gb/25Gb Ethernet found at mem ab210000, node addr e4:3d:1a:0e:13:20
Aug 02 17:04:07 MY-HOSTNAME kernel: bnxt_en 0000:85:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
Aug 02 17:04:07 MY-HOSTNAME kernel: bnxt_en 0000:85:00.1 (unnamed net_device) (uninitialized): Device requests max timeout of 100 seconds, may trigger hung task watchdog
The hung-task message appears two more times.
Afterwards, the enslaving of the devices looks OK:
Code:
Aug 02 17:04:10 MY-HOSTNAME kernel: bnxt_en 0000:85:00.1 ens1f1np1: NIC Link is Up, 25000 Mbps (NRZ) full duplex, Flow control: ON - receive & transmit
Aug 02 17:04:10 MY-HOSTNAME kernel: bnxt_en 0000:85:00.1 ens1f1np1: FEC autoneg off encoding: Clause 74 BaseR
Aug 02 17:04:10 MY-HOSTNAME kernel: bond1: (slave ens1f1np1): Enslaving as a backup interface with an up link
Aug 02 17:04:10 MY-HOSTNAME kernel: bnxt_en 0000:03:00.0 ens2f0np0: NIC Link is Up, 25000 Mbps (NRZ) full duplex, Flow control: ON - receive & transmit
Aug 02 17:04:10 MY-HOSTNAME kernel: bnxt_en 0000:03:00.0 ens2f0np0: FEC autoneg off encoding: Clause 74 BaseR
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/pci0000:00/0000:00:03.1/0000:03:00.0/net/ens2f0>
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/pci0000:00/0000:00:03.1/0000:03:00.0/net/ens2f0>
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/pci0000:00/0000:00:03.1/0000:03:00.0/net/ens2f0>
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/pci0000:00/0000:00:03.1/0000:03:00.0/net/ens2f0>
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/pci0000:00/0000:00:03.1/0000:03:00.0/net/ens2f0>
Aug 02 17:04:10 MY-HOSTNAME kernel: bond1: (slave ens2f0np0): Enslaving as a backup interface with an up link
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/pci0000:80/0000:80:03.1/0000:85:00.1/net/ens1f1>
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/pci0000:00/0000:00:03.1/0000:03:00.0/net/ens2f0>
Aug 02 17:04:10 MY-HOSTNAME kernel: IPv6: ADDRCONF(NETDEV_CHANGE): bond1: link becomes ready
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/pci0000:80/0000:80:03.1/0000:85:00.1/net/ens1f1>
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/pci0000:00/0000:00:03.1/0000:03:00.0/net/ens2f0>
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/bond1/type" pid=1575 comm="sssd" re>
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/bond1/type" pid=1575 comm="sssd" re>
Aug 02 17:04:10 MY-HOSTNAME systemd-udevd[1186]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/bond1/type" pid=1575 comm="sssd" re>
Aug 02 17:04:10 MY-HOSTNAME kernel: 8021q: 802.1Q VLAN Support v1.8
Aug 02 17:04:10 MY-HOSTNAME kernel: 8021q: adding VLAN 0 to HW filter on device ens10f0
Aug 02 17:04:10 MY-HOSTNAME kernel: 8021q: adding VLAN 0 to HW filter on device bond1
Aug 02 17:04:10 MY-HOSTNAME audit[1575]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/bond1.600/type" pid=1575 comm="sssd>
Aug 02 17:04:10 MY-HOSTNAME systemd-udevd[1186]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable


Network
Code:
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto ens1f0np0
iface ens1f0np0 inet manual

auto ens1f1np1
iface ens1f1np1 inet manual
    mtu 9000

auto ens2f0np0
iface ens2f0np0 inet manual
    mtu 9000

auto ens2f1np1
iface ens2f1np1 inet manual

auto ens10f0
iface ens10f0 inet static
    address 10.xxx.xxx.xxx/27

iface ens10f0 inet6 static
    address 2a00:xxxx:xxxx:xxxx::xxxx/64

iface ens10f1 inet manual

iface ens10f2 inet manual

iface ens10f3 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves ens1f0np0 ens2f1np1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3

auto bond1
iface bond1 inet manual
    bond-slaves ens1f1np1 ens2f0np0
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3
    mtu 9000

auto bond1.600
iface bond1.600 inet static
    address 10.xxx.xxx.xxx/27
    mtu 9000

iface bond1.600 inet6 static
    address 2a00:xxxx:xxxx:xxxx::xxxx/64

iface vmbr1 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    mtu 9000

iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

auto vmbr0.100
iface vmbr0.100 inet static
    address 10.xxx.xxx.xxx/25
    gateway 10.xxx.xxx.1

iface vmbr0.100 inet6 static
    address 2a00:xxxx:xxxx:xxxx::xxxx/64

As I said, this happens only to bond1; bond0 comes up normally.
The only difference between the two is that bond1 uses jumbo frames.
And the behavior is new: it worked fine until recently, and we didn't change anything on the network, just the regular patches from Debian/Proxmox.
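After the next maintenance reboot it may be worth checking whether the jumbo-frame MTU actually lands on all members, and re-applying the config without another reboot; a sketch (PVE 7 ships ifupdown2, which provides ifreload):
Code:
# do bond1, its slaves and the VLAN actually carry MTU 9000?
for i in bond1 ens1f1np1 ens2f0np0 bond1.600; do
    ip -d link show "$i" | grep -o 'mtu [0-9]*'
done

# re-apply /etc/network/interfaces in place (ifupdown2)
ifreload -a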
 
The other variation is that you are tagging VLAN 600 and an IP address explicitly on bond1.600, whereas VLAN 100 is configured differently (on top of the bridge, as vmbr0.100).

the documentation suggests:
Code:
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3

iface bond0.5 inet manual

auto vmbr0v5
iface vmbr0v5 inet static
    address 10.10.10.2/24
    gateway 10.10.10.1
    bridge-ports bond0.5
    bridge-stp off
    bridge-fd 0
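Applied to the config above, the storage VLAN would then look roughly like this (a sketch only; vmbr1v600 is a hypothetical bridge name following the documented pattern, and the placeholder address is kept from the posted config):
Code:
iface bond1.600 inet manual
    mtu 9000

auto vmbr1v600
iface vmbr1v600 inet static
    address 10.xxx.xxx.xxx/27
    mtu 9000
    bridge-ports bond1.600
    bridge-stp off
    bridge-fd 0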
 
OK, yesterday we patched all nodes and they came up without a flaw.
Our network engineer said he will check whether there is perhaps something different on the switches that the system engineers didn't find.
We'll keep an eye on it and will follow up.
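One host-side check that may help that comparison: the bonding driver exposes the LACP details negotiated with each switch, so the partner sections can be diffed across nodes (a sketch):
Code:
# full bonding state, including LACP actor/partner info per slave
cat /proc/net/bonding/bond1

# just the negotiated partner details (system MAC, oper key, port state)
grep -i -A8 'partner' /proc/net/bonding/bond1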
 
