[SOLVED] Bridge not starting and interfaces down at boot after CPU & RAM upgrade

cmrt

New Member
Feb 11, 2022
Hi all,

I've upgraded my PVE server today from a Ryzen 3400G with 16GB RAM to a 5700G with 64GB. After rebooting, none of the VMs came up, because at boot vmbr0 doesn't exist.

At the same time, all the interfaces are DOWN (I don't have the exact output at the moment, but it's the state you get after ip link set <interface> down, not a NO-CARRIER error).

Issuing a systemctl restart networking fixes the network problem, after which I can manually start all the VMs; I'd like to go back to the autostart though :)

After changing the hardware, the BIOS reset, but I went through it and re-enabled IOMMU and a couple of other things. I had a few network-dependent services that were starting (well, failing to start) at boot, like an NFS fstab entry; I removed all of them, so the "only problem" left is that the network itself never comes up.
I have pfSense virtualized running my own network, so without it I'm toast until/unless I can get a keyboard and screen plugged in.

I saw a lot of threads from people with similar issues, but none of them seems to match mine: they're either a full hang on boot (which I never experienced), an /etc/network/interfaces missing a parameter (mine isn't), or someone whose vmbr0 existed but had an issue when renaming interfaces (not my case).

One thing that I think my system has, different from the three I linked, is a fiber NIC (Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s]) but I'm sure I'm not the only one with it in their server.

I tried a few different things, none of which worked:
- booting from 2 older kernels (5.13 and 5.11)
- installing pve-kernel-5.15 and trying that
- removing all the Mellanox stuff that I installed before realizing I didn't actually need it (it wasn't doing anything on the old system, so I left it)
- disabling services that were clearly failing at boot
- at least a couple more things that I forgot I did

I'll attach some logs and the usual info I've seen requested in other threads; if anyone has experience with this type of issue, I'd appreciate the help :) I'm trying to avoid a reinstall, but I will if it's my only choice (reconfiguring my network around another router is a pain though, so I'd prefer to avoid it :/)

Thanks!

pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.15.19-1-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-5.15: 7.1-10
pve-kernel-helper: 7.1-10
pve-kernel-5.13: 7.1-7
pve-kernel-5.15.19-1-pve: 5.15.19-1
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.11.22-7-pve: 5.11.22-12
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-5
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1

/etc/network/interfaces
Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface enp8s0 inet manual

auto enp1s0
iface enp1s0 inet manual
    mtu 9000
#Port 1 (LAN)

auto enp1s0d1
iface enp1s0d1 inet manual
#Port 2 (WAN)

iface enp9s0 inet manual
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    address 192.168.44.10/24
    gateway 192.168.44.1
    bridge-ports enp1s0 enp9s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 32,66,69,99,116,537
    mtu 9000
#LAN Bridge - ADD VLANS in interfaces

auto vmbr1
iface vmbr1 inet manual
    bridge-ports enp1s0d1
    bridge-stp off
    bridge-fd 0
#WAN Bridge

auto vlan99
iface vlan99 inet static
    address 192.168.99.10/24
    vlan-raw-device enp9s0
#Management VLAN 99

Attached is the output of journalctl -b truncated to when the system finished booting
 


You bridged two NICs with your vmbr0 and your vmbr0 got a static IP. Maybe that is causing problems when you got 2 NICs that are not bonded that need to use the same IP?
 
Assuming you're referring to this section of my interfaces file

Code:
auto vmbr0
iface vmbr0 inet static
    address 192.168.44.10/24
    gateway 192.168.44.1
    bridge-ports enp1s0 enp9s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 32,66,69,99,116,537
    mtu 9000
#LAN Bridge - ADD VLANS in interfaces

that configuration came straight out of Proxmox's own configurator, I didn't touch it. It was working perfectly fine before today, and it seems to work fine once I restart the networking service.

I double checked, and, as far as I can tell, on Debian that's how bridges should be configured - see the "Configuring bridging in /etc/network/interfaces" section of https://wiki.debian.org/BridgeNetworkConnections

Am I misunderstanding and you meant something else?

In case it helps, BTW, enp9s0 (the onboard NIC) is unplugged, I'm keeping it there as a backup in case the Mellanox dies.
 
So... I gave up.

My nonexistent plans for the afternoon were cancelled, and my node is small enough that it only took a couple hours to reinstall Proxmox and restore all the VMs/CTs from a local backup (not to mention it gave me the push I needed to do some cleanup and abandon VMs that I hadn't turned on in so long they had cobwebs in the GUI).

The system is back to working as it should - the only change I made is that the bridge now only contains one interface (I'll connect enp9s0 if and when I ever need it. If it's not broken...)

I'd have loved to troubleshoot and fix this, but I have travel planned and didn't want to risk it :(

Before wiping (a.k.a. last night, while replying here and panicking) I wrote a script that:
1. checks if vmbr0 exists and, if it does, restarts networking
2. runs mount -a (in case you have failed network mounts, like I did) then
3. cycles through all the .conf files for VMs and CTs, finds the ones with onboot: 1 and starts them manually.

It doesn't respect the boot order configured in Proxmox, but you can add a couple of qm start <VMID> and pct start <CTID> calls after the mount and before the loops, so those services are started early (I removed mine, but that's why I check status before starting. Not that it's strictly needed, since start will just tell you "VM <ID> already running", but ¯\_(ツ)_/¯ ).

It logs to syslog (using logger by default); if you start it as ./script.sh debug, it dumps everything to the console instead.

It's meant to be run from root's crontab every minute and aborts immediately if it detects vmbr0, so it's more of a "one shot emergency restart" kind of thing; if someone has the same problem and can't afford (yet) to rebuild the node, I hope it helps. Note that if any of the VMs or CTs fail to start but the bridge comes back, the script won't do anything on the next run (but if the bridge is up, the network should be too, and then you can probably SSH into the server and troubleshoot).
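
For example, the crontab entry could look like this (the path is a placeholder; point it at wherever you save the script):

Code:
# In root's crontab (crontab -e): run the rescue script every minute.
# /root/network-rescue.sh is a hypothetical path.
* * * * * /root/network-rescue.sh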

Hope it helps, but mostly I hope nobody else ever needs it!

Bash:
#!/bin/bash
# Abort immediately if vmbr0 already exists; this script is only an
# emergency fallback for boots where the bridge never came up.
if ip link show vmbr0 &>/dev/null; then
  exit 0
fi

# Log to syslog by default; "./script.sh debug" prints to the console instead.
LOGGER="logger"
if [ "${1}" == "debug" ]; then
  LOGGER="echo"
fi

$LOGGER "restarting the network"
systemctl restart networking

# Wait 5 seconds for things to catch up
sleep 5

$LOGGER "mount all the mounts"
mount -a

# VMs

VM_DIR="/etc/pve/qemu-server"
for f in "${VM_DIR}"/*.conf; do
  VM_ID=$(basename "${f}" .conf)
  if ! grep -q '^onboot: 1' "${f}"; then
    $LOGGER "skipping VM ${VM_ID} (no onboot)"
    continue
  fi
  if [ "$(qm status "${VM_ID}")" == "status: running" ]; then
    $LOGGER "skipping VM ${VM_ID} (running)"
    continue
  fi
  $LOGGER "starting VM ${VM_ID}"
  qm start "${VM_ID}"
done

# Containers

CONTAINER_DIR="/etc/pve/lxc"
for f in "${CONTAINER_DIR}"/*.conf; do
  CT_ID=$(basename "${f}" .conf)
  if ! grep -q '^onboot: 1' "${f}"; then
    $LOGGER "skipping container ${CT_ID} (no onboot)"
    continue
  fi
  if [ "$(pct status "${CT_ID}")" == "status: running" ]; then
    $LOGGER "skipping container ${CT_ID} (running)"
    continue
  fi
  $LOGGER "starting container ${CT_ID}"
  pct start "${CT_ID}"
done
 
If you want to use both interfaces, create a bond in the Proxmox network GUI with enp1s0 & enp9s0 as the bond slaves, and then use that bond as the bridge-ports in the vmbr0 section above. There are 7 bond modes to choose from; active-backup and 802.3ad (LACP) are the most frequently used. If you use LACP, the device or switch on the other end of that connection must also have LACP set up on the receiving ports. There's no such requirement with active-backup.

See the "Example: Use a bond as bridge port" section in the link below to achieve what I think you want to do, although in your case you'd want the bond set up as active-backup.
https://pve.proxmox.com/wiki/Network_Configuration#_linux_bond
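
For reference, an active-backup bond wired into vmbr0 might look something like this in /etc/network/interfaces - a sketch based on the interface names and bridge settings earlier in this thread, not tested on this hardware; bond0 is just the conventional name the GUI would pick:

Code:
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0 enp9s0
    bond-mode active-backup
    bond-primary enp1s0
    bond-miimon 100
    mtu 9000
#Bond: Mellanox primary, onboard NIC as backup

auto vmbr0
iface vmbr0 inet static
    address 192.168.44.10/24
    gateway 192.168.44.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 32,66,69,99,116,537
    mtu 9000
#LAN Bridge over the bond

With active-backup, traffic fails over to enp9s0 only if enp1s0 loses link, so nothing is required on the switch side.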