Network fails at boot

mwrothbe

New Member
Jan 17, 2025
Lately, when my Proxmox 9.1.5 system boots (cold or warm), networking fails. None of the CTs or VMs load because the vmbr is unavailable, and the Proxmox host ends up on some random DHCP address instead of the one assigned in /etc/network/interfaces. If I go in and run 'systemctl restart networking', everything works. This started happening a day or so ago, and I'm not sure if it coincides with any updates or changes on my end. Any thoughts on what I can do to fix this? If there is a power interruption, I'd like everything to just come back online without manual intervention. Attached is the ifupdown2 debug log in case it helps. Thanks!
 

Attachments

I do the following on some servers:

In /etc/network/ifupdown2/ifupdown2.conf, locate the variable 'link_master_slave=1' and change it to 'link_master_slave=0'.
 
Does your server contain any Broadcom based NICs?
We always make sure to disable RDMA on these NICs if not in use, in our experience this setting causes everything from unstable networking to outright kernel panics / half-boots.
 
Nope. Two-port onboard Intel X710 and a four-port Intel X710 NIC. I can try disabling RDMA and see what happens, but I don't have the other symptoms you mention. Everything had been working fine for about a year with this HW config, up until a couple of days ago. Now I consistently have to restart the networking service in Proxmox after reboot. I'm not sure where to look for hints at the root cause (or I don't yet have the experience to interpret the logs I'm looking at).
 
This is a boot-time race condition. The NIC eno1np0 isn't initialized by the kernel yet when ifupdown2 runs, but it's available moments later, which is why systemctl restart networking fixes everything.

The log confirms this: all NICs on PCIe bus 6 (eno2np1, eno3np0) and bus 112 (ens8191f*) enumerate in the netlink dump, but eno1np0 is completely absent. It's likely on a different PCIe bus/slot that takes longer to probe.

The "random DHCP address" is probably because eno1np0 is your actual uplink to the LAN: the bridge gets its static IP but has no physical path to the gateway without that port, and the NIC may later grab a DHCP lease outside the bridge context.
Check for recent kernel/firmware updates that shifted initialization timing:
# Recent apt activity
grep -E 'pve-kernel|linux-image|firmware|ifupdown' /var/log/apt/history.log
# Or more broadly
tail -80 /var/log/apt/history.log

Also check dmesg for how late eno1np0 initializes:

dmesg | grep -i eno1np0
dmesg | grep -i 'pci.*net\|eth\|rename'
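
If the plain-text apt log is tedious to scan, a small helper along these lines can pull out just the entries that could plausibly shift NIC probe timing (a sketch: the function name is made up, and the package-name pattern assumes a stock PVE install):

```shell
# apt_net_changes: print only the apt history entries that touch kernel,
# firmware, or ifupdown packages. The log path is an argument so rotated
# copies (e.g. zcat history.log.1.gz > /tmp/old.log) can be inspected too.
apt_net_changes() {
    log="${1:-/var/log/apt/history.log}"
    [ -f "$log" ] || { echo "no log at $log" >&2; return 0; }
    grep -E 'Start-Date|proxmox-kernel|pve-kernel|linux-image|firmware|ifupdown' "$log" \
        || echo "no matching entries in $log"
}

apt_net_changes    # defaults to /var/log/apt/history.log
```

Matching a Start-Date line against the day the symptom appeared is usually enough to confirm or rule out an update as the trigger.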

Fix the race condition by making the networking service wait for the device via a systemd drop-in:

systemctl edit networking.service

Add:
[Unit]
Wants=sys-subsystem-net-devices-eno1np0.device
After=sys-subsystem-net-devices-eno1np0.device

This tells systemd to wait for the kernel to register eno1np0 before starting networking.service. If the NIC appears within the default systemd device timeout (typically 90s), networking proceeds normally. If the NIC is truly dead/missing, the boot will eventually continue without it (same as today, but with a delay).

Alternatively, if you'd rather not delay the whole networking service, add a wait loop directly to the vmbr0 stanza in /etc/network/interfaces:

iface vmbr0 inet static
...
pre-up /bin/bash -c 'for i in $(seq 1 30); do [ -d /sys/class/net/eno1np0 ] && break; sleep 1; done; true'

This polls for up to 30 seconds for eno1np0 to appear before the bridge tries to enslave it. The trailing ; true ensures the bridge still comes up (degraded) if the NIC never shows.
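
For readability, here is that one-liner unrolled into a standalone function (a sketch: the function name is illustrative, and the sysfs root is parameterized only so it can be dry-run off-host; on a real system it is /sys/class/net):

```shell
# wait_for_iface: poll until the kernel registers the named NIC, for at most
# $2 seconds (default 30). Always returns 0, so a pre-up hook using it can
# never abort the bridge bring-up -- same idea as the trailing '; true'.
wait_for_iface() {
    iface="$1"; timeout="${2:-30}"; i=0
    root="${SYSFS_NET:-/sys/class/net}"    # overridable for testing
    while [ "$i" -lt "$timeout" ]; do
        [ -d "$root/$iface" ] && return 0  # device directory appeared
        sleep 1
        i=$((i + 1))
    done
    echo "warning: $iface absent after ${timeout}s" >&2
    return 0
}

wait_for_iface lo 5 && echo "interface present"
```

Dropping this in a script under /usr/local/sbin and calling it from pre-up keeps the interfaces file tidy if you end up needing the same wait on more than one port.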

Verify: after applying either fix, test with a reboot. You should see eno1np0 in the bridge and a clean exit status 0 in /var/log/ifupdown2/.

Most likely a pve-kernel or pve-firmware update changed PCIe/NIC driver probe timing. This is a common class of issue after kernel bumps: NICs that were "fast enough" before suddenly aren't. The systemd device dependency is the robust, proper fix because it doesn't rely on timing assumptions.
 
Thanks for the detailed response! I believe you are right. I do see "Timed out waiting for device sys-subsystem-net-devices-eno1np0.device - /sys/subsystem/net/devices/eno1np0." in the console during Proxmox boot. I removed eno1np0 from the vmbr definition, as it really doesn't need to be there, but removing it hasn't changed the symptom. I also just tried the systemd drop-in you suggest above, and that doesn't change the symptom either.

eno1np0 is the onboard net port for the BMC, and I'm using it to remote control and view the output during BIOS/boot. I have a dedicated LAN cable attached to it from my router so I can control things when the system is off but powered, so it didn't need to be in the vmbr at all. Also, @Deerom, it seems I lied: this net port is in fact a Broadcom chip. Based on some other posts about Broadcom chips, I tried adding:
Code:
echo "blacklist bnxt_re" >> /etc/modprobe.d/blacklist-bnxt_re.conf
update-initramfs -u
but no change.

With everyone's help I feel like I'm making progress, but not sure what to try next. Any additional input is appreciated!
 
Now I'm baffled. I removed eno1np0 from the Networking GUI and rebooted, but no change. I still see "Timed out waiting for device sys-subsystem-net-devices-eno1np0.device - /sys/subsystem/net/devices/eno1np0." in the console during Proxmox boot, and there is no network connection unless I systemctl restart networking.
 
Does anyone know where ifupdown2-pre.service could be getting its device info from at startup, if not from /etc/network/interfaces (attached) or the "source" directory in interfaces ("/etc/network/interfaces.d/*", which in my case only contains an empty "sdn" file)? I removed all references I could find to "eno1np0", but at boot I still see:
Code:
[ TIME ]  Timed out waiting for device sys-subsystem-net-devices-eno1np0.device - /sys/subsystem/net/devices/eno1np0.
[FAILED]  Failed to start ifupdown2-pre.service - Helper to synchronize boot up for ifupdown.
I can't figure out where it's deciding that eno1np0 is a device that it needs to initiate. Attached is my "ip a" result, and eno1np0 isn't there.
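
One way to hunt for where the stale name still lives is a recursive sweep over the places ifupdown2, udev, and systemd read their config from (a sketch: the function name is made up and the path list is only the usual suspects, including /etc/systemd, where a leftover drop-in from an earlier `systemctl edit` would sit, and /run/systemd, where generated units land):

```shell
# find_iface_refs: grep a set of directories for any surviving mention of
# an interface name, printing file, line number, and the matching line.
find_iface_refs() {
    name="$1"; shift
    for root in "$@"; do
        # skip directories that don't exist on this host
        [ -d "$root" ] && grep -rn "$name" "$root" 2>/dev/null
    done
    return 0   # "nothing found" is a valid result, not an error
}

find_iface_refs eno1np0 \
    /etc/network /etc/udev/rules.d /etc/systemd /run/systemd \
    /etc/default /etc/modprobe.d
```

Any hit outside /etc/network/interfaces (for example a networking.service drop-in still carrying a Wants= on the device unit) would explain the boot-time wait surviving the interfaces cleanup.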

A few weeks back I RMAed my board, which I think is the cause of this issue: new HW, so the onboard network chips changed. It's kind of weird that this symptom of the network failing after boot didn't pop up immediately after the board replacement, but it's the only likely cause that makes sense to me. If anyone has info on how I can flush out the remaining references to "eno1np0" seen by ifupdown2-pre.service, I'd appreciate it.
Thanks!
 

Attachments