Network won't start on boot

danb35

Renowned Member
Oct 31, 2015
I have a three-node PVE 6.3 cluster running on three nearly-identical blades of a Dell PowerEdge C6100 (identical except for RAM: two nodes have 48 GB; the third has 96 GB), each with 2x Xeon X5650 CPUs and a Chelsio T420-CR 2x 10Gbit NIC. Each node pretty consistently fails to bring up either interface on boot. systemctl restart networking fails with an error, and ifup vmbr0 reports that another instance of ifup is already running. But the last time this happened (earlier today), I noticed that there were lots (over 50) of systemd-udevd processes running. That struck me as abnormal, so I ran systemctl restart systemd-udevd. It succeeded, though it took over a minute to do so. Once it did, I ran systemctl restart networking; it succeeded in under a second, both interfaces came up, and the system was online.
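
For anyone hitting the same symptom, the manual recovery sequence described above boils down to the following, run from the physical console (the process count and the final address check are illustrative additions, not part of the original steps):
Code:
# dozens of stuck udev workers is the telltale sign
ps -C systemd-udevd --no-headers | wc -l
# restart udev first, then networking
systemctl restart systemd-udevd
systemctl restart networking
ip -br addr show vmbr0    # confirm the bridge came up with its address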

/etc/network/interfaces is identical among the three systems except for IP addresses:
Code:
root@pve1:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface enp3s0f4 inet manual

auto enp3s0f4d1
iface enp3s0f4d1 inet static
    address 192.168.5.101/24

iface eno1 inet manual

iface eno2 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.3/24
    gateway 192.168.1.1
    bridge-ports enp3s0f4
    bridge-stp off
    bridge-fd 0
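
(Aside: with ifupdown2, the parsed configuration and its match against the running state can be sanity-checked with ifquery; this is a general check, not something specific to this problem:)
Code:
ifquery -a       # show the configuration as ifupdown2 parses it
ifquery -a -c    # check the running state against the configuration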

dmesg reports missing firmware--Proxmox has, for some reason, removed it from their packages (it's in the original Debian packages)--but that doesn't seem to prevent the interface from coming up manually:
Code:
root@pve1:~# dmesg | grep cxgb
[    2.869044] cxgb4 0000:03:00.4: Direct firmware load for cxgb4/t4fw.bin failed with error -2
[    2.869050] cxgb4 0000:03:00.4: unable to load firmware image cxgb4/t4fw.bin, error -2
[    2.869377] cxgb4 0000:03:00.4: Coming up as MASTER: Initializing adapter
[    3.588928] cxgb4 0000:03:00.4: Direct firmware load for cxgb4/t4-config.txt failed with error -2
[    3.600882] cxgb4 0000:03:00.4: Hash filter with ofld is not supported by FW
[    4.352902] cxgb4 0000:03:00.4: Successfully configured using Firmware Configuration File "Firmware Default", version 0x0, computed checksum 0x0
[    4.560904] cxgb4 0000:03:00.4: max_ordird_qp 255 max_ird_adapter 589824
[    4.608903] cxgb4 0000:03:00.4: Failed to read filter mode/mask via fw api, using indirect-reg-read
[    4.707555] cxgb4 0000:03:00.4: 98 MSI-X vectors allocated, nic 16 per uld 16
[    4.707565] cxgb4 0000:03:00.4: 32.000 Gb/s available PCIe bandwidth (5 GT/s x8 link)
[    4.742324] cxgb4 0000:03:00.4 eth0: eth0: Chelsio T420-CR (0000:03:00.4) 1G/10GBASE-SFP
[    4.742685] cxgb4 0000:03:00.4 eth1: eth1: Chelsio T420-CR (0000:03:00.4) 1G/10GBASE-SFP
[    4.742788] cxgb4 0000:03:00.4: Chelsio T420-CR rev 2
[    4.742790] cxgb4 0000:03:00.4: S/N: PT36121180, P/N: 110112040F0
[    4.742793] cxgb4 0000:03:00.4: Firmware version: 1.16.63.0
[    4.742795] cxgb4 0000:03:00.4: Bootstrap version: 255.255.255.255
[    4.742797] cxgb4 0000:03:00.4: TP Microcode version: 0.1.9.4
[    4.742798] cxgb4 0000:03:00.4: No Expansion ROM loaded
[    4.742800] cxgb4 0000:03:00.4: Serial Configuration version: 0x4271203
[    4.742802] cxgb4 0000:03:00.4: VPD version: 0x1
[    4.742804] cxgb4 0000:03:00.4: Configuration: RNIC MSI-X, Offload capable
[    4.744883] cxgb4 0000:03:00.4 enp3s0f4d1: renamed from eth1
[    4.765240] cxgb4 0000:03:00.4 enp3s0f4: renamed from eth0
[  784.532945] cxgb4 0000:03:00.4 enp3s0f4d1: SR module inserted
[  784.736967] cxgb4 0000:03:00.4 enp3s0f4: SR module inserted
[  785.108039] cxgb4 0000:03:00.4: Interface enp3s0f4d1 is running DCBx-IEEE
[  785.108068] cxgb4 0000:03:00.4 enp3s0f4d1: link up, 10Gbps, full-duplex, Tx/Rx PAUSE
[  785.307924] cxgb4 0000:03:00.4: Interface enp3s0f4 is running DCBx-IEEE
[  785.307938] cxgb4 0000:03:00.4 enp3s0f4: link up, 10Gbps, full-duplex, Tx/Rx PAUSE
[  787.206943] cxgb4 0000:03:00.4: Port 0 link down, reason: Link Down
[  787.206967] cxgb4 0000:03:00.4 enp3s0f4: link down
[  787.806610] cxgb4 0000:03:00.4: Interface enp3s0f4 is running DCBx-IEEE
[  787.806634] cxgb4 0000:03:00.4 enp3s0f4: link up, 10Gbps, full-duplex, Tx/Rx PAUSE
[  787.906568] cxgb4 0000:03:00.4: Port 0 link down, reason: Link Down
[  787.906586] cxgb4 0000:03:00.4 enp3s0f4: link down
[  788.406302] cxgb4 0000:03:00.4: Interface enp3s0f4 is running DCBx-IEEE
[  788.406325] cxgb4 0000:03:00.4 enp3s0f4: link up, 10Gbps, full-duplex, Tx/Rx PAUSE
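
(Aside: if you want to silence those firmware warnings--purely optional, since the adapter clearly initializes with its flash defaults--one hedged approach is to place the blob the driver asks for under /lib/firmware by hand; the upstream linux-firmware tree ships cxgb4 firmware:)
Code:
ls /lib/firmware/cxgb4/    # see what is currently present, if anything
# copy t4fw.bin from the upstream linux-firmware repository into place,
# then rebuild the initramfs so early boot can find it
mkdir -p /lib/firmware/cxgb4
cp t4fw.bin /lib/firmware/cxgb4/
update-initramfs -u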
 
Well, after having mostly gone away for a few months, this problem is back. I've upgraded my cluster to 7.0, restarted each node more than once, and the network has come up successfully. Until today.

But unlike what I posted above, systemctl restart systemd-udevd followed by systemctl restart networking does not fix the problem. At boot, there were (as described above) many systemd-udevd processes running, and systemctl restart systemd-udevd did address that. However, systemctl restart networking did not complete in a matter of seconds, nor did it result in a working network connection--instead it took two minutes to complete (timed, and repeatably that long), and reported that "a dependency job for networking.service failed. See 'journalctl -xe' for details."

In the roughly 200 lines of journalctl -xe output after the timestamp of the restart command, I do see a number of errors, but I'd expect them to be the result of the missing network, not its cause. I'm seeing:
  • Two NFS storage systems are offline
  • pvesr (the replication service?) can't start
Each of these messages is repeated several times over the course of the two minutes, at which point ifupdown2-pre.service is noted to have failed.
I see nothing else in those two minutes, and all of those messages repeat constantly before and after that two-minute window.

Where else should I be looking? This is a persistent problem, and it makes me very reluctant to restart my nodes when updates call for it. Surely there's something else that would point to the problem--what is it?
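
(For reference, the failed dependency and its logs can usually be pulled out with something along these lines:)
Code:
systemctl --failed                               # which unit actually failed
systemctl list-dependencies networking.service   # what networking waits on
journalctl -b -u ifupdown2-pre.service           # logs for the failing unit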
 
could you provide the output of pveversion -v and the full journal from boot until the boot is completed (indicating the time when your "intervention" took place) please?
 
Once I figure out how to get it off the machine with no network access, sure--I should be able to use SneakerNet with a USB stick.

For the "full journal from boot", would that just be the output of journalctl? I expect it will be quite large.
 
yes, journalctl -b
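e.g. redirected straight to a file you can copy onto the stick (mount point illustrative):
Code:
journalctl -b > /mnt/usb/journalctl.txt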
 
Thanks. Here's the output of pveversion -v:
Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-10 (running version: 7.0-10/d2f465d3)
pve-kernel-5.11: 7.0-6
pve-kernel-helper: 7.0-6
pve-kernel-5.4: 6.4-4
pve-kernel-5.11.22-3-pve: 5.11.22-6
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.11.22-1-pve: 5.11.22-2
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-4.15: 5.4-16
pve-kernel-4.15.18-27-pve: 4.15.18-55
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 16.2.5-pve1
ceph-fuse: 16.2.5-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.2.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-5
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-9
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.7-1
proxmox-backup-file-restore: 2.0.7-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-8
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-11
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

After boot, I ran date && time systemctl restart systemd-udevd && time systemctl restart networking. The console output appears below, and the subsequent output of journalctl -b is attached. As shown in the screenshot below, I ran that command at 10:37:53, and the two commands combined ran for 3 minutes, 30 seconds.
[screenshot: console output of the date and timed restart commands]
 

Attachments

  • journalctl.txt (163.8 KB)
Code:
Aug 03 10:35:37 pve3 systemd-udevd[1157]: IPI0001:00: Worker [1184] processing SEQNUM=2826 is taking a long time
Aug 03 10:35:37 pve3 systemd-udevd[1157]: dmi-ipmi-si.0: Worker [1227] processing SEQNUM=3102 is taking a long time
Aug 03 10:35:37 pve3 systemd-udevd[1157]: IPI0001:00: Worker [1224] processing SEQNUM=2982 is taking a long time
Aug 03 10:36:37 pve3 systemd[1]: ifupdown2-pre.service: Main process exited, code=exited, status=1/FAILURE

so ifupdown2, and thus networking, fails because udev is blocked on something (IPMI-related?). those udev workers seem to be blocked for a long time:

Code:
Aug 03 10:38:32 pve3 kernel: INFO: task systemd-udevd:1184 blocked for more than 120 seconds.
Aug 03 10:38:32 pve3 kernel:       Tainted: P          IO      5.11.22-3-pve #1
Aug 03 10:38:32 pve3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 03 10:38:32 pve3 kernel: task:systemd-udevd   state:D stack:    0 pid: 1184 ppid:  1157 flags:0x00004004
Aug 03 10:38:32 pve3 kernel: Call Trace:
Aug 03 10:38:32 pve3 kernel:  __schedule+0x2ca/0x880
Aug 03 10:38:32 pve3 kernel:  schedule+0x4f/0xc0
Aug 03 10:38:32 pve3 kernel:  __get_guid+0xf8/0x130 [ipmi_msghandler]
Aug 03 10:38:32 pve3 kernel:  ? wait_woken+0x80/0x80
Aug 03 10:38:32 pve3 kernel:  __bmc_get_device_id+0xe2/0xa40 [ipmi_msghandler]
Aug 03 10:38:32 pve3 kernel:  ipmi_add_smi+0x3e3/0x590 [ipmi_msghandler]
Aug 03 10:38:32 pve3 kernel:  try_smi_init+0x5d3/0x6a0 [ipmi_si]
Aug 03 10:38:32 pve3 kernel:  init_ipmi_si+0xd2/0x162 [ipmi_si]
Aug 03 10:38:32 pve3 kernel:  ? 0xffffffffc0e6d000
Aug 03 10:38:32 pve3 kernel:  do_one_initcall+0x48/0x1d0
Aug 03 10:38:32 pve3 kernel:  ? kmem_cache_alloc_trace+0xf6/0x200
Aug 03 10:38:32 pve3 kernel:  ? do_init_module+0x28/0x290
Aug 03 10:38:32 pve3 kernel:  do_init_module+0x62/0x290
Aug 03 10:38:32 pve3 kernel:  load_module+0x25ca/0x2820
Aug 03 10:38:32 pve3 kernel:  ? security_kernel_post_read_file+0x5c/0x70
Aug 03 10:38:32 pve3 kernel:  __do_sys_finit_module+0xc2/0x120
Aug 03 10:38:32 pve3 kernel:  __x64_sys_finit_module+0x1a/0x20
Aug 03 10:38:32 pve3 kernel:  do_syscall_64+0x38/0x90
Aug 03 10:38:32 pve3 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 03 10:38:32 pve3 kernel: RIP: 0033:0x7f5b6b7899b9
Aug 03 10:38:32 pve3 kernel: RSP: 002b:00007fff84febe58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Aug 03 10:38:32 pve3 kernel: RAX: ffffffffffffffda RBX: 0000555c5d7e8270 RCX: 00007f5b6b7899b9
Aug 03 10:38:32 pve3 kernel: RDX: 0000000000000000 RSI: 00007f5b6b914e2d RDI: 0000000000000010
Aug 03 10:38:32 pve3 kernel: RBP: 0000000000020000 R08: 0000000000000000 R09: 0000555c5d7d3130
Aug 03 10:38:32 pve3 kernel: R10: 0000000000000010 R11: 0000000000000246 R12: 00007f5b6b914e2d
Aug 03 10:38:32 pve3 kernel: R13: 0000000000000000 R14: 0000555c5d812f80 R15: 0000555c5d7e8270

I suggest disabling the ipmi-related modules ("ipmi_si") and updating the initramfs afterwards
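
a minimal sketch of those two steps (the blacklist file name is arbitrary):
Code:
echo "blacklist ipmi_si" >> /etc/modprobe.d/blacklist.conf
update-initramfs -u
reboot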
 
I suggest disabling the ipmi-related modules ("ipmi_si") and updating the initramfs afterwards
That appears to have resolved the issue. Following https://linuxconfig.org/how-to-blacklist-a-module-on-ubuntu-debian-linux/, I created /etc/modprobe.d/blacklist.conf and added "blacklist ipmi_si" there, then ran update-initramfs -u and rebooted. The system booted more quickly than it has recently, and was on the network immediately. A couple of remaining questions:

1. There are three other IPMI-related modules loaded on the system:
Code:
root@pve3:~# lsmod | grep ipmi
ipmi_ssif              36864  0
ipmi_devintf           20480  0
ipmi_msghandler       110592  2 ipmi_devintf,ipmi_ssif
Should I blacklist them as well? I didn't do anything with them this time around since they didn't appear to be causing the problem, based on what you saw (and now based on what actually happened in the reboot).

2. Is there anything else I should be doing here?

Thanks again for the help, looks like this might finally be corrected.
 
if they don't cause any apparent issues I wouldn't blacklist them. if you don't need IPMI access from within the host, then there is nothing further to be done.
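
a quick post-reboot check that the module really stayed out:
Code:
lsmod | grep ipmi_si || echo "ipmi_si not loaded"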
 
Great, thanks. I'm a little reluctant to consider this "solved", as there was a period of months in which it worked fine, but this looks like an identified problem and solution.
 
likely some issue with the module and your system's firmware - a kernel update might have introduced an incompatibility, and the actual bug could be on either end (kernel or IPMI firmware).
 
[attachment: screenshot of the boot error message]
I get this message before install and after booting (once installed). I have to `systemctl restart networking` to make networking work.
 
please open a new thread with the full boot journal and exact kernel version you are booting.
 
I get this message before install and after booting (once installed). I have to `systemctl restart networking` to make networking work.
Hi @Type1J, I had this problem too with my Minisforum HM80 (Ryzen 7 4800U), freshly installed.
I got it solved by going to the CLI physically and restarting the network:
Code:
systemctl restart networking
Then I updated through the WebUI, and there was also a kernel update:
5.13.19-4 -> 5.13.19-9 (7.1-5 -> 7.1-7)
Hope this helps; if not, it's best you really open a new thread.
 
Today on a:
proxmox-backup-server 2.4.3-1 (running version: 2.4.3)
(But I had already successfully carried out this workaround on Proxmox 7.x.)

I have the same error: after one reboot, the network interface was down.
For me the solution was, on the local machine (not remote ;-):
First, make sure ifupdown2 is installed.
I purged every older ifupdown package (without the "2") and reinstalled the one with the "2":
Code:
apt purge ifupdown
apt install --reinstall ifupdown2
This reinstall recreates the symlinks so networking is started automatically again.

@spirit:
after this, my
Code:
systemctl status ifupdown2-pre.service

is not masked, it is active. For my system, ifupdown2-pre.service was not the problem ...

regards,
maxprox
 
