[SOLVED] Host network not starting and host failing to reboot (PVE 7.1-7)

milsav92

Dec 3, 2021
Hi,

Since kernel 5.13.19-2-pve, one of the PVE hosts hangs on boot for 2 minutes while starting ifupdown2-pre.service and systemd-udev-settle.service.
Once booted, the system has no networking, and it has to be started manually with systemctl start networking.service.
Also, when I issue a restart command the system fails to reboot and has to be restarted with the magic SysRq key combination.

The system uses the default LVM setup on an NVMe drive, with no ZFS storage, plus an SSD that is passed through to a VM running on the host.

Memtest has been run and found no issues.

If started using the 5.13.19-1-pve kernel there are no issues.

There were no hardware or software changes between 5.13.19-1-pve and 5.13.19-2-pve installation.
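To compare the two kernels it may help to see which units account for the delay. This is only a minimal sketch: it assumes `systemd-analyze blame` prints one unit per line with the elapsed time first (for example `2min 1.005s some.service`), and the helper name `slow_units` is made up for illustration.

```shell
#!/bin/sh
# slow_units: hypothetical helper, not part of systemd.
# Reads `systemd-analyze blame` output on stdin and prints the names of
# units whose start-up time is reported in minutes (first field ends in "min").
slow_units() {
    awk '$1 ~ /^[0-9]+min$/ { print $NF }'
}

# Usage on the affected host (not run here):
#   systemd-analyze blame | slow_units
#   journalctl -b -u ifupdown2-pre.service -u systemd-udev-settle.service
```

Running the same two commands under 5.13.19-1-pve and 5.13.19-2-pve should make the difference between the boots easy to spot.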
 

Attachments

  • 5.13.19-1-basic-info.txt (1.4 KB)
  • 5.13.19-1-boot-ifupdown.txt (3 KB)
  • 5.13.19-1-boot-elapsed-time.txt (390 bytes)
  • 5.13.19-2-basic-info.txt (1.4 KB)
  • 5.13.19-2-boot-ifupdown.txt (4.6 KB)
  • 5.13.19-2-boot-elapsed-time-and-manual-network-start.txt (5.4 KB)
  • 5.13.19-2-reboot.txt (1.7 KB)
Exactly the same has happened to me, and I mean exactly, on my home PVE. I've not had time to get to the bottom of it yet, but here are my suspicions and possible steps forward (when I get time). Let me know if you've made any progress.
My hardware is a PN50 mini PC with a Ryzen 7 CPU, 32 GB RAM, a 500 GB M.2 NVMe drive and a 500 GB external USB SSD. More importantly, as well as the onboard Ethernet NIC, it has an internal WiFi NIC (which I've never used anyway). My gut is screaming that the WiFi NIC is the problem.
Two reasons really.
First, when I look at a similar setup I have at work (which I haven't updated, and currently have no intention of updating), the pending update consists of a Debian networking update and a few PVE updates. Like you, once I've started networking manually, Proxmox works just fine as far as I can tell. To me that indicates the networking update is probably the root of my problem.
Second, when I manually start networking, the Ethernet NIC comes up but the WiFi NIC remains down. My question to you here: do you have more than one NIC, with differing chipsets?
I've found similar situations from the past, and one had a solution and explanation that ring true here. If I remember correctly, at startup there's an if-pre-up sequence that checks the interfaces; either a rogue driver or rogue hardware fails to respond in a timely manner (our long startup time?), and as a result the pre-up sequence fails and networking is not started. Manually starting networking doesn't involve this pre-up. Their solution was to 'mask' the pre-up so that it doesn't take place (I can't remember the command sequence, though).
I'll try that as a last resort, as it seems to me a bit like removing all your car's warning lights so that you can turn the ignition key. I'm more inclined to just open the box and see if I can remove the WiFi NIC.
When I get time to take this further, I'll update.
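For reference, masking a systemd unit looks roughly like the sketch below. It assumes the unit meant is `ifupdown2-pre.service` (the one named in the first post); the older thread being recalled may well have masked a different unit. The `run` wrapper and the `APPLY` variable are illustrative only, so that nothing changes unless asked to.

```shell
#!/bin/sh
# Assumption: the pre-up step is ifupdown2-pre.service, as named in the first post.
UNIT="${UNIT:-ifupdown2-pre.service}"

# Print the command by default; set APPLY=1 to really run it (as root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Masking makes systemd refuse to start the unit; unmasking undoes it.
mask_preup()   { run systemctl mask "$UNIT"; }
unmask_preup() { run systemctl unmask "$UNIT"; }

mask_preup
```

As written this only prints the command. Calling `unmask_preup` with `APPLY=1` reverses the change, which matters if masking turns out not to help.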
 
When booted system has no networking
I suspect you have encountered the same bug/limitation as me. Proxmox v7.1 behaves very badly if started without external network services (in my case a Proxmox VM runs the router providing DHCP, DNS and the route to the Internet). For some reason, adding a manual DHCP entry for Proxmox in the VM router masks the limitation, a workaround that was not required in Proxmox v7.0. See https://forum.proxmox.com/threads/h...-vm-with-pass-through-nic.100091/#post-435373
 
Yeah, my hardware is pretty similar to yours (down to the unused WiFi NIC), so I suspect that it is related to the AMD GPU (see the following thread: https://forum.proxmox.com/threads/proxmox-wont-boot-with-latest-kernel-5-13-19-2-pve.100825/ ). One Intel NUC has no problem, masking the service doesn't solve the problem, and the failure to power off/reboot points me to the AMD GPU kernel problem. When I have the time I'll test the 5.15 kernel and see if it solves the problem.
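One way to check whether the amdgpu driver is involved is to look at the kernel messages from the boot that hung. A rough sketch: `amdgpu_lines` is a made-up helper name, and reading the previous boot with `-b -1` assumes the journal is kept across reboots.

```shell
#!/bin/sh
# amdgpu_lines: hypothetical helper that keeps only the kernel log lines
# mentioning amdgpu (case-insensitive) from stdin.
amdgpu_lines() {
    grep -i 'amdgpu'
}

# Usage on the affected host (not run here):
#   journalctl -k -b -1 --no-pager | amdgpu_lines   # kernel messages, previous boot
#   dmesg | amdgpu_lines                            # current boot
```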

Thanks, I'll check it out, but I suspect that this is more of an AMD GPU kernel problem.
 
I've tried the 5.15 kernel and can confirm that it solves the problems. However, 5.15.5 has some problems related to drive cache ( https://bugzilla.kernel.org/show_bug.cgi?id=215137 ), so I'll wait for at least 5.15.6.

As I see it there are 3 options to solve these problems:
  1. Use kernel 5.15 with
    Bash:
    apt update && apt install pve-kernel-5.15
  2. Use GRUB to boot a previous kernel version. This is not stable: the entry index can change when a new kernel is installed, and your boot order may differ, so check your own menu first ("1>2" means the third entry inside the second top-level menu item, counting from 0 as GRUB does).
    Bash:
    grub-reboot "1>2"
    Then, once/if a fixed kernel is released, use
    Bash:
    grub-editenv /boot/grub/grubenv unset next_entry
    to reset GRUB to the default kernel version.
  3. Install the working 5.13 kernel and remove the problematic one
    Bash:
    apt install pve-kernel-5.13=7.1-4
    apt remove pve-headers-5.13.19-2-pve pve-kernel-5.13.19-2-pve
    apt autoremove
    Hold kernel updates until a fixed version is released
    Bash:
    apt-mark hold 'pve-kernel-5.13'
    and, once/if a fixed version is released, release the hold and upgrade
    Bash:
    apt-mark unhold 'pve-kernel-5.13'
I've gone with option 3.
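After option 3 it is worth confirming that the hold is in place and which kernel is running. A small sketch, assuming `apt-mark showhold` prints one held package name per line; `is_held` is a hypothetical helper, not an apt command.

```shell
#!/bin/sh
# is_held: hypothetical helper. Reads `apt-mark showhold` output on stdin
# and succeeds only if the given package name appears as a whole line.
is_held() {
    grep -qxF -- "$1"
}

# Usage on the host (not run here):
#   apt-mark showhold | is_held pve-kernel-5.13 && echo "kernel meta-package is held"
#   uname -r    # after a reboot this should no longer report 5.13.19-2-pve
```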
 
You're absolutely right. Mine too stalls at the 'amdgpu' message. I don't see the connection between GPUs and networking though; maybe there's something else going on afterwards. I'm a noob when it comes to Linux. I did try blacklisting iwlwifi and disabling the WiFi NIC in the BIOS, to no avail. I stopped short of removing the NIC as there were too many teeny tiny screws and nuts. After reading that linked post I too have followed in your steps and gone with option 3, and it starts/reboots as sweet as a nut.
As I've already said, I'm a Linux noob and didn't even know the apt-mark command existed. I presume the gist is that an 'apt upgrade' then skips the kernel. So sorry, one more question: should I hold off on all upgrades until the next kernel is released, or is it safe(ish) to keep installing the other updates?
Btw, thanks for your efforts.
 
There shouldn't be any problems upgrading other software as long as it doesn't depend on some new kernel capability/ABI/API (which generally shouldn't change within the 5.13 branch). Personally I've never had any problems on Debian. The only downside is missing the vulnerability and bug fixes that a newer kernel brings (which don't really matter if the problems in this thread are not fixed in that kernel). Proxmox 7.1 should run fine as it was released with the 5.13 kernel; 7.2 will most likely ship with 5.15 as per https://forum.proxmox.com/threads/opt-in-linux-kernel-5-15-for-proxmox-ve-7-x-available.100936/
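To see what a regular upgrade would touch while the kernel is held, apt can simulate it first. A sketch, assuming the usual `Inst <package> ...` lines in the output of `apt-get -s upgrade`; `would_upgrade` is a made-up helper name.

```shell
#!/bin/sh
# would_upgrade: hypothetical helper. Reads `apt-get -s upgrade` output on
# stdin and prints the package name from each "Inst" line, i.e. the packages
# the simulated upgrade would install or upgrade.
would_upgrade() {
    awk '$1 == "Inst" { print $2 }'
}

# Usage on the host (simulation only, nothing is installed):
#   apt-get -s upgrade | would_upgrade
# A held package should not show up in that list.
```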
 
With the latest kernel update (pve-kernel-5.13/stable 7.1-7) there are no more boot, network or shutdown/reboot issues with the amdgpu (in my case), so as far as I'm concerned this issue is solved on my end.
 
I can also say the latest kernel update has cleared any reboot issues I had. Following in milsav92's footsteps I had my kernel held at
Linux 5.13.19-1-pve #1 SMP PVE 5.13.19-3 (Tue, 23 Nov 2021 13:31:19 +010
Today (20/2/22) I released that hold and updated the system. Kernel now reports version
Linux 5.13.19-4-pve #1 SMP PVE 5.13.19-9 (Mon, 07 Feb 2022 11:01:14 +0100)
and reboots with no problems.
Thanks again milsav. :)
 
