Proxmox host shuts down when trying to install Proxmox 8?

avggeek

Hello,

I've encountered a very strange issue with one of my Proxmox hosts in a lab cluster. The host has the following hardware configuration:

  • CPU: i5-8600T
  • RAM: 64GB
  • Storage: Samsung 980 NVMe (250 GB)
  • Network: Intel I219-LM / Intel 82599ES 10-Gigabit SFP+
This host is part of a 3-node cluster with almost identical hardware. The primary difference is the storage device: the other hosts use a Samsung OEM NVMe (PM981a) instead of the 980. Until this week, the cluster was running PVE7, but I decided I should upgrade to PVE8.

When I tried to upgrade this specific host to PVE8 through the apt dist-upgrade method, it abruptly shut down while extracting the pve-qemu package. I reinstalled PVE7, reconfigured the node and then tried to upgrade it again, only to hit the same issue.

Thinking it was a hardware issue, I booted into a Debian Live CD and checked whether the drive was healthy using smartctl -a and smartctl -t long. There was no obvious issue in the smartctl output, so I next upgraded the firmware of the NVMe drive to the latest available on the Samsung website. I then tried again to install PVE8 but had the same issue: the machine abruptly shuts down.

I've now tried various options with no success:

  • Install Debian 12 and then PVE8 over PXE Boot
    • Result: Machine shuts off when installing kernel package
  • Install PVE8 from ISO (Ventoy disk)
    • Result: Machine shuts off when installing pve-firmware package
  • Update: Install PVE8 from PXE Boot
    • Result: Machine shuts off (Could not see which package was being installed)
Currently the only options that succeed are installing PVE7 and running the Debian 12 Live CD.

As far as I can tell, the device is not overheating (CPU temp around 50 °C, NVMe around 50 °C) and indeed it runs PVE7 perfectly fine.

I have to admit I'm completely baffled by this issue. Does anyone have any ideas how I should proceed to try & get PVE8 installed?

Further Updates:

  • I've updated the BIOS to the latest version available on the Lenovo site. Does not seem to have helped.
  • Using the Lenovo Diagnostic tools, I ran an extended self test of the NVMe drive and the drive passed the test.
 
I tried installing the opt-in 6.2 kernel for PVE7 on this host. After installation, when I tried to reboot the host it crashed shortly after the GRUB screen.

I could not find anything in kern.log or syslog, so I eventually rolled back to the 5.15 kernel.
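For anyone repeating this check, here is a minimal sketch of scanning a saved log for lines that often accompany a crash or forced power-off. The keyword list is only an assumption about what might be worth looking for, and scan_log is a hypothetical helper name; an abrupt shutdown may well leave none of these behind.

```shell
#!/bin/sh
# Hypothetical helper: print numbered lines of a log file that contain
# keywords commonly seen around kernel crashes or hardware faults.
# The keyword list is a starting point (an assumption), not a complete set.
scan_log() {
    grep -n -i -E 'panic|oops|mce:|hardware error|thermal|watchdog' "$1"
}
```

It could be run as `scan_log /var/log/kern.log` or against a journal export; some hits (for example routine watchdog messages) are harmless and need to be read in context.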

Still looking for help on this issue.
 
Good idea for testing. If you don't want to crash your Proxmox each time, try the latest Ubuntu Live installer (which should have a 6.2 kernel) and see if it boots (but don't install it).
There is probably an incompatibility between the newer kernel and your hardware. Try updating the motherboard BIOS or search this forum (and the internet in general) for known issues with your motherboard or CPU.
 

I did try Debian Bookworm Live which runs Kernel 6.1 and it booted successfully.

I have been searching for any reports of incompatibilities for the CPU or the NVMe drive but haven't had much luck. What's even more confusing is that a 2nd node with the same CPU/Motherboard (but different NVMe) upgraded to PVE8 without any issues.
 
I did try Debian Bookworm Live which runs Kernel 6.1 and it booted successfully.
Unfortunately, Proxmox uses Ubuntu kernels.
I have been searching for any reports of incompatibilities for the CPU or the NVMe drive but haven't had much luck. What's even more confusing is that a 2nd node with the same CPU/Motherboard (but different NVMe) upgraded to PVE8 without any issues.
Yes, that is strange (and important information). Do both motherboards have the same BIOS version and settings? Since Proxmox runs fine on the other (identical) system, I guess that rules out Proxmox, and it is probably a hardware problem...
 
So I did some more debugging: I added debug ignore_loglevel nomodeset to the kernel command line in the GRUB config and tried to boot the host with the opt-in 6.2 kernel on PVE7. By recording the initial boot screens, I noticed that some data was being written to the systemd journal, and I managed to retrieve that boot log with journalctl.
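A minimal sketch of appending those parameters, for anyone who wants to reproduce this. It assumes a GRUB-booted install where the kernel command line is set through GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, and it relies on GNU sed. add_kernel_params is a hypothetical helper name, not a Proxmox tool.

```shell
#!/bin/sh
# Hypothetical helper: append kernel parameters to the existing
# GRUB_CMDLINE_LINUX_DEFAULT="..." line of a GRUB defaults file.
# Assumes the parameters contain no "/" characters (they would clash
# with the sed delimiter) and that the line uses double quotes.
add_kernel_params() {
    grub_file=$1
    shift
    params=$*
    sed -i -E "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\")(.*)\"/\1\2 $params\"/" "$grub_file"
}
```

On the host this would be called as root, for example `add_kernel_params /etc/default/grub debug ignore_loglevel nomodeset`, followed by `update-grub`. After a failed boot, `journalctl --list-boots` shows which boots were recorded and `journalctl -b -1 -o short-full` prints the previous one, provided the journal is persistent.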

Here's a pastebin dump of the boot log with the 6.2 kernel: https://pastebin.com/Sh8v6dcf

TBH I find this log really puzzling. The boot messages are identical between the 6.2 kernel and the 5.15 kernel right up to this line:

Code:
Fri 2023-10-27 15:54:56 +08 proxcore02 systemd-sysusers[278]: Group sasl already exists.

At that point, with the 6.2 kernel I see the following message:

Code:
Fri 2023-10-27 15:54:56 +08 proxcore02 systemd-journald[265]: Time spent on flushing to /var/log/journal/75c1aa1cb6554b8b99a0397fdf52361c is 12.105ms for 1216 entries.

The log stops there.

With the 5.15 kernel, the log continues after the "Group sasl already exists." message, as expected.
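To see what the working kernel does next at the point where the 6.2 log ends, something like the sketch below could help. It assumes both boots were saved to text files first (the file names in the usage note are made up), and lines_after is a hypothetical helper that relies on GNU grep.

```shell
#!/bin/sh
# Hypothetical helper: print the N lines (default 10) that follow the
# first occurrence of a fixed string in a saved boot log.
lines_after() {
    marker=$1
    logfile=$2
    count=${3:-10}
    # -F: treat the marker as a literal string, -m 1: stop at the first
    # match, -A: include trailing context; tail drops the marker line itself.
    grep -F -m 1 -A "$count" -- "$marker" "$logfile" | tail -n +2
}
```

For example, after saving the good boot with `journalctl -b -o short-full > boot-5.15.log`, running `lines_after 'Group sasl already exists.' boot-5.15.log 20` lists the units and drivers that start next on 5.15, which are the first candidates for whatever takes the machine down on 6.2.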
 