Proxmox host crashing — best way to investigate?

May 23, 2025
Netherlands
Hi, my Proxmox installation was running OK. A while back I applied some updates, including a kernel update that took the system to proxmox-kernel-6.8 version 6.8.12-13. About 2-3 days later I noticed that the system (an Intel NUC) was down: power light on, but the system unreachable. The system was (and is) running Proxmox 8 on Debian 12.

Bash:
# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian

The version of Proxmox:

Bash:
# pveversion
pve-manager/8.4.14/b502d23c55afcba1 (running kernel: 6.8.12-15-pve)

Since then the system has been crashing roughly every 3-4 days, each time requiring a reboot (power cycle).

I am not stating categorically that the kernel update was/is the cause of the crash. I don't know at this time what the cause is (hence this post). Running journalctl is not instructive: all it shows is that the system was running (the last action was cron.hourly), followed by the entries for the next boot.

Bash:
Sep 25 01:17:01 pve CRON[728052]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 25 01:17:01 pve CRON[728053]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 25 01:17:01 pve CRON[728052]: pam_unix(cron:session): session closed for user root
-- Boot d90d657f01644ceca39e00bb9efd3909 --
Sep 25 09:41:03 pve kernel: Linux version 6.8.12-13-pve (build@proxmox) (gcc (Debian 12.2.0-14+deb12u1) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-13 (2025-07-22T10:00Z) ()
Sep 25 09:41:03 pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-13-pve root=/dev/mapper/pve-root ro quiet
Sep 25 09:41:03 pve kernel: KERNEL supported cpus:
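
For anyone wanting to reproduce this kind of view, here is a minimal sketch of pulling the tail of the previous boot out of the journal. It assumes the journal is persistent (which it appears to be here, since the earlier boot is still visible). The `run` helper and the `DRY_RUN` variable are illustrative names for this example only, not part of any tool; by default the script just prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: inspect the end of the previous boot in the systemd journal.
# DRY_RUN and run() are hypothetical helpers used only in this example.
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List the boots the journal knows about (0 = current, -1 = previous).
list_boots() {
    run journalctl --list-boots --no-pager
}

# Show the last N lines (default 50) logged during the previous boot.
prev_boot_tail() {
    run journalctl -b -1 -n "${1:-50}" --no-pager
}

# Show only kernel messages of priority "warning" or worse from the previous boot.
prev_boot_kernel_warnings() {
    run journalctl -b -1 -k -p warning --no-pager
}

list_boots
prev_boot_tail 50
prev_boot_kernel_warnings
```

Run it with `DRY_RUN=0` to actually query the journal. If the kernel died without writing anything to disk, these commands will show nothing beyond the last ordinary entries, which is consistent with the output above.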

My questions are:
  1. What is the best/recommended way to investigate hard crashes like this?
  2. Is it worthwhile installing any extra tools? I see there is a utility called crash, but I am wary of installing extra (kernel) tools like this.
  3. I had previously installed the lm-sensors package to monitor the CPU temperature. Could this be the cause of the problem? Is anyone using it (successfully)?
  4. Were/are there any known issues with kernel 6.8.12-13? I see that there was an update to 6.8.12-15, which I have applied.
Finally, I see that Proxmox has recently been upgraded to version 9, running on Debian 13. Is it worthwhile upgrading (early) to this new version?

Thanks in advance.
 
Finally, I see that Proxmox has recently been upgraded to version 9, running on Debian 13. Is it worthwhile upgrading (early) to this new version?
Upgrades are always worth a try to escape from trouble :) Of course, in your specific case, it reads as if it could also be a hardware problem, so nothing's guaranteed.

To debug the problem further, there are several options I can think of:
  • Attach a display to the system's video output, and see what the crashed system had to say on its virtual console (if anything)
  • Configure a serial console and see if that is still alive after the system hangs
  • Configure netconsole and set up a netconsole receiver on your LAN, which might be able to capture the crashing kernel's last messages (assuming the kernel is the actual source of the problem). Docs for setting that up are here: https://docs.kernel.org/networking/netconsole.html
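
To make the netconsole option a bit more concrete, here is a rough sketch of building the module parameter in the format described in the kernel docs linked above. Every address, port and interface name in it (192.168.1.10, 192.168.1.20, eno1) is a placeholder I made up, so substitute your own. The `run` helper and `DRY_RUN` variable are likewise just illustrative; by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: build a netconsole parameter and print the commands that would load it.
# All IPs, ports and the interface name are placeholders, not real values.
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Format per the kernel docs:
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# Arguments: src-port src-ip dev tgt-port tgt-ip [tgt-mac]
# An empty tgt-mac is allowed; the kernel then uses the broadcast address.
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "${6:-}"
}

PARAM="$(netconsole_param 6665 192.168.1.10 eno1 6666 192.168.1.20)"

# On the Proxmox host: raise the console log level so all kernel
# messages are forwarded, then load the module with the parameter.
run dmesg -n 8
run modprobe netconsole "netconsole=$PARAM"

# On the receiving machine, listen on the target UDP port, for example:
#   nc -u -l -p 6666      (some netcat variants want: nc -u -l 6666)
```

Loading the module this way does not survive a reboot, so for a crash that only shows up every few days you would want to make it persistent (for example via the kernel command line), and leave the receiver logging to a file.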