Proxmox shutting down for no apparent reason

thusband · Jul 11, 2023

I'm running an Intel NUC with Proxmox 8 and this shutting down has been happening randomly for the past several weeks. Maybe once a week it will just shut down and I'll have to power the NUC down and back on. There doesn't seem to be any indication why. I thought it might be a bad power supply so I bought a new one but that hasn't fixed it. I thought the upgrade to 8.0.3 might help but, again, the shutdowns continue. It seems in Syslog there's always something about temperature before it cuts off but I don't think the temp is out of line.

Is there somewhere or something, other than Syslog, I could look at to troubleshoot? The NUC is a few years old and was a refurb. If I were to buy a new device, would I be able to restore everything from the backups I've been making? Without knowing the cause of the shutdown, would I just be transferring the problem to the new device?

Any advice would be greatly appreciated.

Code:

Jul 10 10:17:01 pve CRON[1733773]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 10 10:17:01 pve CRON[1733774]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 10 10:17:01 pve CRON[1733773]: pam_unix(cron:session): session closed for user root
Jul 10 11:17:01 pve CRON[1742971]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 10 11:17:01 pve CRON[1742972]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 10 11:17:01 pve CRON[1742971]: pam_unix(cron:session): session closed for user root
Jul 10 11:47:56 pve smartd[555]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 113 to 112
Jul 10 12:17:01 pve CRON[1752187]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 10 12:17:01 pve CRON[1752188]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 10 12:17:01 pve CRON[1752187]: pam_unix(cron:session): session closed for user root
Jul 10 12:17:56 pve smartd[555]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 113
Jul 10 13:17:01 pve CRON[1761386]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 10 13:17:01 pve CRON[1761387]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 10 13:17:01 pve CRON[1761386]: pam_unix(cron:session): session closed for user root
Jul 10 13:47:56 pve smartd[555]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 113 to 112
Jul 10 14:17:01 pve CRON[1770616]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 10 14:17:01 pve CRON[1770617]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 10 14:17:01 pve CRON[1770616]: pam_unix(cron:session): session closed for user root
Jul 10 15:17:01 pve CRON[1779892]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 10 15:17:01 pve CRON[1779893]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 10 15:17:01 pve CRON[1779892]: pam_unix(cron:session): session closed for user root
Jul 10 15:17:56 pve smartd[555]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 113
Jul 10 15:33:01 pve pvedaemon[1257876]: <root@pam> successful auth for user 'root@pam'
-- Reboot --
Jul 11 06:00:46 pve kernel: Linux version 6.2.16-3-pve (tom@sbuild) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) ()

t.lamprecht · Jul 11, 2023

Did you upgrade the BIOS/firmware of your NUC to the latest version? From personal experience and from a few other threads here, that can resolve a few types of instabilities, especially with newer kernels.

If that doesn't help, or if you're already at the latest FW version, I'd recommend trying to capture more log info, as it's likely that more errors are logged but cannot be synced to disk before the host hangs completely.
One easy way to try this is to connect to the NUC via SSH from another host and run journalctl -f there. Sometimes the network stack keeps working, at least for longer than the regular disk sync interval, so one might manage to see the actual error there.
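
A minimal sketch of that approach, run from a second machine. The address root@pve, the log file name, and the function name are assumptions made for this example; adjust them to your setup. Piping through tee keeps a local copy of whatever the NUC manages to send before it hangs.

```shell
# Hypothetical defaults: replace root@pve with the NUC's user and address.
PVE_HOST="${PVE_HOST:-root@pve}"
LOG_FILE="${LOG_FILE:-pve-journal-$(date +%Y%m%d-%H%M%S).log}"

follow_remote_journal() {
    # journalctl -f streams new journal entries as they are written.
    # tee -a shows them on screen and appends them to a local file,
    # so the last lines survive even if the remote host freezes.
    ssh "$PVE_HOST" journalctl -f | tee -a "$LOG_FILE"
}

# Usage (leave it running until the next hang):
#   follow_remote_journal
```

After the next freeze, the last lines of the local log file are the ones worth reading. If the journal on the host is stored persistently, running journalctl -b -1 -e on the NUC after the reboot should also show the end of the previous boot's log.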

thusband · Jul 11, 2023

t.lamprecht said:
Did you upgrade the BIOS/firmware of your NUC to the latest version? From personal experience and from a few other threads here, that can resolve a few types of instabilities, especially with newer kernels.

If that doesn't help, or if you're already at the latest FW version, I'd recommend trying to capture more log info, as it's likely that more errors are logged but cannot be synced to disk before the host hangs completely.
One easy way to try this is to connect to the NUC via SSH from another host and run journalctl -f there. Sometimes the network stack keeps working, at least for longer than the regular disk sync interval, so one might manage to see the actual error there.

Thanks. I've never updated the BIOS on the NUC. That's probably going to involve setting up a keyboard, hooking up an HDMI cable to a monitor and a bunch of other stuff I'm not familiar with so I'll have to start digging into it.

cwt · Jul 11, 2023

Updating the BIOS on a NUC is fairly simple. Download the BIO file for your NUC from Intel and place it on a FAT32-formatted USB stick. Attach the stick, along with a keyboard and an HDMI cable, to your NUC, reboot it, press F7 several times during startup, and select your USB stick from the list. Confirm the update by pressing Enter and the NUC will start the update process.

thusband · Jul 11, 2023

cwt said:
Updating the BIOS on a NUC is fairly simple. Download the BIO file for your NUC from Intel and place it on a FAT32-formatted USB stick. Attach the stick, along with a keyboard and an HDMI cable, to your NUC, reboot it, press F7 several times during startup, and select your USB stick from the list. Confirm the update by pressing Enter and the NUC will start the update process.

Yes, thanks, I kind of wanted to avoid using the HDMI cable but maybe I'll have to.

thusband · Jul 23, 2023

Well, after updating the BIOS it ran fine until last night and then, again, it shut down, leaving no hint as to why. I've been playing around with some VMs, adding different Linux distros, but that shouldn't have caused the shutdown. Again, Syslog shows nothing. Isn't there some more detailed logging that would indicate where the problem is? It's so frustrating finding everything shut down first thing in the morning. Do I replace the NUC and start all over again?

edit: perhaps a detailed log from the Intel NUC? The Proxmox Syslog doesn't show anything. Are there more detailed logs from Proxmox?

Again, any help appreciated.

Olli G. · Sep 12, 2023

Hi, any news on that issue? I have the same problem, running on a Minisforum NAD9. Proxmox suddenly freezes. The last log entry was at 03:59. The Windows VM was still running and restarted at 05:47 (no idea why), and I had to power off again.

thusband · Sep 12, 2023

Unfortunately I haven't. About a month ago someone on Reddit suggested it might be a power management thing and gave me some commands to run in the Proxmox shell:

Code:

systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

shutdown -r now

I really thought it solved the shutdowns, but about a week ago it shut down again. I ran Memtest86 for 9 passes without any failures, so it's not memory. I've started to look around for another device to install Proxmox on (maybe a Beelink), as I don't think I'll ever get to the bottom of the problem on this NUC.

Olli G. · Sep 12, 2023

Yeah, a memtest is currently running over here too. I changed 2x16 to 2x32 and two days later the freeze occurred for the first time.

cwt · Sep 12, 2023

Maybe disabling C-States within the BIOS might be worth a try. I had several systems of different chipsets where it helped to eliminate freezes.
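
For reference, Linux exposes the idle states (C-states) it knows about under sysfs, so they can be inspected from the Proxmox shell as well. This is a minimal sketch assuming the standard cpuidle layout for cpu0; the function name and the CPUIDLE_DIR variable are made up for this example, and the states listed will differ between machines.

```shell
# Standard sysfs location of the per-CPU idle states (cpu0 used as a sample).
CPUIDLE_DIR="${CPUIDLE_DIR:-/sys/devices/system/cpu/cpu0/cpuidle}"

list_cstates() {
    for state in "$CPUIDLE_DIR"/state*; do
        # Skip silently if the directory does not exist on this system.
        [ -r "$state/name" ] || continue
        # "disable" is 0 when the state may be used, 1 when it is switched off.
        echo "$(basename "$state") $(cat "$state/name") disabled=$(cat "$state/disable")"
    done
}

# Usage:
#   list_cstates
```

If the firmware offers no C-state option, the documented kernel parameters intel_idle.max_cstate and processor.max_cstate are another way to cap the deepest state. Which value helps, if any, is something to test on the affected machine.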

thusband · Sep 12, 2023

cwt said:
Maybe disabling C-States within the BIOS might be worth a try. I had several systems of different chipsets where it helped to eliminate freezes.

I've never heard of C-States but searching for it tells me it could be the problem (I've tried everything else). I'll look into it.

Many thanks!

Olli G. · Sep 20, 2023

Any news? The BIOS of my NAD9 currently doesn't support that kind of setting...

thusband · Sep 21, 2023

Olli G. said:
Any news? The BIOS of my NAD9 currently doesn't support that kind of setting...

I haven't disabled C-states but my NUC hasn't shut down for a while. If it does I'll disable C-states.

I'm wondering if the good clean I did when I did the memtest had an effect. I opened it up and blasted some air, which loosened a bunch of dust.

Payee6908 · Feb 1, 2024

I'm having this problem too.
I have Proxmox installed on a NUC.
When multiple VMs are fired up, the thing overheats and turns off.

Are people saying a driver update to the NUC fixes this? Sounds like a Proxmox software bug, or driver-related perhaps?

t.lamprecht · Feb 2, 2024

More like the hardware requiring lots of thermal throttling due to not being designed for this kind of workload.

Some modern power-state driver, or low-power governors from the active driver (ACPI or the Intel p-state in this case) might help.
E.g., install the linux-cpupower package, check the output of cpupower frequency-info for the available cpufreq governors, and choose one that sounds like it would use less power, most likely powersave. It can then be set (for the current boot only) via: cpupower frequency-set -g powersave. Then test again to see whether that was enough.
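
Those steps could be wrapped roughly as follows. This is a sketch, not a tested recipe: the function names and the GOV_FILE variable are invented for this example, and it assumes the kernel lists the available governors in the usual sysfs file for cpu0. The cpupower call needs root and the linux-cpupower package installed.

```shell
# sysfs file listing the governors offered by the active cpufreq driver.
GOV_FILE="${GOV_FILE:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors}"

governor_available() {
    # Succeeds if governor $1 appears in the space-separated list.
    tr ' ' '\n' < "$GOV_FILE" | grep -qx "$1"
}

set_governor() {
    if governor_available "$1"; then
        # Applies to the current boot only, as described above.
        cpupower frequency-set -g "$1"
    else
        echo "governor '$1' is not offered by the active cpufreq driver" >&2
        return 1
    fi
}

# Usage (as root):
#   set_governor powersave
```

Checking the list first avoids asking cpupower for a governor the driver does not provide.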

Underclocking the CPU might help too; possibly there are also some settings in the firmware/BIOS for targeting low power (less heat) over performance. So would ensuring that the environmental temperature isn't too high, so that the NUC's limited cooling mechanism can still work.
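
To see whether heat really builds up before a shutdown, the kernel's thermal zone readings can be checked from the shell. A minimal sketch, assuming the standard /sys/class/thermal layout; the function name and the THERMAL_ROOT variable are made up for this example, and which zones exist depends on the hardware.

```shell
# Standard sysfs location of the kernel's thermal zones.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

print_zone_temps() {
    for zone in "$THERMAL_ROOT"/thermal_zone*; do
        # Skip silently if no zone is exposed on this system.
        [ -r "$zone/temp" ] || continue
        zone_type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        millideg=$(cat "$zone/temp")
        # sysfs reports millidegrees Celsius; convert to whole degrees.
        echo "$zone_type $((millideg / 1000)) C"
    done
}

# Usage:
#   print_zone_temps
```

Running it while several VMs are under load, and again when the host is idle, gives a rough picture of how close the CPU gets to its limits.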
