Proxmox shutdown last night, again.

thusband · Jun 12, 2023

I have Proxmox running on an Intel NUC with just one VM for Home Assistant, and last night it shut down. It's done this a couple of times before for what appears to be no reason. It's been several weeks since the last time and I've never been able to determine the cause. The NUC's power seems to be on but Proxmox has shut down. I have to turn the NUC off and on to get Proxmox going again. Here's the Syslog from the time of shutdown. Can anything be determined from this?

Any hints greatly appreciated.


Jun 11 14:17:01 pve CRON[748312]: pam_unix(cron:session): session closed for user root
Jun 11 14:36:28 pve smartd[605]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 113 to 112
Jun 11 15:06:28 pve systemd[1]: Starting Cleanup of Temporary Directories...
Jun 11 15:06:28 pve systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Jun 11 15:06:28 pve systemd[1]: Finished Cleanup of Temporary Directories.
Jun 11 15:17:01 pve CRON[757801]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 11 15:17:01 pve CRON[757802]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 11 15:17:01 pve CRON[757801]: pam_unix(cron:session): session closed for user root
Jun 11 16:17:01 pve CRON[767221]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 11 16:17:01 pve CRON[767222]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 11 16:17:01 pve CRON[767221]: pam_unix(cron:session): session closed for user root
Jun 11 17:17:01 pve CRON[776417]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 11 17:17:01 pve CRON[776418]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 11 17:17:01 pve CRON[776417]: pam_unix(cron:session): session closed for user root
Jun 11 17:36:28 pve smartd[605]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 111
Jun 11 18:17:01 pve CRON[785613]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 11 18:17:01 pve CRON[785614]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 11 18:17:01 pve CRON[785613]: pam_unix(cron:session): session closed for user root
Jun 11 19:17:01 pve CRON[794885]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 11 19:17:01 pve CRON[794886]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 11 19:17:01 pve CRON[794885]: pam_unix(cron:session): session closed for user root
Jun 11 20:17:01 pve CRON[804248]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 11 20:17:01 pve CRON[804249]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 11 20:17:01 pve CRON[804248]: pam_unix(cron:session): session closed for user root
Jun 11 21:17:01 pve CRON[813556]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 11 21:17:01 pve CRON[813557]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 11 21:17:01 pve CRON[813556]: pam_unix(cron:session): session closed for user root
-- Reboot --
Jun 12 04:52:00 pve kernel: Linux version 5.15.107-2-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.107-2 (2023-05-10T09:10Z) ()
Jun 12 04:52:00 pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.107-2-pve root=/dev/mapper/pve-root ro quiet
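
A log like the one above can be checked mechanically for unclean shutdowns. The following is only a rough sketch: `check_reboots` is a hypothetical helper name, and it assumes that a graceful shutdown leaves a "Shutting down" or "Journal stopped" line before the `-- Reboot --` marker, which may not hold for every systemd version.

```shell
# Read journal text on stdin. For every "-- Reboot --" marker, print the
# last line logged before it and whether a clean-shutdown message was seen
# during that boot.
# ASSUMPTION: a graceful shutdown logs "Shutting down" or "Journal stopped";
# the exact wording can differ between systemd versions.
check_reboots() {
    awk '
        /Journal stopped|Shutting down/ { clean = 1 }
        /^-- Reboot --/ {
            printf "%s: %s\n", (clean ? "clean" : "UNCLEAN"), prev
            clean = 0
            next
        }
        { prev = $0 }
    '
}
```

It could be fed with something like `journalctl --since "2 days ago" | check_reboots`. For the log above it would report the 21:17:01 cron line as the last entry before an unclean stop.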

leesteken · Jun 12, 2023

There is nothing in the log to go on, except that it is definitely not a normal graceful shutdown.
Maybe Proxmox could not log the error because there was a problem with the drive (controller)? If so, you could try setting up remote logging to another system on your network.
It could be a hardware issue but replacing parts to find out will take very long if it only happens every few weeks. Maybe run a memtest to make sure? Maybe update BIOS to the latest version?
Could it be an external factor like a very short power interruption or main grid voltage drop?
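
As a minimal sketch of the remote-logging idea, assuming rsyslog is installed on the Proxmox host (it may not be on newer releases): the helper name `write_forward_rule`, the file name `90-remote.conf`, and the address `192.0.2.10` are placeholders for illustration, not values from this thread.

```shell
# Write an rsyslog rule that forwards all messages to a remote collector.
# ASSUMPTIONS: rsyslog is installed on the host; the address, port and file
# name used here are placeholders chosen for illustration.
write_forward_rule() {
    conf_dir=$1     # normally /etc/rsyslog.d
    host=$2         # e.g. 192.0.2.10 (a documentation address)
    port=${3:-514}  # default syslog port
    # "@@" means TCP in the legacy rsyslog syntax; a single "@" would be UDP.
    printf '*.* @@%s:%s\n' "$host" "$port" > "$conf_dir/90-remote.conf"
}
```

On the host this might be used as `write_forward_rule /etc/rsyslog.d 192.0.2.10 && systemctl restart rsyslog`, with a syslog receiver listening on the other machine. Messages that never reach the local disk would then still have a chance of arriving remotely.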

thusband · Jun 12, 2023

leesteken said:
There is nothing in the log to go on, except that it is definitely not a normal graceful shutdown.
Maybe Proxmox could not log the error because there was a problem with the drive (controller)? If so, you could try setting up remote logging to another system on your network.
It could be a hardware issue but replacing parts to find out will take very long if it only happens every few weeks. Maybe run a memtest to make sure? Maybe update BIOS to the latest version?
Could it be an external factor like a very short power interruption or main grid voltage drop?

Thanks, that's what I was afraid of. Same as last time when this happened a few months ago. I thought it was a one off thing since it hasn't happened again until last night.

If it was a power issue wouldn't I see that in other devices? Like a few Raspberry Pis I have and even the clock on the stove?

Good suggestions on the memtest and bios. I'm not sure how to do that but will investigate.

Again, thanks a lot.

leesteken · Jun 12, 2023

thusband said:
Thanks, that's what I was afraid of. Same as last time when this happened a few months ago. I thought it was a one off thing since it hasn't happened again until last night.

Maybe temperature plays a part as summer is approaching (in the Northern Hemisphere)? A crash once in a while might not seem like a big issue, but it can cause silent disk corruption (without ZFS or BTRFS) and that can cause more subtle problems.

thusband said:
If it was a power issue wouldn't I see that in other devices? Like a few Raspberry Pis I have and even the clock on the stove?

Only if the power went off. If it was just a dip (or other poor conditioning) then lower powered devices might not be impacted. Anyway, I assume that you do not use a UPS at the moment, which rules out a faulty UPS but not poor grid power (but you can judge this best for your area).

thusband said:
Good suggestions on the memtest and bios. I'm not sure how to do that but will investigate.

The Arch wiki has some tips on stress testing that might help to find a weak system component. And check on the Intel website for a BIOS update and flash instructions.

Dulcow · Sep 8, 2023

I have the exact same issue on a NUC12WSHi3. Really puzzling!

Sep 07 22:34:25 pve-nuc12-1 pmxcfs[1646]: [dcdb] notice: data verification successful
-- Reboot --
Sep 08 12:08:39 pve-nuc12-1 kernel: Linux version 6.2.16-12-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-12 (2023-09-04T13:21Z) ()

Have you found the root cause by any chance?

Cheers,

D.

thusband · Sep 8, 2023

Unfortunately I haven't. About a month ago someone on Reddit suggested it might be a power management thing and gave me some code to run in the Proxmox shell:

Code:

systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

shutdown -r now

I really thought it solved the shutdowns, but about a week ago it shut down again. I ran Memtest86 for 9 passes without any failures, so it's not memory. I've started to look around for another device to install Proxmox on (maybe a Beelink) as I don't think I'll ever get to the bottom of the problem on this NUC.

Dulcow · Sep 8, 2023

Hummm, it does look like an electric issue, the PSU shutting down (safety or something).

There is a heatwave at the moment in France, but still, temperatures are far from the limits (don't mind the silly ones for NVMe, my 990 Pro isn't recognised properly).

jensie · Nov 9, 2023

@thusband were you able to sort this out in the end ?

thusband · Nov 9, 2023

jensie said:
@thusband were you able to sort this out in the end ?

It looks like it's finally resolved. I went with a new USB powered drive. It hasn't shut down in a couple of months now.

jensie · Nov 10, 2023

thusband said:
It looks like it's finally resolved. I went with a new USB powered drive. It hasn't shut down in a couple of months now.

So basically what you are saying is that the errors had to do with your storage. Did you use ZFS? Did you have LXCs and VMs?

thusband · Nov 10, 2023

Well, I guess that's what I'm saying but I can't specifically pinpoint the HDD. The old drive wasn't USB powered, this one is. I don't use ZFS and only have one VM. No LXC.

Payee6908 · Feb 1, 2024

I'm having this problem too.
I have Proxmox installed on a NUC.
When multiple VMs are fired up, the thing overheats and turns off.

Are people saying a driver update to the NUC fixes this? Sounds like a Proxmox software bug, or perhaps something driver related?

swotai · Sep 9, 2024

I have a similar problem;
however, I don't think my box is shutting down due to temperature. The interesting thing is that it always shuts down exactly on Sunday morning, without fail. I've looked at the logs for a couple of weeks now. It's always just after pam_unix, as if something in those cron jobs is triggering a shutdown.

What's worse, the BIOS setting for the S0 state doesn't restart it afterwards.

gfngfn256 · Sep 9, 2024

swotai said:
The interesting thing is that it always shuts down exactly on Sunday morning

Is any specific backup/replication being run around that period?

Note that a ZFS scrub is scheduled (by default) to run on the second Sunday of each month (it starts at 00:24). Is your machine shutting down every Sunday morning, or is it just that whenever it shuts down, it is always a Sunday?
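
To test a crash date against that schedule, a small sketch like the following may help. `is_second_sunday` is a hypothetical helper name, and it assumes GNU `date` (for `-d` and `%-d`), which Debian-based systems such as Proxmox provide.

```shell
# Succeed if the given date (YYYY-MM-DD) is the second Sunday of its month.
# ASSUMPTION: GNU date is available (needed for -d and %-d).
is_second_sunday() {
    dow=$(date -d "$1" +%w)    # day of week, 0 = Sunday
    dom=$(date -d "$1" +%-d)   # day of month without zero padding
    # The second Sunday always falls on day 8 to 14 of the month.
    [ "$dow" -eq 0 ] && [ "$dom" -ge 8 ] && [ "$dom" -le 14 ]
}
```

For example, `is_second_sunday 2024-09-08 && echo "scrub day"` prints the message, while a date such as 2024-07-28 (also a Sunday, but the fourth one) does not match. A crash on a Sunday that is not the second of the month would point away from the scrub.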

swotai · Sep 10, 2024

gfngfn256 said:
Is any specific backup/replication being run around that period?

Note that a ZFS scrub is scheduled (by default) to run on the second Sunday of each month (it starts at 00:24). Is your machine shutting down every Sunday morning, or is it just that whenever it shuts down, it is always a Sunday?

I got it in late July. Ever since setup it has always shut down on a Sunday, usually around 7:17 in the morning or so, and then these last 2 Sundays it was 8:17 or so.

I am looking at the very first Sunday it shut down: it runs exactly until pam_unix. It's almost as if something is triggering a shutdown at that time. My Beelink N100 makes a beep sound when it shuts down, and even the BIOS setting for S0 to restart after power failure doesn't work.

I tried looking at crontab and systemctl timers, there's nothing in there...

Code:

Jul 28 04:17:01 home CRON[1002895]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 28 04:17:01 home CRON[1002896]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 28 04:17:01 home CRON[1002895]: pam_unix(cron:session): session closed for user root
Jul 28 04:20:52 home smartd[574]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Jul 28 04:35:52 home systemd[1]: Starting man-db.service - Daily man-db regeneration...
Jul 28 04:35:52 home systemd[1]: man-db.service: Deactivated successfully.
Jul 28 04:35:52 home systemd[1]: Finished man-db.service - Daily man-db regeneration.
Jul 28 05:17:01 home CRON[1012517]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 28 05:17:01 home CRON[1012518]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 28 05:17:01 home CRON[1012517]: pam_unix(cron:session): session closed for user root
Jul 28 05:50:52 home smartd[574]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Jul 28 06:17:01 home CRON[1022105]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 28 06:17:01 home CRON[1022106]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jul 28 06:17:01 home CRON[1022105]: pam_unix(cron:session): session closed for user root
Jul 28 06:20:52 home smartd[574]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Jul 28 06:25:01 home CRON[1023386]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jul 28 06:25:01 home CRON[1023387]: (root) CMD (test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.daily; })
Jul 28 06:25:01 home CRON[1023386]: pam_unix(cron:session): session closed for user root
-- Reboot --

Code:

NEXT                        LEFT          LAST                        PASSED        UNIT                         ACTIVATES                     
Tue 2024-09-10 00:00:00 PDT 1h 47min left Mon 2024-09-09 00:00:13 PDT 22h ago       dpkg-db-backup.timer         dpkg-db-backup.service
Tue 2024-09-10 00:00:00 PDT 1h 47min left Mon 2024-09-09 00:00:13 PDT 22h ago       logrotate.timer              logrotate.service
Tue 2024-09-10 02:04:53 PDT 3h 52min left Mon 2024-09-09 01:44:57 PDT 20h ago       pve-daily-update.timer       pve-daily-update.service
Tue 2024-09-10 03:50:21 PDT 5h 38min left Mon 2024-09-09 14:36:57 PDT 7h ago        apt-daily.timer              apt-daily.service
Tue 2024-09-10 06:59:31 PDT 8h left       Mon 2024-09-09 06:42:13 PDT 15h ago       apt-daily-upgrade.timer      apt-daily-upgrade.service
Tue 2024-09-10 08:21:17 PDT 10h left      Mon 2024-09-09 05:23:57 PDT 16h ago       man-db.timer                 man-db.service
Tue 2024-09-10 11:28:57 PDT 13h left      Mon 2024-09-09 11:28:57 PDT 10h ago       systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Sun 2024-09-15 03:10:33 PDT 5 days left   Sun 2024-09-08 03:10:51 PDT 1 day 19h ago e2scrub_all.timer            e2scrub_all.service
Mon 2024-09-16 00:48:54 PDT 6 days left   Mon 2024-09-09 00:40:57 PDT 21h ago       fstrim.timer                 fstrim.service

Code:

# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields,
# that none of the other crontabs do.

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed
17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly
25 6    * * *   root    test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.daily; }
47 6    * * 7   root    test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.weekly; }
52 6    1 * *   root    test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.monthly; }
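
To narrow a crontab like the one above down to Sunday-only entries, a rough filter could look like this. `sunday_jobs` is a hypothetical helper name, and the sketch deliberately ignores day-of-week lists and ranges such as `0,3` or `5-7`.

```shell
# Print crontab entries whose day-of-week field selects Sunday only
# (0, 7 or "sun"). Reads crontab text on stdin.
# Simplification: lists and ranges in the day-of-week field are not
# expanded, and "*" entries (which run every day) are not reported.
sunday_jobs() {
    awk '
        /^[ \t]*(#|$)/ { next }               # comment or blank line
        $1 ~ /=/       { next }               # VAR=value assignment
        tolower($5) ~ /^(0|7|sun)$/ { print } # fifth field is day of week
    '
}
```

Running `cat /etc/crontab /etc/cron.d/* | sunday_jobs` against the file shown above would print only the `cron.weekly` line scheduled for 06:47, so the contents of `/etc/cron.weekly` might be worth a look as well.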

gfngfn256 · Sep 10, 2024

swotai said:
my Beelink N100 makes a beep sound when it shuts down

I assume a graceful shutdown does not include a beep. Almost certainly a HW issue. Check the usual suspects on mini PCs: thermals, RAM, storage & PSU.

swotai · Wednesday at 06:12

Thermals seem fine. I've talked to Beelink and followed their instructions to clean out the RAM slot. Not sure about storage. How should I test it? As for the PSU, I have to see if I have one that is compatible... but the adapter doesn't even get hot usually.

gfngfn256 · Wednesday at 11:44

Check your backup schedule in /etc/pve/jobs.cfg to see if some backup job coincides with this period. (Some backups may take a long time).

Check replication status with pvesr status (assuming you use replication).

swotai said:
Not sure about storage

What storage is it using? Size/type/brand.

What (if any) USB ports are being used? Power draw/thermals etc.

Use stress-ng to stress-test CPU, RAM, I/O & HDD.

Good luck.
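
A possible starting point for such a stress run is sketched below. It assumes stress-ng is installed (`apt install stress-ng`); the wrapper name `run_stress`, the worker counts, and the default duration are illustrative choices, not recommendations from this thread.

```shell
# Run a combined CPU, memory and disk stress test for a fixed time.
# ASSUMPTIONS: stress-ng is installed; worker counts and the default
# duration are illustrative and not tuned for any machine discussed here.
# With DRY_RUN=1 the command is only printed, not executed.
run_stress() {
    duration=${1:-10m}
    # --cpu 0 starts one CPU worker per online CPU; --hdd writes temporary
    # files in the current directory.
    set -- stress-ng --cpu 0 --vm 2 --vm-bytes 75% --hdd 1 \
        --timeout "$duration" --metrics-brief
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling `run_stress 30m` from a directory on the disk under test, while watching temperatures in a second terminal, may show whether load alone can reproduce the shutdown.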

swotai · Sunday at 22:03

Thanks @gfngfn256 for your advice, nothing so far...

What's curious is this. Yesterday (Saturday), out of curiosity, I turned the machine off (pressed the power button and shut it down completely) and then turned it back on. My hypothesis was that if it's thermal or otherwise hardware related, and thus tied to how long the machine has been up, then the next unexpected shutdown would be next Saturday. But guess what, this morning (Sunday) it shut down again at about the same time:

This is the Saturday shutdown triggered by me.

And then this morning the same behavior.

This has strongly led me to believe that something in config or cron or scheduled task is triggering this.

To answer some questions:
1) Backup schedule: there's nothing in `/etc/pve/jobs.cfg`; in fact, no such file exists.
2) Replication: I have not set up replication, there's nothing in `/etc/pve/replication.cfg`, and `pvesr status` turns up nothing.
```
root@home:/etc/pve# pvesr status
JobID Enabled Target LastSync NextSync Duration FailCount State
```
3) No USB ports in use, just power cable + LAN cable.
4) Storage: the Foresee 512GB SSD that came with the Beelink machine. It might be the weak point. However, I don't think it's the culprit here because of the above behavior.
5) Let me try stress-ng.

swotai · Sunday at 22:09

I don't suppose anything in a VM can trigger this, right?
