Occasional node reboots - What to look for?

markfree · Aug 7, 2024

I'm seeing some occasional node reboots on my Proxmox mini pc, but I can't figure out why.
Out of the blue, I get a notification that my node and all of its VMs have been restarted.

When I checked the node's system log, I found the reboot message. However, there's no obvious message about an error that caused the reboot.
This is a snippet of the reported log at the time of reboot.

Bash:

Aug 07 14:41:05 mini smartd[1075]: Device: /dev/sda [SAT], is back in ACTIVE or IDLE mode, resuming checks (9 checks skipped)
Aug 07 14:41:05 mini smartd[1075]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 95 to 94
Aug 07 15:11:10 mini smartd[1075]: Device: /dev/sda [SAT], is in STANDBY mode, suspending checks
Aug 07 15:17:01 mini CRON[3311424]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 07 15:17:01 mini CRON[3311425]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 07 15:17:01 mini CRON[3311424]: pam_unix(cron:session): session closed for user root
Aug 07 15:18:06 mini pvedaemon[3255904]: worker exit
Aug 07 15:18:06 mini pvedaemon[1471]: worker 3255904 finished
Aug 07 15:18:06 mini pvedaemon[1471]: starting 1 worker(s)
Aug 07 15:18:06 mini pvedaemon[1471]: worker 3312253 started
Aug 07 15:20:57 mini pveproxy[3242387]: worker exit
Aug 07 15:20:57 mini pveproxy[1494]: worker 3242387 finished
Aug 07 15:20:57 mini pveproxy[1494]: starting 1 worker(s)
Aug 07 15:20:57 mini pveproxy[1494]: worker 3314204 started
-- Reboot --
Aug 07 15:23:22 mini kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Aug 07 15:23:22 mini kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.8-4-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on
Aug 07 15:23:22 mini kernel: KERNEL supported cpus:
Aug 07 15:23:22 mini kernel:   Intel GenuineIntel
Aug 07 15:23:22 mini kernel:   AMD AuthenticAMD
Aug 07 15:23:22 mini kernel:   Hygon HygonGenuine
Aug 07 15:23:22 mini kernel:   Centaur CentaurHauls
Aug 07 15:23:22 mini kernel:   zhaoxin   Shanghai 
Aug 07 15:23:22 mini kernel: BIOS-provided physical RAM map:

Does anyone have an idea what might be causing these reboots, or where I should look?
Thank you.

esi_y · Aug 7, 2024

markfree said:
I'm seeing some occasional node reboots on my Proxmox mini pc, but I can't figure out why.

It is sometimes helpful to mention the hardware in question (even put it in the title), as someone might have had the same hardware and know the answer right away in an otherwise "mysterious" case.

markfree said:

Bash:

Aug 07 14:41:05 mini smartd[1075]: Device: /dev/sda [SAT], is back in ACTIVE or IDLE mode, resuming checks (9 checks skipped)
Aug 07 14:41:05 mini smartd[1075]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 95 to 94
Aug 07 15:11:10 mini smartd[1075]: Device: /dev/sda [SAT], is in STANDBY mode, suspending checks
Aug 07 15:17:01 mini CRON[3311424]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 07 15:17:01 mini CRON[3311425]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 07 15:17:01 mini CRON[3311424]: pam_unix(cron:session): session closed for user root
Aug 07 15:18:06 mini pvedaemon[3255904]: worker exit
Aug 07 15:18:06 mini pvedaemon[1471]: worker 3255904 finished
Aug 07 15:18:06 mini pvedaemon[1471]: starting 1 worker(s)
Aug 07 15:18:06 mini pvedaemon[1471]: worker 3312253 started
Aug 07 15:20:57 mini pveproxy[3242387]: worker exit
Aug 07 15:20:57 mini pveproxy[1494]: worker 3242387 finished
Aug 07 15:20:57 mini pveproxy[1494]: starting 1 worker(s)
Aug 07 15:20:57 mini pveproxy[1494]: worker 3314204 started
-- Reboot --
Aug 07 15:23:22 mini kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Aug 07 15:23:22 mini kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.8-4-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on
Aug 07 15:23:22 mini kernel: KERNEL supported cpus:
Aug 07 15:23:22 mini kernel:   Intel GenuineIntel
Aug 07 15:23:22 mini kernel:   AMD AuthenticAMD
Aug 07 15:23:22 mini kernel:   Hygon HygonGenuine
Aug 07 15:23:22 mini kernel:   Centaur CentaurHauls
Aug 07 15:23:22 mini kernel:   zhaoxin   Shanghai
Aug 07 15:23:22 mini kernel: BIOS-provided physical RAM map:

Can you share the full journalctl output (using --since and --until to cover some of these eventful periods)?

Can you confirm that it's not, e.g., a faulty PSU (for instance by running a live system off USB for a while)?
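
One way to produce such an excerpt is to cut a time window around the first kernel line after a reboot (15:23:22 on Aug 07 in the snippet above). The sketch below assumes GNU date and a POSIX-like shell with `local` support; `window_around` is a hypothetical helper name, and the 30-minute margin is an arbitrary choice, not something from the thread.

```bash
# Print "START|END" timestamps spanning +/- N minutes around a given time.
# Assumption: GNU date (for -d and the @epoch syntax).
window_around() {
    local ts mins
    ts=$(date -d "$1" +%s) || return 1   # convert the timestamp to epoch seconds
    mins=${2:-30}                        # default margin: 30 minutes
    printf '%s|%s\n' \
        "$(date -d "@$((ts - mins * 60))" '+%Y-%m-%d %H:%M:%S')" \
        "$(date -d "@$((ts + mins * 60))" '+%Y-%m-%d %H:%M:%S')"
}

# Usage on the node (not run here):
#   w=$(window_around "2024-08-07 15:23:22" 30)
#   journalctl --since "${w%%|*}" --until "${w##*|}" --no-pager
```

The two timestamps are in the format that journalctl accepts for --since and --until, so the output can be passed through unchanged.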

markfree · Aug 8, 2024

I see... I appreciate your feedback.

CPU - AMD Ryzen 9 7940HS
32GB of RAM
Kernel - Linux 6.8.8-4-pve
pve-manager - 8.2.4

I could not find any anomalies in resource usage. This is the daily average.

I'm not sure how much log you want me to share, so I attached the journal from today up to an hour after the reboot.

Maybe the pc power... Not sure.
I'm not ruling out a power issue, but there were no power outages today and the pc is behind a working UPS.

esi_y · Aug 8, 2024

markfree said:
CPU - AMD Ryzen 9 7940HS

32GB of RAM

Kernel - Linux 6.8.8-4-pve

pve-manager - 8.2.4

markfree said:
I'm not sure how much log you want me to share, so I attached the journal from today up to an hour after the reboot.

Oh, I thought they were happening more often (so that more of them would be captured in a shorter period). Can you share one full boot at the end of which it happened (e.g. journalctl -b -1)?
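
For reference, `journalctl --list-boots` shows the available boot indexes, and `-b -1` selects the boot before the current one. A rough way to tell whether that boot ended with an orderly shutdown or simply stopped (as in the snippet in the first post, where the log goes straight from a pveproxy line to `-- Reboot --`) is to look for shutdown messages in its last lines. This is only a sketch: `classify_boot_end` is a hypothetical helper, and the matched strings are messages systemd typically logs during shutdown, so treat the pattern as an assumption to adjust for your system.

```bash
# Read journal text on stdin and print "clean" if the last lines contain
# typical systemd shutdown messages, otherwise "abrupt".
# Assumption: the pattern below covers the usual shutdown messages; exact
# wording can differ between systemd versions.
classify_boot_end() {
    if tail -n "${1:-50}" |
        grep -Eq 'Journal stopped|systemd-shutdown|Reached target.*(Reboot|Power-Off|Shutdown)'
    then
        echo clean
    else
        echo abrupt
    fi
}

# Usage on the node (not run here):
#   journalctl -b -1 --no-pager | classify_boot_end
```

An "abrupt" result does not identify the cause by itself; it only suggests the machine lost power or was hard-reset before systemd could log anything.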

markfree said:
Maybe the pc power... Not sure.
I'm not ruling out a power issue, but there were no power outages today and the pc is behind a working UPS.

If it were the power supply itself, you would not see any outages and your UPS might be fine, yet the supply could be randomly dropping voltage, which, obviously, would not show up in any log file.

How often do these happen? How long was it running without these issues before? Have you updated the kernel recently? Can you test with a previous one? Can you test with, e.g., a regular live-boot Debian (this rules out the exact same kernel and also the storage)?

Is smartctl -a /dev/nvme... (or whatever else is your boot drive) all fine?

markfree · Aug 10, 2024

Sorry. I wasn't very clear in my initial post.

Reboots occur occasionally, but I could not infer a periodicity.
The last reboot was the day I opened this thread. The previous reboot was on August 4th.
Before that, it only happened on July 27th.
So, there's a different span of days between reboots.
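
The gaps between the three reported reboots can be worked out with a small helper. This is a sketch assuming GNU date; `days_between` is a hypothetical name, and the year 2024 is taken from the post dates in this thread.

```bash
# Print the number of whole days from date $1 to date $2.
# Assumption: GNU date; -u avoids daylight-saving offsets in the subtraction.
days_between() {
    local a b
    a=$(date -u -d "$1" +%s) || return 1
    b=$(date -u -d "$2" +%s) || return 1
    echo $(( (b - a) / 86400 ))
}

days_between 2024-07-27 2024-08-04   # July 27th to August 4th
days_between 2024-08-04 2024-08-07   # August 4th to August 7th
```

With these dates the gaps come out as 8 and 3 days, which matches the impression that there is no fixed interval.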

The following graph shows this.

I'm attaching the journal from the previous reboot, from 08/04 to 08/07.

Also, the boot device does not seem to have issues.
This is its SMART information:

Bash:

root@mini:~# smartctl -a /dev/nvme0n1
(...)

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    3,548,409 [1.81 TB]
Data Units Written:                 1,810,730 [927 GB]
Host Read Commands:                 16,331,443
Host Write Commands:                82,323,145
Controller Busy Time:               2,194
Power Cycles:                       71
Power On Hours:                     698
Unsafe Shutdowns:                   43
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               46 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
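
One detail in this output may be worth a second look: the drive reports 43 unsafe shutdowns out of 71 power cycles. An NVMe drive counts an unsafe shutdown when it loses power without first receiving a shutdown notification, so this is consistent with abrupt power loss, although the counter is cumulative and does not say when those events happened. The sketch below pulls the two counters out of `smartctl -a` output so they can be compared before and after the next unexpected reboot; `unsafe_shutdown_summary` is a hypothetical helper name.

```bash
# Read `smartctl -a` output on stdin and summarize two NVMe counters.
# Assumption: the field labels match the output shown above.
unsafe_shutdown_summary() {
    awk -F: '
        /^Power Cycles/     { gsub(/[ ,]/, "", $2); cycles = $2 }
        /^Unsafe Shutdowns/ { gsub(/[ ,]/, "", $2); unsafe = $2 }
        END {
            if (cycles == "" || unsafe == "") { print "fields not found"; exit 1 }
            printf "%d of %d power cycles were unsafe shutdowns\n", unsafe, cycles
        }'
}

# Usage on the node (not run here):
#   smartctl -a /dev/nvme0n1 | unsafe_shutdown_summary
```

If the unsafe shutdown count goes up by one after a surprise reboot, the drive saw power disappear rather than an orderly reboot.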

I only update the systems every couple of months or so.
I don't recall if there was a kernel update the last time, but it's currently up to date.

I'll try to test the node with a live boot Debian and report back later.

markfree · Sep 29, 2024

I found the issue... and it was power, after all.

The UPS had a bad battery. So, when a voltage fluctuation occurred, it would not switch over to battery properly, causing the node to reboot.
After replacing the battery, the reboots stopped.
