Tips for diagnosing the cause of a host reboot?

surfrock66

Active Member
Feb 10, 2020
I have a 3-node cluster, and one node rebooted unexpectedly today. I went back through the syslog and the reboot is logged, but I can't tell whether that entry was written BEFORE the reboot or AFTER it. I want to figure out why this is happening so I can stop it. I think this is the second random reboot this week.

This is an R6525 running PVE 8.2.7. Here's the syslog at the time:

Code:
Nov 21 07:51:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:52:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:53:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 78
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 35
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 78
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 31 to 33
Nov 21 07:54:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
-- Reboot --
Nov 21 07:58:34 sr66-prox-03 kernel: Linux version 6.8.12-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) ()
Nov 21 07:58:34 sr66-prox-03 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=UUID=4fbd2c0b-dcd7-44d9-9139-495d8f107f19 ro quiet
Nov 21 07:58:34 sr66-prox-03 kernel: KERNEL supported cpus:
Nov 21 07:58:34 sr66-prox-03 kernel:   Intel GenuineIntel
Nov 21 07:58:34 sr66-prox-03 kernel:   AMD AuthenticAMD
Nov 21 07:58:34 sr66-prox-03 kernel:   Hygon HygonGenuine
Nov 21 07:58:34 sr66-prox-03 kernel:   Centaur CentaurHauls
Nov 21 07:58:34 sr66-prox-03 kernel:   zhaoxin   Shanghai 
Nov 21 07:58:34 sr66-prox-03 kernel: BIOS-provided physical RAM map:

More diagnostic output:

Code:
root@sr66-prox-03:~# last -x | head | tac
runlevel (to lvl 5)   6.8.12-2-pve     Sat Oct 12 09:31 - 07:45 (19+22:13)
root     pts/0        10.4.3.131       Sat Oct 12 09:32 - 15:11  (05:39)
reboot   system boot  6.8.12-2-pve     Fri Nov  1 07:40   still running
runlevel (to lvl 5)   6.8.12-2-pve     Fri Nov  1 07:45 - 06:55 (17+00:10)
reboot   system boot  6.8.12-2-pve     Mon Nov 18 06:54   still running
runlevel (to lvl 5)   6.8.12-2-pve     Mon Nov 18 06:55 - 07:59 (3+01:04)
root     pts/0        10.4.3.131       Tue Nov 19 08:21 - 13:05  (04:43)
reboot   system boot  6.8.12-2-pve     Thu Nov 21 07:58   still running
runlevel (to lvl 5)   6.8.12-2-pve     Thu Nov 21 07:59   still running
root     pts/0        10.4.3.131       Thu Nov 21 08:20   still logged in
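
One possible way to read this output: `last -x` normally also records a `shutdown` entry when the system goes down cleanly, and none appears between the boots listed here. That would be consistent with a hard reset rather than an orderly OS shutdown, though it is a hint and not proof. A minimal sketch that flags such boots, assuming the input is in oldest-first order (as after `tac`); `unclean_boots` is a hypothetical helper name:

```shell
# Read `last -x` output in chronological order (oldest first) on stdin and
# print every "reboot" record that has no "shutdown" record since the
# previous boot. The first boot in the input is skipped because its
# history is not visible.
unclean_boots() {
    awk '
        $1 == "shutdown" { clean = 1 }
        $1 == "reboot" {
            if (seen_boot && !clean) print
            seen_boot = 1
            clean = 0
        }
    '
}
```

It could be run as `last -x | tac | unclean_boots`, without `head`, so the whole wtmp history is checked.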

I've got nothing in my iDRAC logs indicating a hardware-initiated reboot. The hardware system event log only shows a known maintenance event on 11/12 (installing a second power supply; this is a new server and the supply arrived later). The lifecycle log has this entry, which makes me think the reboot was OS-initiated:

1732206343403.png


I'm stumped. It looks like the host OS triggered the reboot, but I need to prove it, find out why, and figure out how to stop it. Any advice is appreciated.
 
What do you mean, the HA watchdog? I'm not sure; I didn't enable anything intentionally. HA did fence the node successfully and migrate the VMs, which kept the outage to about 10 minutes while they came back up. That still isn't great.
 
I got the notification from another node that it was trying to fence at 7:57, which, going by the timestamps, was most likely after the failure.

1732210463279.png
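
If the HA watchdog is the suspect, the cluster daemons' messages from the minutes before the reset are probably the place to look. A minimal sketch, assuming syslog-style lines like the excerpt earlier in this post and the usual Proxmox daemon names (`corosync`, `pmxcfs`, `pve-ha-lrm`, `pve-ha-crm`, `watchdog-mux`); `ha_lines` is a hypothetical helper name:

```shell
# Keep only log lines written by the cluster/HA daemons.
# Assumes the syslog layout "Mon DD HH:MM:SS host process[pid]: message",
# so the process tag is the fifth whitespace-separated field.
ha_lines() {
    awk '$5 ~ /^(corosync|pmxcfs|pve-ha-lrm|pve-ha-crm|watchdog-mux)\[[0-9]+\]:$/'
}
```

Assuming the journal is persistent, so the previous boot is still available, it could be used as `journalctl -b -1 --no-pager | ha_lines | tail -n 50` to see what those daemons logged just before the reboot.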
 
