Tips for diagnosing the cause of a host reboot?

surfrock66 · 2024-11-21T17:26:29+0100

I have a 3-node cluster, and had a random reboot today. I went back in syslog, and the reboot was logged, but I wasn't sure if this was a logged event BEFORE the reboot, or AFTER. I want to figure out why this is happening so I can stop it. I think this is the second random reboot this week.

This is a R6525 running PVE 8.2.7. Here's the syslog at the time:

Code:

Nov 21 07:51:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:52:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:53:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 78
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 35
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 78
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 31 to 33
Nov 21 07:54:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
-- Reboot --
Nov 21 07:58:34 sr66-prox-03 kernel: Linux version 6.8.12-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) ()
Nov 21 07:58:34 sr66-prox-03 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=UUID=4fbd2c0b-dcd7-44d9-9139-495d8f107f19 ro quiet
Nov 21 07:58:34 sr66-prox-03 kernel: KERNEL supported cpus:
Nov 21 07:58:34 sr66-prox-03 kernel:   Intel GenuineIntel
Nov 21 07:58:34 sr66-prox-03 kernel:   AMD AuthenticAMD
Nov 21 07:58:34 sr66-prox-03 kernel:   Hygon HygonGenuine
Nov 21 07:58:34 sr66-prox-03 kernel:   Centaur CentaurHauls
Nov 21 07:58:34 sr66-prox-03 kernel:   zhaoxin   Shanghai 
Nov 21 07:58:34 sr66-prox-03 kernel: BIOS-provided physical RAM map:

More diagnostic output:

Code:

root@sr66-prox-03:~# last -x | head | tac
runlevel (to lvl 5)   6.8.12-2-pve     Sat Oct 12 09:31 - 07:45 (19+22:13)
root     pts/0        10.4.3.131       Sat Oct 12 09:32 - 15:11  (05:39)
reboot   system boot  6.8.12-2-pve     Fri Nov  1 07:40   still running
runlevel (to lvl 5)   6.8.12-2-pve     Fri Nov  1 07:45 - 06:55 (17+00:10)
reboot   system boot  6.8.12-2-pve     Mon Nov 18 06:54   still running
runlevel (to lvl 5)   6.8.12-2-pve     Mon Nov 18 06:55 - 07:59 (3+01:04)
root     pts/0        10.4.3.131       Tue Nov 19 08:21 - 13:05  (04:43)
reboot   system boot  6.8.12-2-pve     Thu Nov 21 07:58   still running
runlevel (to lvl 5)   6.8.12-2-pve     Thu Nov 21 07:59   still running
root     pts/0        10.4.3.131       Thu Nov 21 08:20   still logged in

Code:

I've got nothing in my idrac logs indicating a hardware-initiated reboot or anything. My hardware system event log only shows a known maintenance event on 11/12 (installing a second power supply as this is a new server and that came in later). The lifecycle log has this entry, which makes me think it was OS initiated:

I'm stumped; it looks like the host OS triggered the host reboot, but I need to prove it, find why, and figure out how to stop it; any advice is appreciated.

alexskysilk · 2024-11-21T18:21:40+0100

do you have the watchdog enabled?

--edit of course you do. check pve logs for fencing notifications.

surfrock66 · 2024-11-21T18:23:35+0100

What do you mean, the HA watchdog? I'm not sure, I didn't enable anything intentionally, HA did fence the node successfully and migrate the VM's keeping the outage to about 10 minutes as they came back up, which isn't great.

alexskysilk · 2024-11-21T18:25:07+0100

so the question is, did the fencing notifications occur before or after the node rebooted? describe your networking setup.

surfrock66 · 2024-11-21T18:33:54+0100

I got the notification from another node that it was trying to fence at 7:57, meaning likely after failure per the timestamps.

surfrock66 · 2024-11-21T18:47:05+0100

I took the opportunity to upgrade to 8.3.0 as well, in case I'm hitting a bug.

Search

Search

Tips for diagnosing the cause of a host reboot?

surfrock66

Active Member

alexskysilk

Distinguished Member

surfrock66

Active Member

alexskysilk

Distinguished Member

surfrock66

Active Member

surfrock66

Active Member