I have a 3-node cluster, and one node rebooted unexpectedly today. I went back through the syslog and found the reboot logged, but I can't tell whether that entry was written BEFORE the reboot or AFTER it. I want to figure out why this is happening so I can stop it. I think this is the second random reboot this week.
This is an R6525 running PVE 8.2.7. Here's the syslog from around the time of the reboot:
Code:
Nov 21 07:51:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:52:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:53:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 78
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 35
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 78
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 31 to 33
Nov 21 07:54:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
-- Reboot --
Nov 21 07:58:34 sr66-prox-03 kernel: Linux version 6.8.12-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) ()
Nov 21 07:58:34 sr66-prox-03 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=UUID=4fbd2c0b-dcd7-44d9-9139-495d8f107f19 ro quiet
Nov 21 07:58:34 sr66-prox-03 kernel: KERNEL supported cpus:
Nov 21 07:58:34 sr66-prox-03 kernel: Intel GenuineIntel
Nov 21 07:58:34 sr66-prox-03 kernel: AMD AuthenticAMD
Nov 21 07:58:34 sr66-prox-03 kernel: Hygon HygonGenuine
Nov 21 07:58:34 sr66-prox-03 kernel: Centaur CentaurHauls
Nov 21 07:58:34 sr66-prox-03 kernel: zhaoxin Shanghai
Nov 21 07:58:34 sr66-prox-03 kernel: BIOS-provided physical RAM map:
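On the before-or-after question: in this kind of listing, everything above the `-- Reboot --` marker was written by the previous boot, and everything below it by the new one. As a minimal sketch, the helper below (a hypothetical `lines_before_reboot` function, not part of any tool) prints only the last few lines logged before each marker, which is where an orderly shutdown would normally leave traces such as services stopping. It assumes the marker sits on a line by itself, exactly as in the listing above.

```shell
# Hypothetical helper: print the last N lines (default 5) that were logged
# before each "-- Reboot --" marker, followed by the marker itself.
# Reads log text on stdin; lines after the final marker are not printed.
lines_before_reboot() {
  awk -v n="${1:-5}" '
    /^-- Reboot --$/ {
      # Emit only the tail of what was buffered since the previous marker.
      start = (count > n) ? count - n : 0
      for (i = start; i < count; i++) print buf[i]
      print
      count = 0
      next
    }
    { buf[count++] = $0 }
  '
}

# Example on a live system (assumes the journal is persistent across boots):
#   journalctl --no-pager | lines_before_reboot 10
```

In the excerpt above, the last lines before the marker are routine snmpd and smartd messages with no shutdown sequence, which would be consistent with an abrupt reset rather than an orderly reboot, though it does not prove it.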
More diagnostic output:
Code:
root@sr66-prox-03:~# last -x | head | tac
runlevel (to lvl 5) 6.8.12-2-pve Sat Oct 12 09:31 - 07:45 (19+22:13)
root pts/0 10.4.3.131 Sat Oct 12 09:32 - 15:11 (05:39)
reboot system boot 6.8.12-2-pve Fri Nov 1 07:40 still running
runlevel (to lvl 5) 6.8.12-2-pve Fri Nov 1 07:45 - 06:55 (17+00:10)
reboot system boot 6.8.12-2-pve Mon Nov 18 06:54 still running
runlevel (to lvl 5) 6.8.12-2-pve Mon Nov 18 06:55 - 07:59 (3+01:04)
root pts/0 10.4.3.131 Tue Nov 19 08:21 - 13:05 (04:43)
reboot system boot 6.8.12-2-pve Thu Nov 21 07:58 still running
runlevel (to lvl 5) 6.8.12-2-pve Thu Nov 21 07:59 still running
root pts/0 10.4.3.131 Thu Nov 21 08:20 still logged in
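One thing this output can be checked for is `shutdown` records: with `-x`, `last` also prints shutdown entries, and an orderly reboot normally leaves one before the next `reboot` line. The sketch below, a hypothetical `unclean_boots` helper, reads `last -x` lines in chronological order and reports every boot that has no shutdown record since the previous boot. It assumes wtmp still covers the period in question, and it skips the oldest boot in the input because nothing before it is known.

```shell
# Hypothetical helper: flag boots with no "shutdown" record since the
# previous boot. Expects `last -x` lines on stdin, oldest first.
unclean_boots() {
  awk '
    $1 == "shutdown" { clean = 1; next }
    $1 == "reboot" {
      # The first boot seen has no history before it, so it is never flagged.
      if (seen && !clean) print "no shutdown record before boot: " $0
      seen = 1
      clean = 0
    }
  '
}

# Example on a live system (tac puts the oldest entries first):
#   last -x | tac | unclean_boots
```

Run against the ten lines shown above, it would flag the Nov 18 and Nov 21 boots, since no `shutdown` line appears anywhere in the listing.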
Code:
I've got nothing in my iDRAC logs indicating a hardware-initiated reboot. The hardware system event log only shows a known maintenance event on 11/12 (installing a second power supply; this is a new server and the second PSU arrived later). The lifecycle log has this entry, which makes me think it was OS initiated:
I'm stumped; it looks like the host OS triggered the host reboot, but I need to prove it, find why, and figure out how to stop it; any advice is appreciated.