Tips for diagnosing the cause of a host reboot?

surfrock66

I have a 3-node cluster, and one node rebooted unexpectedly today. I went back through the syslog and found the reboot, but I can't tell whether the last entries were logged BEFORE the reboot was triggered or only AFTER it. I want to figure out why this is happening so I can stop it. I think this is the second unexpected reboot this week.

This is an R6525 running PVE 8.2.7. Here's the syslog at the time:

Code:
Nov 21 07:51:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:52:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:53:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 78
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 35
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 78
Nov 21 07:54:15 sr66-prox-03 smartd[3340]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 31 to 33
Nov 21 07:54:21 sr66-prox-03 snmpd[4034]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
-- Reboot --
Nov 21 07:58:34 sr66-prox-03 kernel: Linux version 6.8.12-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) ()
Nov 21 07:58:34 sr66-prox-03 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=UUID=4fbd2c0b-dcd7-44d9-9139-495d8f107f19 ro quiet
Nov 21 07:58:34 sr66-prox-03 kernel: KERNEL supported cpus:
Nov 21 07:58:34 sr66-prox-03 kernel:   Intel GenuineIntel
Nov 21 07:58:34 sr66-prox-03 kernel:   AMD AuthenticAMD
Nov 21 07:58:34 sr66-prox-03 kernel:   Hygon HygonGenuine
Nov 21 07:58:34 sr66-prox-03 kernel:   Centaur CentaurHauls
Nov 21 07:58:34 sr66-prox-03 kernel:   zhaoxin   Shanghai 
Nov 21 07:58:34 sr66-prox-03 kernel: BIOS-provided physical RAM map:

More diagnostic output:

Code:
root@sr66-prox-03:~# last -x | head | tac
runlevel (to lvl 5)   6.8.12-2-pve     Sat Oct 12 09:31 - 07:45 (19+22:13)
root     pts/0        10.4.3.131       Sat Oct 12 09:32 - 15:11  (05:39)
reboot   system boot  6.8.12-2-pve     Fri Nov  1 07:40   still running
runlevel (to lvl 5)   6.8.12-2-pve     Fri Nov  1 07:45 - 06:55 (17+00:10)
reboot   system boot  6.8.12-2-pve     Mon Nov 18 06:54   still running
runlevel (to lvl 5)   6.8.12-2-pve     Mon Nov 18 06:55 - 07:59 (3+01:04)
root     pts/0        10.4.3.131       Tue Nov 19 08:21 - 13:05  (04:43)
reboot   system boot  6.8.12-2-pve     Thu Nov 21 07:58   still running
runlevel (to lvl 5)   6.8.12-2-pve     Thu Nov 21 07:59   still running
root     pts/0        10.4.3.131       Thu Nov 21 08:20   still logged in

I've got nothing in my iDRAC logs indicating a hardware-initiated reboot. The hardware system event log only shows a known maintenance event on 11/12 (installing a second power supply; this is a new server and the second supply arrived later). The lifecycle log has this entry, which makes me think the reboot was OS-initiated:

1732206343403.png


I'm stumped. It looks like the host OS triggered the reboot, but I need to prove it, find out why, and figure out how to stop it. Any advice is appreciated.
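
One way to approach the BEFORE/AFTER question is to look at the last few lines logged ahead of each `-- Reboot --` marker and check whether they include clean-shutdown messages. The sketch below is only an illustration: the function names are made up, and the two shutdown patterns (`systemd[1]: Shutting down.` and `Journal stopped`) are assumptions about what a normal systemd shutdown usually logs, not something confirmed in this thread.

```shell
#!/bin/sh
# Print the last N lines that precede each "-- Reboot --" marker in a
# log dump read from stdin. N defaults to 5.
tail_before_reboot() {
    n="${1:-5}"
    awk -v n="$n" '
        /^-- Reboot --$/ {
            # Emit the buffered lines oldest first, then the marker.
            start = (count > n) ? count - n : 0
            for (i = start; i < count; i++) print buf[i % n]
            print "-- Reboot --"
            count = 0
            next
        }
        # Ring buffer holding the most recent n lines of the current boot.
        { buf[count % n] = $0; count++ }
    '
}

# Return 0 if the excerpt on stdin contains a clean-shutdown marker.
# Both patterns are assumptions about typical systemd messages.
looks_like_clean_shutdown() {
    grep -Eq 'systemd\[1\]: Shutting down\.|systemd-journald\[[0-9]+\]: Journal stopped'
}
```

For example, `journalctl -b -1 -n 50 | looks_like_clean_shutdown || echo "no clean shutdown logged"` checks the end of the previous boot, assuming the journal is persistent. If no shutdown messages appear, as in the excerpt above where the log jumps from snmpd noise straight to the kernel banner, the reset probably did not go through a normal `systemctl reboot` path.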
 
What do you mean, the HA watchdog? I'm not sure; I didn't enable anything intentionally. HA did fence the node successfully and migrate the VMs, which kept the outage to about 10 minutes while they came back up. That still isn't great.
 
Another node notified me that it was trying to fence this node at 7:57. Per the timestamps, that is most likely after the failure: the last entry before the reboot is at 07:54:21 and the kernel boots again at 07:58:34.

1732210463279.png
 
I had another unexpected reboot today, but I don't see the cause. Fencing dealt with my guests. Below are the log entries from the point where things start to look irregular; the iDRAC logged no faults with storage or anything else.

Code:
Nov 27 19:33:50 sr66-prox-03 kernel: sd 0:0:19:0: [sda] tag#974 BRCM Debug mfi stat 0x2d, data len requested/completed 0x2000/0x0
Nov 27 19:33:51 sr66-prox-03 Server_Administrator[3329]: 3329 2405 - Storage Service  Severity: Warning, Category: Storage, MessageID: PDR98, Message: Command timeout occurred on Physical Disk 0:1:0 on Controller 0 at Connector 0 at Enclosure 1.No SCSI sense data received.
Nov 27 19:34:22 sr66-prox-03 watchdog-mux[3370]: client watchdog expired - disable watchdog updates
Nov 27 19:34:23 sr66-prox-03 watchdog-mux[3370]: exit watchdog-mux with active connections
Nov 27 19:34:23 sr66-prox-03 pve-ha-crm[4366]: got unexpected error - 'domain-ha'-locked command timed out - aborting
Nov 27 19:34:23 sr66-prox-03 pve-ha-crm[4366]: loop take too long (64 seconds)
Nov 27 19:34:23 sr66-prox-03 systemd-journald[908]: Received client request to sync journal.
Nov 27 19:34:23 sr66-prox-03 kernel: watchdog: watchdog0: watchdog did not stop!
Nov 27 19:34:23 sr66-prox-03 systemd[1]: watchdog-mux.service: Deactivated successfully.
Nov 27 19:34:23 sr66-prox-03 pveproxy[1952003]: proxy detected vanished client connection
Nov 27 19:34:23 sr66-prox-03 systemd[1]: watchdog-mux.service: Consumed 20.271s CPU time.
Nov 27 19:34:23 sr66-prox-03 pveproxy[1950952]: proxy detected vanished client connection
-- Reboot --

smartctl for sda shows no errors or issues after a long test. If anything, it looks like some SMART data is not getting to the OS, and that is triggering the watchdog to do something? I am stumped and really don't want to keep having random fencing events.
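
The `watchdog-mux` lines in the excerpt above are worth isolating, since on a node with HA active an expired watchdog is one plausible reason for a reset that leaves no shutdown messages. The following is a minimal sketch that filters a log excerpt down to HA and watchdog lines; the function names are hypothetical, and the process names in the pattern are taken from the log above plus `corosync`, which is an assumption about what else may be relevant.

```shell
#!/bin/sh
# Filter a log excerpt on stdin down to lines from the HA stack and the
# kernel watchdog driver.
ha_watchdog_lines() {
    grep -E 'watchdog-mux\[[0-9]+\]|pve-ha-(crm|lrm)\[[0-9]+\]|corosync\[[0-9]+\]|kernel: watchdog:'
}

# Return 0 if the excerpt shows the watchdog client timing out,
# using the message text seen in the log above.
watchdog_expired() {
    grep -q 'client watchdog expired'
}
```

Something like `journalctl -b -1 | ha_watchdog_lines` would show whether the previous boot ended the same way, and `ha-manager status` shows whether any HA resources are configured on the cluster.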
 
I had another unexpected reboot overnight, and at this point I have to remove this node from my environment. That being said, I don't think I have a hardware issue; I believe this is an OS-initiated reboot, and I don't know where it comes from. I thought it could be something from the iDRAC, but the iDRAC logs don't support this. It looks like the OS gets a REBOOT signal and just randomly reboots. I captured the logs starting from the last relevant event, which was the daily dpkg database backup at midnight, but nothing out of the ordinary shows up:

Code:
Nov 30 00:00:13 sr66-prox-03 systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Nov 30 00:00:13 sr66-prox-03 systemd[1]: Starting logrotate.service - Rotate log files...
Nov 30 00:00:13 sr66-prox-03 systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Nov 30 00:00:13 sr66-prox-03 systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Nov 30 00:00:13 sr66-prox-03 systemd[1]: Reloading pveproxy.service - PVE API Proxy Server...
Nov 30 00:00:14 sr66-prox-03 pveproxy[666545]: send HUP to 4567
Nov 30 00:00:14 sr66-prox-03 pveproxy[4567]: received signal HUP
Nov 30 00:00:14 sr66-prox-03 pveproxy[4567]: server closing
Nov 30 00:00:14 sr66-prox-03 pveproxy[4567]: server shutdown (restart)
Nov 30 00:00:14 sr66-prox-03 systemd[1]: Reloaded pveproxy.service - PVE API Proxy Server.
Nov 30 00:00:14 sr66-prox-03 systemd[1]: Reloading spiceproxy.service - PVE SPICE Proxy Server...
Nov 30 00:00:14 sr66-prox-03 spiceproxy[666547]: send HUP to 4573
Nov 30 00:00:14 sr66-prox-03 spiceproxy[4573]: received signal HUP
Nov 30 00:00:14 sr66-prox-03 spiceproxy[4573]: server closing
Nov 30 00:00:14 sr66-prox-03 spiceproxy[4573]: server shutdown (restart)
Nov 30 00:00:14 sr66-prox-03 systemd[1]: Reloaded spiceproxy.service - PVE SPICE Proxy Server.
Nov 30 00:00:14 sr66-prox-03 pvefw-logger[362627]: received terminate request (signal)
Nov 30 00:00:14 sr66-prox-03 systemd[1]: Stopping pvefw-logger.service - Proxmox VE firewall logger...
Nov 30 00:00:14 sr66-prox-03 pvefw-logger[362627]: stopping pvefw logger
Nov 30 00:00:15 sr66-prox-03 systemd[1]: pvefw-logger.service: Deactivated successfully.
Nov 30 00:00:15 sr66-prox-03 systemd[1]: Stopped pvefw-logger.service - Proxmox VE firewall logger.
Nov 30 00:00:15 sr66-prox-03 systemd[1]: pvefw-logger.service: Consumed 8.070s CPU time.
Nov 30 00:00:15 sr66-prox-03 systemd[1]: Starting pvefw-logger.service - Proxmox VE firewall logger...
Nov 30 00:00:15 sr66-prox-03 pvefw-logger[666559]: starting pvefw logger
Nov 30 00:00:15 sr66-prox-03 systemd[1]: Started pvefw-logger.service - Proxmox VE firewall logger.
Nov 30 00:00:15 sr66-prox-03 spiceproxy[4573]: restarting server
Nov 30 00:00:15 sr66-prox-03 spiceproxy[4573]: starting 1 worker(s)
Nov 30 00:00:15 sr66-prox-03 spiceproxy[4573]: worker 666561 started
Nov 30 00:00:15 sr66-prox-03 systemd[1]: logrotate.service: Deactivated successfully.
Nov 30 00:00:15 sr66-prox-03 systemd[1]: Finished logrotate.service - Rotate log files.
Nov 30 00:00:15 sr66-prox-03 pveproxy[4567]: Using '/etc/pve/local/pveproxy-ssl.pem' as certificate for the web interface.
Nov 30 00:00:15 sr66-prox-03 pveproxy[4567]: restarting server
Nov 30 00:00:15 sr66-prox-03 pveproxy[4567]: starting 3 worker(s)
Nov 30 00:00:15 sr66-prox-03 pveproxy[4567]: worker 666566 started
Nov 30 00:00:15 sr66-prox-03 pveproxy[4567]: worker 666567 started
Nov 30 00:00:15 sr66-prox-03 pveproxy[4567]: worker 666568 started
Nov 30 00:00:20 sr66-prox-03 spiceproxy[362629]: worker exit
Nov 30 00:00:20 sr66-prox-03 spiceproxy[4573]: worker 362629 finished
Nov 30 00:00:20 sr66-prox-03 pveproxy[658512]: worker exit
Nov 30 00:00:20 sr66-prox-03 pveproxy[663769]: worker exit
Nov 30 00:00:20 sr66-prox-03 pveproxy[658302]: worker exit
Nov 30 00:00:20 sr66-prox-03 pveproxy[4567]: worker 658302 finished
Nov 30 00:00:20 sr66-prox-03 pveproxy[4567]: worker 663769 finished
Nov 30 00:00:20 sr66-prox-03 pveproxy[4567]: worker 658512 finished
Nov 30 00:00:23 sr66-prox-03 sshd[666601]: Accepted publickey for root from 10.10.2.30 port 53464 ssh2: RSA SHA256:bOb80qL7lGV8+7AzxFBbixu5QP/drdEUAmPZ//ih5Y8
Nov 30 00:00:23 sr66-prox-03 sshd[666601]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Nov 30 00:00:23 sr66-prox-03 systemd[1]: Created slice user-0.slice - User Slice of UID 0.
Nov 30 00:00:23 sr66-prox-03 systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Nov 30 00:00:23 sr66-prox-03 systemd-logind[3377]: New session 204 of user root.
Nov 30 00:00:23 sr66-prox-03 systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Nov 30 00:00:23 sr66-prox-03 systemd[1]: Starting user@0.service - User Manager for UID 0...
Nov 30 00:00:23 sr66-prox-03 (systemd)[666604]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Queued start job for default target default.target.
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Created slice app.slice - User Application Slice.
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Reached target paths.target - Paths.
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Reached target timers.target - Timers.
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Listening on dirmngr.socket - GnuPG network certificate management daemon.
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Reached target sockets.target - Sockets.
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Reached target basic.target - Basic System.
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Reached target default.target - Main User Target.
Nov 30 00:00:23 sr66-prox-03 systemd[666604]: Startup finished in 215ms.
Nov 30 00:00:23 sr66-prox-03 systemd[1]: Started user@0.service - User Manager for UID 0.
Nov 30 00:00:23 sr66-prox-03 systemd[1]: Started session-204.scope - Session 204 of User root.
Nov 30 00:00:23 sr66-prox-03 sshd[666601]: pam_env(sshd:session): deprecated reading of user environment enabled
Nov 30 00:00:24 sr66-prox-03 sshd[666601]: Received disconnect from 10.10.2.30 port 53464:11: disconnected by user
Nov 30 00:00:24 sr66-prox-03 sshd[666601]: Disconnected from user root 10.10.2.30 port 53464
Nov 30 00:00:24 sr66-prox-03 sshd[666601]: pam_unix(sshd:session): session closed for user root
Nov 30 00:00:24 sr66-prox-03 systemd-logind[3377]: Session 204 logged out. Waiting for processes to exit.
Nov 30 00:00:24 sr66-prox-03 systemd[1]: session-204.scope: Deactivated successfully.
Nov 30 00:00:24 sr66-prox-03 systemd-logind[3377]: Removed session 204.
Nov 30 00:00:28 sr66-prox-03 sshd[666629]: Accepted publickey for root from 10.10.2.30 port 59338 ssh2: RSA SHA256:bOb80qL7lGV8+7AzxFBbixu5QP/drdEUAmPZ//ih5Y8
Nov 30 00:00:28 sr66-prox-03 sshd[666629]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Nov 30 00:00:28 sr66-prox-03 systemd-logind[3377]: New session 206 of user root.
Nov 30 00:00:28 sr66-prox-03 systemd[1]: Started session-206.scope - Session 206 of User root.
Nov 30 00:00:28 sr66-prox-03 sshd[666629]: pam_env(sshd:session): deprecated reading of user environment enabled
Nov 30 00:00:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:00:35 sr66-prox-03 sshd[666629]: Received disconnect from 10.10.2.30 port 59338:11: disconnected by user
Nov 30 00:00:35 sr66-prox-03 sshd[666629]: Disconnected from user root 10.10.2.30 port 59338
Nov 30 00:00:35 sr66-prox-03 sshd[666629]: pam_unix(sshd:session): session closed for user root
Nov 30 00:00:35 sr66-prox-03 systemd[1]: session-206.scope: Deactivated successfully.
Nov 30 00:00:35 sr66-prox-03 systemd[1]: session-206.scope: Consumed 3.848s CPU time.
Nov 30 00:00:35 sr66-prox-03 systemd-logind[3377]: Session 206 logged out. Waiting for processes to exit.
Nov 30 00:00:35 sr66-prox-03 systemd-logind[3377]: Removed session 206.
Nov 30 00:00:36 sr66-prox-03 sshd[666711]: Accepted publickey for root from 10.10.2.30 port 51288 ssh2: RSA SHA256:bOb80qL7lGV8+7AzxFBbixu5QP/drdEUAmPZ//ih5Y8
Nov 30 00:00:36 sr66-prox-03 sshd[666711]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Nov 30 00:00:36 sr66-prox-03 systemd-logind[3377]: New session 207 of user root.
Nov 30 00:00:36 sr66-prox-03 systemd[1]: Started session-207.scope - Session 207 of User root.
Nov 30 00:00:36 sr66-prox-03 sshd[666711]: pam_env(sshd:session): deprecated reading of user environment enabled
Nov 30 00:00:36 sr66-prox-03 sshd[666711]: Received disconnect from 10.10.2.30 port 51288:11: disconnected by user
Nov 30 00:00:36 sr66-prox-03 sshd[666711]: Disconnected from user root 10.10.2.30 port 51288
Nov 30 00:00:36 sr66-prox-03 sshd[666711]: pam_unix(sshd:session): session closed for user root
Nov 30 00:00:36 sr66-prox-03 systemd[1]: session-207.scope: Deactivated successfully.
Nov 30 00:00:36 sr66-prox-03 systemd-logind[3377]: Session 207 logged out. Waiting for processes to exit.
Nov 30 00:00:36 sr66-prox-03 systemd-logind[3377]: Removed session 207.
Nov 30 00:00:47 sr66-prox-03 systemd[1]: Stopping user@0.service - User Manager for UID 0...
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Activating special unit exit.target...
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Stopped target default.target - Main User Target.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Stopped target basic.target - Basic System.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Stopped target paths.target - Paths.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Stopped target sockets.target - Sockets.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Stopped target timers.target - Timers.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Closed dirmngr.socket - GnuPG network certificate management daemon.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Removed slice app.slice - User Application Slice.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Reached target shutdown.target - Shutdown.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Finished systemd-exit.service - Exit the Session.
Nov 30 00:00:47 sr66-prox-03 systemd[666604]: Reached target exit.target - Exit the Session.
Nov 30 00:00:47 sr66-prox-03 systemd[1]: user@0.service: Deactivated successfully.
Nov 30 00:00:47 sr66-prox-03 systemd[1]: Stopped user@0.service - User Manager for UID 0.
Nov 30 00:00:47 sr66-prox-03 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Nov 30 00:00:47 sr66-prox-03 systemd[1]: run-user-0.mount: Deactivated successfully.
Nov 30 00:00:47 sr66-prox-03 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Nov 30 00:00:47 sr66-prox-03 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Nov 30 00:00:47 sr66-prox-03 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Nov 30 00:00:47 sr66-prox-03 systemd[1]: user-0.slice: Consumed 5.518s CPU time.
Nov 30 00:01:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:02:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:03:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:04:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:05:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:06:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:07:24 sr66-prox-03 smartd[3354]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 82
Nov 30 00:07:24 sr66-prox-03 smartd[3354]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 80 to 81
Nov 30 00:07:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:08:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:09:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:10:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:11:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:12:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:13:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:14:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:15:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Nov 30 00:16:29 sr66-prox-03 snmpd[4019]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
-- Reboot --
Nov 30 00:20:49 sr66-prox-03 kernel: Linux version 6.8.12-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-4 (2024-11-06T15:04Z) ()
Nov 30 00:20:49 sr66-prox-03 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-4-pve root=UUID=4fbd2c0b-dcd7-44d9-9139-495d8f107f19 ro quiet
Nov 30 00:20:49 sr66-prox-03 kernel: KERNEL supported cpus:
Nov 30 00:20:49 sr66-prox-03 kernel:   Intel GenuineIntel
Nov 30 00:20:49 sr66-prox-03 kernel:   AMD AuthenticAMD
Nov 30 00:20:49 sr66-prox-03 kernel:   Hygon HygonGenuine
Nov 30 00:20:49 sr66-prox-03 kernel:   Centaur CentaurHauls
Nov 30 00:20:49 sr66-prox-03 kernel:   zhaoxin   Shanghai 
Nov 30 00:20:49 sr66-prox-03 kernel: BIOS-provided physical RAM map:

The snmpd message is harmless per this thread: https://forum.proxmox.com/threads/s...-in-proc-net-snmp-237-224.146957/#post-677327

The iDRAC logs indicate this was initiated either by the power button (it wasn't) or by a user:
1733006427879.png

There is no communication with this node except from the other nodes. This happened at midnight while I was asleep, and I cannot figure out what is triggering these reboots. Any advice is appreciated.
 
I want to share what I found, as I think I have uncovered the root cause, although I still think there is an underlying issue. We were having reboots every 3 days or less; since discovering and resolving the problem below, I have had over 8 days of uptime.

This system has two 2.5G NICs, bonded in an LACP LAG that goes to my switch. As it turns out, the switch had a bad port that would not disconnect, but would randomly change its speed to 100M. I believe these speed changes caused the "pve-ha-crm[4366]: loop take too long" message that most often preceded the reboots.

I moved the LAG to 2 new ports on the switch, and both the link speed and the server have been stable since. That being said, I still think this behavior is a bug. While the speed change is significant (essentially 5G -> 200M), it should NOT result in fencing and a reboot with no description in the log. It should raise an alert, but it should not have been a reboot condition.
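
Since the speed drop raised no alert on its own, a small check of the negotiated speed on each bond member may help catch it earlier. This is a sketch under assumptions: the interface names `eno1` and `eno2` are hypothetical, the function name is made up, and it relies on the kernel exposing the negotiated speed in Mb/s at `/sys/class/net/<iface>/speed`.

```shell
#!/bin/sh
# Print "<iface> <speed>" for every listed interface whose negotiated speed
# (in Mb/s, as read from sysfs) differs from the expected value.
# Returns 1 if any interface is off. SYSFS_NET can be overridden for testing.
check_link_speed() {
    expected="$1"; shift
    root="${SYSFS_NET:-/sys/class/net}"
    rc=0
    for iface in "$@"; do
        # Reading the file fails when the link is down or the name is wrong.
        speed=$(cat "$root/$iface/speed" 2>/dev/null) || speed="unknown"
        if [ "$speed" != "$expected" ]; then
            echo "$iface $speed"
            rc=1
        fi
    done
    return "$rc"
}
```

Run from cron or a monitoring agent as, say, `check_link_speed 2500 eno1 eno2`, it prints nothing when both members are at 2.5G and lists any member that has fallen back to a lower speed. `/proc/net/bonding/<bond>` also reports a per-member speed and is worth checking by hand.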