Unexpected reboot last night

SamTzu

I can't explain this reboot on a KVM host last night.

Code:
Jun 09 00:24:01 vm2405 CRON[2014686]: (root) CMD (if [ $(date +%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi)
Jun 09 00:24:01 vm2405 zed[2014704]: eid=97 class=scrub_start pool='sdd'
Jun 09 00:24:04 vm2405 zed[2015034]: eid=99 class=scrub_start pool='vdd'
Jun 09 00:24:25 vm2405 zed[2015314]: eid=102 class=scrub_finish pool='sdd'
Jun 09 00:24:30 vm2405 CRON[2014685]: pam_unix(cron:session): session closed for user root
-- Reboot --
Jun 09 00:45:17 vm2405 kernel: Linux version 6.5.13-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-1 (2024-02-05T13:50Z) ()
Jun 09 00:45:17 vm2405 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.13-1-pve root=/dev/mapper/pve-root ro quiet
Jun 09 00:45:17 vm2405 kernel: KERNEL supported cpus:
Jun 09 00:45:17 vm2405 kernel:   Intel GenuineIntel
 
There is nothing in the log to go on. It could have been a power interruption (PSU or wall socket), and/or the logs could not be written to disk (drive or connector issue), or some hardware issue (memory, CPU, or motherboard). It could be as simple as a voltage drop caused by very active drives (the vdd scrub) combined with a worn-out (and/or hot) PSU.
Or something completely different. Unless you can reproduce it, it's impossible to test what it might be.
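A first thing to check is whether the journal simply stops mid-stream (typical of power loss) or shows an orderly shutdown. Something like this should tell, assuming journald keeps a persistent journal on that host:

Code:
# list the boots journald knows about
journalctl --list-boots
# tail of the previous boot (-1): the last messages written before the reboot
journalctl -b -1 -n 50

An orderly shutdown would show systemd stopping services at the end; a log that just cuts off points at power or hardware.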
 
@SamTzu Don't suppose the box that rebooted is a server with some form of BMC or similar controller?

If it is, that would generally have a record of any hardware problems/faults that caused a reboot.
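For example, if it has an IPMI-style BMC, something like this should dump the hardware event log (assuming ipmitool is installed and the local BMC interface is available):

Code:
# read the BMC's System Event Log; power faults, ECC errors, and thermal trips land here
ipmitool sel elist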

From a different angle, is the host part of a cluster? If there was some form of network problem causing the host to lose quorum, then it's possible it could have been rebooted by the watchdog.
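If it is clustered, something along these lines would show the quorum state and any HA/watchdog activity (pvecm and the pve-ha-* units ship with Proxmox VE; -b -1 reads the journal of the previous boot):

Code:
# current cluster membership and quorum status
pvecm status
# HA manager and watchdog activity from the boot before the reboot
journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux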
 
@SamTzu Don't suppose the box that rebooted is a server with some form of BMC or similar controller?

If it is, that would generally have a record of any hardware problems/faults that caused a reboot.
Good point!
From a different angle, is the host part of a cluster? If there was some form of network problem causing the host to lose quorum, then it's possible it could have been rebooted by the watchdog.
Would that not show up in the system log? I don't have much experience, and maybe the OP did not show all of the logs, but I would expect some information from this mechanism.
 
Would that not show up in the system log?
When I was testing the watchdog recently in a 2-node cluster, prior to learning how to use a QDevice, I was seeing watchdog reboots occur that didn't make it to the log on disk.

Using some kind of remote (off-host) syslog collector would probably help with that particular log collection case, but even then, if the watchdog reboots things fast enough, it might not. Not super sure. :confused:
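For reference, a minimal rsyslog forwarding rule looks roughly like this (the collector name and port are placeholders; @@ means TCP, a single @ would be UDP):

Code:
# /etc/rsyslog.d/90-remote.conf -- forward everything to an off-host collector
*.* @@logserver.example.com:514
# apply it with: systemctl restart rsyslog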
 
There are 5 KVM hosts on that server hardware.
vm2405 was the only Proxmox that rebooted.
All logs are fed to a log server, but unfortunately they do not reveal anything suspicious.
 
There are 5 KVM hosts on that server hardware.
Do you mean 5 VMs on Proxmox or 5 Proxmox on your server (using which hypervisor?)?
vm2405 was the only Proxmox that rebooted.
All logs are fed to a log server, but unfortunately they do not reveal anything suspicious.
Are we looking at the logs of a Proxmox host, or at the logs of a VM? I assumed you showed Proxmox logs and the system rebooted, but now it sounds like only a single VM rebooted. In case of the latter: check the Proxmox system log, for example for the Out Of Memory (OOM) killer terminating the VM.
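A quick way to check for that, assuming the journal from around the incident is still retained:

Code:
# search the kernel log for OOM killer activity
journalctl -k | grep -iE 'out of memory|oom-kill'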
 
Do you mean 5 VMs on Proxmox or 5 Proxmox on your server (using which hypervisor?)
I mean exactly what I wrote: 5 KVM hosts, most of them hosting Proxmox instances that nest LXC containers.

The log snippet I posted was copied from the Proxmox host that rebooted.
 
Do you mean 5 VMs on Proxmox or 5 Proxmox on your server (using which hypervisor?)
I mean exactly what I wrote: 5 KVM hosts, most of them hosting Proxmox instances that nest LXC containers.
That was not clear from the first post. Maybe check the logs from the KVM host that runs this Proxmox instance. Maybe it killed Proxmox for a reason (like OOM)?
Do you run the various Proxmoxes (on KVM hosts) as a cluster or as separate independent nodes?
The log snippet I posted was copied from the Proxmox host that rebooted.
Thank you for clearing that up. There is still no clue in that log, but it makes a hardware error unlikely (since the physical hardware did not reboot; only one KVM guest did). It does, however, raise the question of why that KVM process was restarted. Maybe you can find out by looking at the logging of your KVM hosts?
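What exactly to look at depends on how the outer KVM layer is managed. If it happens to be libvirt-based, for instance, each guest keeps its own QEMU log (the path below is the libvirt default, and the guest name vm2405 is just a guess):

Code:
# per-guest QEMU log on the outer host: unexpected exits of the VM process show up here
less /var/log/libvirt/qemu/vm2405.log
# or search the outer host's journal for the QEMU/KVM process
journalctl -b | grep -i qemu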
 
There are 5 KVM hosts on that server hardware.
Weirdly enough, that's actually an unclear statement.

Does that mean:
  1. You have 5 separate physical machines, and all five are running the same server hardware?
  2. You have 1 physical server, and you have 5 instances of Proxmox running as VMs (or similar) on that one server?
  3. Something else?
 
