[SOLVED] Watchdog rebooted server at a random moment - how to debug?

alexc

Active Member
Apr 13, 2015
I'm using a Supermicro server with PVE 6.2. I set up the watchdog this way:

Code:
/etc/default/pve-ha-manager:
WATCHDOG_MODULE=ipmi_watchdog

Code:
/etc/modprobe.d/ipmi_watchdog.conf:
options ipmi_watchdog action=power_cycle panic_wdt_timeout=10

Code:
/etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"


Here is what I see in the output:

Code:
# ipmitool mc watchdog get

Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      9 sec
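
To keep an eye on the timer between incidents, the same output can be checked from a script. This is a minimal Python sketch that assumes the field names print exactly as above. The helper names and the "less than half the initial countdown left" threshold are hypothetical choices, not something ipmitool or PVE defines.

```python
import subprocess


def parse_watchdog_status(text: str) -> dict:
    """Turn 'Key:   value' lines from `ipmitool mc watchdog get` into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def seconds(value: str) -> int:
    """'10 sec' -> 10 (takes the leading integer)."""
    return int(value.split()[0])


def check_watchdog(text: str) -> list:
    """Return human-readable warnings; an empty list means the timer looks fed."""
    status = parse_watchdog_status(text)
    warnings = []
    if status.get("Watchdog Timer Is") != "Started/Running":
        warnings.append("watchdog timer is not running")
    initial = seconds(status.get("Initial Countdown", "0 sec"))
    present = seconds(status.get("Present Countdown", "0 sec"))
    # Hypothetical threshold: a healthy feeder keeps the countdown near its start.
    if initial and present < initial // 2:
        warnings.append(f"countdown low: {present}s of {initial}s left")
    return warnings


def read_ipmi_watchdog() -> str:
    """Run ipmitool (needs root and a reachable BMC) and return its raw output."""
    return subprocess.run(["ipmitool", "mc", "watchdog", "get"],
                          capture_output=True, text=True, check=True).stdout
```

On the host it could be called as `print(check_watchdog(read_ipmi_watchdog()) or "watchdog OK")`, for example from cron.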

The server ran perfectly for months, and today it rebooted at a random moment for no apparent reason.

Is there any way to debug what the root cause was?
 

mira

Proxmox Staff Member
Staff member
Aug 1, 2018
Check the syslog (/var/log/syslog[.X.gz]). Are there any corosync and watchdog messages?
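
A minimal Python sketch of that check, reading both the plain and the rotated `.gz` files. It assumes the default Debian location `/var/log/syslog*`; the function names are illustrative only.

```python
import gzip
import re
from pathlib import Path

# The two keywords asked about above; extend as needed.
PATTERN = re.compile(r"corosync|watchdog", re.IGNORECASE)


def rotation_index(path: Path) -> int:
    """syslog -> 0, syslog.1 -> 1, syslog.2.gz -> 2, ..."""
    m = re.match(r"syslog\.(\d+)", path.name)
    return int(m.group(1)) if m else 0


def open_log(path: Path):
    """Open plain or gzip-compressed logs as text, tolerating bad bytes."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", errors="replace")
    return path.open("r", errors="replace")


def grep_syslogs(log_dir: str = "/var/log", pattern=PATTERN) -> list:
    """Return matching lines from all syslog rotations, oldest file first."""
    files = sorted(Path(log_dir).glob("syslog*"), key=rotation_index, reverse=True)
    hits = []
    for path in files:
        with open_log(path) as fh:
            hits.extend(line.rstrip("\n") for line in fh if pattern.search(line))
    return hits
```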
 

alexc

Active Member
Apr 13, 2015
No corosync messages, and the watchdog is only mentioned once the reboot (boot) was already in progress, so grep returns only these:

Code:
Dec 15 11:49:07 hostname kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.34-1-pve root=/dev/mapper/pve-root ro quiet nmi_watchdog=0
Dec 15 11:49:07 hostname kernel: [    0.594703] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.34-1-pve root=/dev/mapper/pve-root ro quiet nmi_watchdog=0
Dec 15 11:49:07 hostname systemd[1]: Started Proxmox VE watchdog multiplexer.
Dec 15 11:49:07 hostname watchdog-mux[1479]: Loading watchdog module 'ipmi_watchdog'
Dec 15 11:49:07 hostname watchdog-mux[1479]: Watchdog driver 'IPMI', version 1

There are no traces of watchdog actions, warnings, or host hangs.
 

mira

Proxmox Staff Member
Staff member
Aug 1, 2018
Please provide the complete syslog from ~15 minutes before and after the reboot.
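
Cutting that window out by hand is tedious, so here is a small Python sketch. It assumes the classic `Mmm dd HH:MM:SS` prefix seen in the log lines in this thread, English month names (Python's default C locale), and a window that does not cross New Year. Syslog omits the year, so the caller has to supply one; lines without a parseable timestamp are skipped.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


def parse_syslog_time(line: str, year: int) -> Optional[datetime]:
    """Parse the classic 'Mmm dd HH:MM:SS' prefix; syslog omits the year."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def reboot_window(lines: Iterable[str], reboot: datetime,
                  minutes: int = 15) -> List[str]:
    """Keep lines stamped within +/- `minutes` of `reboot`."""
    start = reboot - timedelta(minutes=minutes)
    end = reboot + timedelta(minutes=minutes)
    kept = []
    for line in lines:
        ts = parse_syslog_time(line, reboot.year)
        # Lines without a parseable stamp (continuations, noise) are skipped.
        if ts is not None and start <= ts <= end:
            kept.append(line.rstrip("\n"))
    return kept
```

For example, `reboot_window(open("/var/log/syslog"), datetime(2020, 12, 15, 11, 49, 7))` centres the window on the first boot line; the year 2020 there is only a placeholder, since the thread does not state it.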
 

alexc

Active Member
Apr 13, 2015
As it turned out, the watchdog was right: the RAID card failed and I/O became so slow, erratic and stuck that the IPMI-based watchdog detected it as a hang and rebooted the server.
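
Here is a toy model of why a long stall looks like a hang to a hardware watchdog. The BMC counts down from the 10 s initial countdown shown in the `ipmitool` output, and if nothing resets it in time it performs the configured action (Power Cycle here). The feed interval, stall timings and function names are hypothetical, and this is not how watchdog-mux is actually implemented.

```python
from typing import List, Optional


def feed_schedule(interval: int, horizon: int,
                  stall_start: Optional[int] = None, stall_len: int = 0) -> List[int]:
    """Hypothetical feeder: resets the timer every `interval` s except during a stall."""
    times = []
    t = interval
    while t <= horizon:
        stalled = stall_start is not None and stall_start <= t < stall_start + stall_len
        if not stalled:
            times.append(t)
        t += interval
    return times


def first_expiry(feed_times: List[int], timeout: int, horizon: int) -> Optional[int]:
    """Return when the countdown hits zero, or None if it never does before `horizon`.

    The countdown starts at t=0 with `timeout` seconds and is reset on each feed.
    """
    deadline = timeout
    for t in feed_times:
        if t > deadline:
            return deadline  # gap was longer than the countdown: BMC acts here
        deadline = t + timeout
    return deadline if deadline <= horizon else None
```

With a 10 s countdown, a 5 s stall is absorbed, while a 15 s stall starting at t=20 lets the timer run out at t=29.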

Thank you anyway!
 

aaron

Proxmox Staff Member
Staff member
Jun 3, 2019
Thanks for sharing what the problem was! :)
 
