Random reboot - HW Failure or something else

Feb 28, 2018
We just purchased a SuperMicro SuperServer 6029p-tr and installed Proxmox on it. It ran OK for about a week and then rebooted randomly in the middle of the night. I then made the following adjustments:

1. updated the BIOS
2. adjusted ZFS to reduce ARC_MAX to 8GB and set primarycache=metadata and secondarycache=metadata on rpool/swap
3. set vm.swappiness to 10
4. ran SuperDiag HW diag tool
5. double-checked the UPS and generator
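
For anyone wanting to reproduce steps 2 and 3, here is a rough sketch of the equivalent commands. It assumes the ARC limit is set through the `zfs_arc_max` module parameter and that the swap zvol is `rpool/swap` as named above. The `run` wrapper and the `APPLY` variable are hypothetical helpers for this example: by default nothing is changed and the commands are only printed.

```shell
# Dry run by default: commands are printed, not executed.
# Set APPLY=1 (as root) to actually run them. "run" and APPLY are
# illustrative helpers for this sketch, not part of any package.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# 8 GiB expressed in bytes, which is the unit zfs_arc_max expects
arc_max_bytes=$((8 * 1024 * 1024 * 1024))

# Step 2: cache only metadata for the swap zvol
run zfs set primarycache=metadata rpool/swap
run zfs set secondarycache=metadata rpool/swap

# Step 3: lower swappiness for the running system
run sysctl -w vm.swappiness=10

# Line to place in /etc/modprobe.d/zfs.conf so the ARC limit persists
echo "options zfs zfs_arc_max=${arc_max_bytes}"
```

With root on ZFS the initramfs usually has to be regenerated (`update-initramfs -u`) before the modprobe option takes effect at boot, and `vm.swappiness` needs an entry under `/etc/sysctl.d/` to survive a reboot.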

Yet it did it again about a week later :( The output of the last command below shows it was definitely an ungraceful shutdown at 01:31 AM: two reboot entries in a row with no shutdown entry between them. I'm not seeing anything in the kernel or system logs. I mean nothing. Is there any way to know definitively whether this is HW or whether something is configured incorrectly?

IDK what to do at this point. I'm kind of at a loss.... Any help would be appreciated :)

last command
root@pve:~# last -n10 -x shutdown reboot
reboot system boot 4.13.13-6-pve Fri Mar 16 12:55 still running
shutdown system down 4.13.13-6-pve Fri Mar 16 11:49 - 12:55 (01:06)
reboot system boot 4.13.13-6-pve Fri Mar 16 01:31 - 11:49 (10:18)
reboot system boot 4.13.13-6-pve Mon Mar 12 13:56 - 11:49 (3+21:53)
shutdown system down 4.13.13-6-pve Mon Mar 12 13:35 - 13:56 (00:21)
reboot system boot 4.13.13-6-pve Mon Mar 12 13:15 - 13:35 (00:19)
shutdown system down 4.13.13-6-pve Mon Mar 12 13:14 - 13:15 (00:01)
reboot system boot 4.13.13-6-pve Mon Mar 12 10:24 - 13:14 (02:49)
shutdown system down 4.13.13-6-pve Mon Mar 12 10:18 - 10:24 (00:06)
reboot system boot 4.13.13-6-pve Mon Mar 12 10:13 - 10:18 (00:05)
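
The "two reboot entries in a row" pattern can also be picked out mechanically. This is a minimal sketch that assumes the `last -x shutdown reboot` layout shown above (newest entry first, kernel version in the fourth column). `flag_unclean_boots` is a hypothetical helper name.

```shell
# Read "last -x shutdown reboot" output on stdin (newest first) and print
# the start time of every boot that has no shutdown record directly before
# it, i.e. a boot that followed an ungraceful stop.
flag_unclean_boots() {
    awk '
        $1 == "reboot" {
            # The previous (newer) boot is still waiting for its shutdown
            # record, but we hit an older boot first: flag the newer one.
            if (pending != "") print pending
            pending = $5 " " $6 " " $7 " " $8
            next
        }
        $1 == "shutdown" { pending = ""; next }
    '
}

# Usage on a live system:
#   last -x shutdown reboot | flag_unclean_boots
```

The oldest boot in the listing is never flagged, because the record that would precede it may simply have rotated out of wtmp.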


kern.log
Mar 15 13:01:37 pve pvedaemon[43253]: <root@pam> successful auth for user 'root@pam'
Mar 15 13:16:38 pve pvedaemon[101049]: <root@pam> successful auth for user 'root@pam'
Mar 15 13:31:39 pve pvedaemon[43253]: <root@pam> successful auth for user 'root@pam'
Mar 16 01:31:30 pve kernel: [ 0.000000] random: get_random_bytes called from start_kernel+0x42/0x501 with crng_init=0
Mar 16 01:31:30 pve kernel: [ 0.000000] Linux version 4.13.13-6-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u0)) #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100) ()
Mar 16 01:31:30 pve kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.13.13-6-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Mar 16 01:31:30 pve kernel: [ 0.000000] KERNEL supported cpus:
Mar 16 01:31:30 pve kernel: [ 0.000000] Intel GenuineIntel

syslog.1
Mar 16 01:27:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 16 01:27:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 16 01:28:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 16 01:28:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 16 01:29:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 16 01:29:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 16 01:31:30 pve systemd-modules-load[2791]: Inserted module 'iscsi_tcp'
Mar 16 01:31:30 pve kernel: [ 0.000000] random: get_random_bytes called from start_kernel+0x42/0x501 with crng_init=0
Mar 16 01:31:30 pve systemd-modules-load[2791]: Inserted module 'ib_iser'
Mar 16 01:31:30 pve kernel: [ 0.000000] Linux version 4.13.13-6-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u0)) #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100) ()
Mar 16 01:31:30 pve systemd-modules-load[2791]: Inserted module 'vhost_net'
Mar 16 01:31:30 pve kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.13.13-6-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Mar 16 01:31:30 pve kernel: [ 0.000000] KERNEL supported cpus:
Mar 16 01:31:30 pve kernel: [ 0.000000] Intel GenuineIntel
Mar 16 01:31:30 pve systemd-udevd[2889]: Process '/bin/mount -t fusectl fusectl /sys/fs/fuse/connections' failed with exit code 32.
Mar 16 01:31:30 pve kernel: [ 0.000000] AMD AuthenticAMD
Mar 16 01:31:30 pve kernel: [ 0.000000] Centaur CentaurHauls
Mar 16 01:31:30 pve systemd[1]: Starting Flush Journal to Persistent Storage...

SuperDiag Results
Copyright(c) 1993-2018 Super Micro Computer, Inc.
Execution Time : 16:00:24 03/16/2018
MB Name : X11DPi-NT
MB Serial Number: OM178S024019
[Component Detection]
Start Time: 16:01:01 03/16/2018
Result: Passed
Total Type Count: 11, Passed Count: 11, Failed Count: 0
[Component Diagnostics]
Start Time: 16:01:47 03/16/2018
Result: Passed
Total Type Count: 10, Passed Count: 10, Failed Count: 0
Overall Result: Passed


kernel info

root@pve:~# uname -a
Linux pve 4.13.13-6-pve #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100) x86_64 GNU/Linux
 
Both restarts are in the wee hours of the morning? Is this when the system is mostly idle, or do you have a load running on it?
 
If you haven't explicitly set up the kernel to reboot on panic, then it would normally hang rather than reboot for most hardware faults (with a trace on console). If that server has a management device, you can check the log entries (SEL) using ipmitool. If not, reboot and enter the BIOS to see if anything has been flagged.

In my experience, mysterious reboots are most often due to failing power supplies.
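
Both checks mentioned above can be done from a shell. This is a sketch only, assuming `ipmitool` is installed and the BMC is reachable through the local interface. `describe_panic_setting` is a hypothetical helper that spells out what the `kernel.panic` sysctl value means.

```shell
# Interpret a kernel.panic value: 0 means loop forever (hang), a positive
# value is the delay in seconds before rebooting, a negative value means
# reboot immediately.
describe_panic_setting() {
    if [ "$1" -eq 0 ]; then
        echo "hang on panic"
    elif [ "$1" -gt 0 ]; then
        echo "reboot $1 s after panic"
    else
        echo "reboot immediately on panic"
    fi
}

# Current setting on this host
if [ -r /proc/sys/kernel/panic ]; then
    describe_panic_setting "$(cat /proc/sys/kernel/panic)"
fi

# System Event Log from the BMC, only attempted if ipmitool is present
if command -v ipmitool >/dev/null 2>&1; then
    ipmitool sel elist || echo "could not read the SEL"
fi
```

If the first check reports a hang on panic and the box still comes back up on its own, a plain kernel panic becomes a less likely explanation.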
 
It's actually mostly idle, only a few VMs on it since it's so new. CPU temps are amazingly low.
CPU1 Temp Normal 28 degrees C
CPU2 Temp Normal 27 degrees C
PCH Temp Normal 47 degrees C
System Temp Normal 26 degrees C
Peripheral Temp Normal 38 degrees C
MB_10G Temp Normal 48 degrees C
VRMCpu1 Temp Normal 33 degrees C
VRMCpu2 Temp Normal 33 degrees C
VRMP1ABC Temp Normal 36 degrees C
VRMP1DEF Temp Normal 29 degrees C
VRMP2ABC Temp Normal 30 degrees C
VRMP2DEF Temp Normal 35 degrees C
FAN1 Normal 3400 R.P.M
FAN2 Normal 3400 R.P.M
FAN3 Normal 3400 R.P.M

Usage Stats
CPU usage: 0.21% of 32 CPU(s)
IO delay: 0.00%
Load average: 0.20, 0.10, 0.03
RAM usage: 22.20% (20.66 GiB of 93.08 GiB)
 
The IPMI has nothing in the logs :( I have redundant PSUs; would one of them failing cause the whole system to reboot? Or am I just so unlucky that both of them are failing?
 
There is a known - but very very rare - issue with some Intel CPUs where upon reaching the lowest power states it cannot "restart", the CPU looks dead and the system will watchdog out. In these cases the CPU just stops and there is no "kernel panic" which leaves no opportunity for the Kernel to do any logging or traceback.

The faults "look" like you lost power - which leads to lots of troubleshooting of PSUs and MB VRMs...

The fact that your restarts happen when "mostly idle" and that there is no record of the fault suggests this might be in play (though this is highly speculative).

What are your BIOS settings under CPU settings for P-state/C-state control?
 
That does sound like your bases are covered on the power side, unless the failover/detection circuit is faulty. You can test this by removing and reinserting the power supplies one at a time to confirm that power transfers properly.
 
Well, the whole CPU power config is set to Energy Efficient. I changed it to Custom so I could look at the settings.

CPU P State
SpeedStep: Enable
EIST PSD Function: HW_ALL
Turbo Mode: Enabled

CPU C State
Autonomous Core C-State: Disable
CPU C6 report: Auto
Enhanced Halt State(C1E): Enable

CPU T State
Software Controlled T-States: Enable

Also Power Performance Tuning is set to "OS Controls EPB". I am way out of my comfort zone with this CPU power mgmt stuff.
 
Quick and dirty test: disable SpeedStep. You don't want to leave it this way permanently because your idle-power usage will go through the roof, but if your halt/reboots go away you'll have some more evidence that this was probably the issue.

Longer term - turn SpeedStep back on, but find someone to help you use the "cpupower" Linux command to disable the deeper idle states (C-states). Disabling the deepest one (C6) may be enough.
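
As a starting point for the longer-term approach, the idle states the kernel exposes can be listed from sysfs before anything is disabled. This is a minimal sketch, assuming the standard cpuidle sysfs layout. `list_idle_states` is a hypothetical helper, and the state index in the commented `cpupower` line is a placeholder to replace with whatever index the listing shows for C6 on your machine.

```shell
# Print "index name disabled=flag" for each idle state of one CPU.
# $1: cpuidle directory, defaults to cpu0's.
list_idle_states() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        [ -r "$state/name" ] || continue
        echo "${state##*state} $(cat "$state/name") disabled=$(cat "$state/disable")"
    done
}

list_idle_states

# The same information via cpupower, plus disabling one state by index
# (left commented out so pasting this changes nothing):
#   cpupower idle-info
#   cpupower idle-set -d 3     # 3 is a placeholder index
#   cpupower idle-set -E       # re-enable all idle states
```

Changes made with `cpupower idle-set` do not survive a reboot. A persistent alternative, if the `intel_idle` driver is in use, is a kernel command line option such as `intel_idle.max_cstate=1`, at the cost of higher idle power.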
 