Kernel crashes on DL360

LnxBil · Feb 16, 2016

Hi,

I am writing this post for future reference. I found similar problems on the forums, but not exactly this one. I debugged a strange recurring crash on one of my DL360s with kernel 4.2.6-1-pve. The same crash happens on the stock Debian Jessie kernel, but never on older kernels like those of PVE 3.4 or Debian Wheezy.

The server had run for months, and after I rebooted it (it had not been rebooted for several kernel updates), the system no longer booted correctly and crashed almost immediately (if not during boot, then at most 2 minutes after the login prompt). I always use crashdump on all my systems, so I had 6 crash dumps yesterday, all showing the same NMI, with corresponding entries in iLO:

Code:

[  139.828501] NMI: PCI system error (SERR) for reason b1 on CPU 0.
[  139.828600] Kernel panic - not syncing: NMI: Not continuing
[  139.828686] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P          IO    4.2.6-1-pve #1
[  139.828839] Hardware name: HP ProLiant DL360 G6, BIOS P64 01/22/2015
[  139.828932]  0000000000000000 f9fd2d75cfd761c9 ffff88040fa05df8 ffffffff818013d8
[  139.829066]  0000000000000000 ffffffff81c8f9ab ffff88040fa05e78 ffffffff817fed2d
[  139.829200]  ffff880400000008 ffff88040fa05e88 ffff88040fa05e28 f9fd2d75cfd761c9
[  139.829333] Call Trace:
[  139.829373]  <NMI>  [<ffffffff818013d8>] dump_stack+0x45/0x57
[  139.829470]  [<ffffffff817fed2d>] panic+0xd0/0x20d
[  139.829544]  [<ffffffff81018cce>] pci_serr_error+0x7e/0x80
[  139.829626]  [<ffffffff81018eee>] default_do_nmi+0xfe/0x100
[  139.829709]  [<ffffffff81018fda>] do_nmi+0xea/0x140
[  139.829783]  [<ffffffff8180a651>] end_repeat_nmi+0x1a/0x1e
[  139.829905]  [<ffffffff814581ff>] ? intel_idle+0xcf/0x140
[  139.830009]  [<ffffffff814581ff>] ? intel_idle+0xcf/0x140
[  139.830089]  [<ffffffff814581ff>] ? intel_idle+0xcf/0x140
[  139.830169]  <<EOE>>  [<ffffffff8168b785>] cpuidle_enter_state+0xb5/0x220
[  139.830277]  [<ffffffff8168b927>] cpuidle_enter+0x17/0x20
[  139.830361]  [<ffffffff810bdbeb>] call_cpuidle+0x3b/0x70
[  139.830440]  [<ffffffff8168b903>] ? cpuidle_select+0x13/0x20
[  139.830524]  [<ffffffff810bdeb7>] cpu_startup_entry+0x297/0x360
[  139.830615]  [<ffffffff817f58bc>] rest_init+0x7c/0x80
[  139.830693]  [<ffffffff81f66029>] start_kernel+0x49a/0x4bb
[  139.830775]  [<ffffffff81f65120>] ? early_idt_handler_array+0x120/0x120
[  139.830873]  [<ffffffff81f654d7>] x86_64_start_reservations+0x2a/0x2c
[  139.831000]  [<ffffffff81f65623>] x86_64_start_kernel+0x14a/0x16d
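
When several crash dumps pile up, it can help to confirm that they all report the same NMI reason. A minimal sketch, assuming each dump directory contains a saved dmesg text file; the helper name serr_reasons and the path in the usage comment are assumptions, not something taken from this setup:

```shell
#!/bin/sh
# Hypothetical helper: tally the "reason" codes of PCI SERR NMIs found in
# one or more saved kernel logs. Prints one "count reason" line per code.
serr_reasons() {
    sed -n 's/.*NMI: PCI system error (SERR) for reason \([0-9a-f]*\) on CPU.*/\1/p' "$@" |
        sort | uniq -c
}

# Usage (the path is an assumption; adjust it to wherever your crash dump
# tool stores its dmesg files):
#   serr_reasons /var/crash/*/dmesg.*
```

If every dump shows the same reason code, you are most likely chasing a single problem rather than several unrelated ones.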

After hours and hours of work I got it up and stable again by adding this kernel parameter (set it in /etc/default/grub, then run update-grub):

Code:

intel_idle.max_cstate=0
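
For anyone applying the same workaround, here is a minimal sketch of the steps on a Debian-based system. The helper name add_kernel_param is made up for illustration, and it assumes the usual single-line, double-quoted GRUB_CMDLINE_LINUX_DEFAULT="..." entry in /etc/default/grub, so check the file by hand afterwards:

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): append a kernel parameter
# to GRUB_CMDLINE_LINUX_DEFAULT in the given file unless it is already there.
# Assumes the usual double-quoted, single-line format of /etc/default/grub.
add_kernel_param() {
    file=$1
    param=$2
    # Nothing to do if the parameter is already on the command line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter before the closing quote (GNU sed, in-place edit).
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"|\1 $param\"|" "$file"
}

# Usage (as root), left commented out so that sourcing this file changes nothing:
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub intel_idle.max_cstate=0
#   update-grub
#   reboot
# After the reboot, the parameter should show up on the running kernel:
#   grep -o 'intel_idle.max_cstate=[0-9]*' /proc/cmdline
```

The helper is idempotent, so running it twice does not add the parameter a second time.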

There is a huge bug report on the Ubuntu bug tracker for this kind of problem, which supposedly occurs on all HP hardware with a recent kernel. It offers different solutions for different bugs (including the one I had):

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580

Anyone experiencing similar problems?

sumsum · Feb 16, 2016

We are currently doing burn-in tests with several DL360 G9s (all firmware up to date) before we go into production with PVE 4. All servers are currently under full load with KVM VMs. Uptime is approx. 1 month and they are rock solid.
What kind of storage controller do you use?

LnxBil · Feb 16, 2016

On this specific machine I use a SAS2008-based controller with 2x MSA60 shelves, plus Emulex 4 Gbit FC HBAs that are currently unused.

Have you applied the recent kernel fixes and restarted? I did not encounter the problem on an older incarnation of the PVE 4 kernel. Maybe HP fixed the C-state issues on newer machines; the Ubuntu report is some months old.

sumsum · Feb 16, 2016

We have not applied the kernel fix yet; we are waiting for clearer advice. We are running two E5 v3 Xeon CPUs in each server. The bug report mentions the Xeon® Processor E7 v2 in particular as causing issues. I might be wrong, but it seems that not all Xeon CPU types are affected by this problem.

debi@n · Feb 17, 2016

Can you check whether the hpwdt module is loaded?

Code:

lsmod|grep hpwdt
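
A slightly stricter variant of the same check, as a sketch: it matches the module name exactly instead of as a substring and returns an exit status that a script can use. The helper name module_loaded is hypothetical, and the blacklist step in the comments is only one common Debian approach with a made-up file name, not something confirmed in this thread:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named module is listed in a
# /proc/modules-style file (defaults to the live /proc/modules).
module_loaded() {
    name=$1
    modules_file=${2:-/proc/modules}
    # The first field of each line is the module name; require an exact match.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Usage:
#   if module_loaded hpwdt; then echo "hpwdt is loaded"; else echo "hpwdt is not loaded"; fi
#
# Assumption: if the watchdog module turns out to be involved, one way to keep
# it from loading on Debian is a modprobe blacklist entry (file name is made up):
#   echo "blacklist hpwdt" > /etc/modprobe.d/blacklist-hpwdt.conf
#   update-initramfs -u
```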

See this thread for more info: https://forum.proxmox.com/threads/ve-4-0-kernel-panic-on-hp-proliant-servers.24015/
