Kernel crashes on DL360

LnxBil

Distinguished Member
Feb 21, 2015
Saarland, Germany
Hi,

I am writing this post to have something for future reference. I found similar problems on the forums, but not exactly the same one. I debugged a strange recurring crash on one of my DL360 servers with kernel 4.2.6-1-pve. The same crash happens on the stock Debian Jessie kernel, but never on older kernels like those of PVE 3.4 or Debian Wheezy.

The server had run for months, and after I rebooted it (it had not been rebooted across several kernel updates), the system no longer booted correctly and crashed almost immediately (if not during boot, then at most 2 minutes after the login prompt). I always use crashdump on all my systems, so I had 6 crash dumps yesterday, all stating the same NMI, with corresponding entries in the iLO:

Code:
[  139.828501] NMI: PCI system error (SERR) for reason b1 on CPU 0.
[  139.828600] Kernel panic - not syncing: NMI: Not continuing
[  139.828686] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P          IO    4.2.6-1-pve #1
[  139.828839] Hardware name: HP ProLiant DL360 G6, BIOS P64 01/22/2015
[  139.828932]  0000000000000000 f9fd2d75cfd761c9 ffff88040fa05df8 ffffffff818013d8
[  139.829066]  0000000000000000 ffffffff81c8f9ab ffff88040fa05e78 ffffffff817fed2d
[  139.829200]  ffff880400000008 ffff88040fa05e88 ffff88040fa05e28 f9fd2d75cfd761c9
[  139.829333] Call Trace:
[  139.829373]  <NMI>  [<ffffffff818013d8>] dump_stack+0x45/0x57
[  139.829470]  [<ffffffff817fed2d>] panic+0xd0/0x20d
[  139.829544]  [<ffffffff81018cce>] pci_serr_error+0x7e/0x80
[  139.829626]  [<ffffffff81018eee>] default_do_nmi+0xfe/0x100
[  139.829709]  [<ffffffff81018fda>] do_nmi+0xea/0x140
[  139.829783]  [<ffffffff8180a651>] end_repeat_nmi+0x1a/0x1e
[  139.829905]  [<ffffffff814581ff>] ? intel_idle+0xcf/0x140
[  139.830009]  [<ffffffff814581ff>] ? intel_idle+0xcf/0x140
[  139.830089]  [<ffffffff814581ff>] ? intel_idle+0xcf/0x140
[  139.830169]  <<EOE>>  [<ffffffff8168b785>] cpuidle_enter_state+0xb5/0x220
[  139.830277]  [<ffffffff8168b927>] cpuidle_enter+0x17/0x20
[  139.830361]  [<ffffffff810bdbeb>] call_cpuidle+0x3b/0x70
[  139.830440]  [<ffffffff8168b903>] ? cpuidle_select+0x13/0x20
[  139.830524]  [<ffffffff810bdeb7>] cpu_startup_entry+0x297/0x360
[  139.830615]  [<ffffffff817f58bc>] rest_init+0x7c/0x80
[  139.830693]  [<ffffffff81f66029>] start_kernel+0x49a/0x4bb
[  139.830775]  [<ffffffff81f65120>] ? early_idt_handler_array+0x120/0x120
[  139.830873]  [<ffffffff81f654d7>] x86_64_start_reservations+0x2a/0x2c
[  139.831000]  [<ffffffff81f65623>] x86_64_start_kernel+0x14a/0x16d
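
To check whether several crash dumps show the same NMI, something like the following could be used. This is only a rough sketch: `count_serr_panics` is a made-up helper name, and it assumes the dumps are stored the way Debian's kdump-tools usually does it (one directory per crash under /var/crash, each with a `dmesg.*` text file), which may differ on your system. It relies on GNU grep.

```shell
#!/bin/sh
# Hypothetical helper: count the dmesg files below a directory that
# contain the "NMI: PCI system error (SERR)" line from the panic above.
# Assumes dmesg text files named dmesg* (kdump-tools style layout).
count_serr_panics() {
    dir="$1"
    # -r: recurse, -l: list matching files only, -F: fixed string match
    grep -rlF --include='dmesg*' 'NMI: PCI system error (SERR)' "$dir" 2>/dev/null | wc -l
}

# Example call (the path is an assumption, adjust to your crash dump location):
#   count_serr_panics /var/crash
```

If the number matches the number of crash dumps, all of them hit the same SERR NMI.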

After hours and hours of work I got the system up and stable again by adding this kernel parameter (set it in /etc/default/grub, then run update-grub):

Code:
intel_idle.max_cstate=0
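
For reference, the change could be scripted roughly like this. It is a minimal sketch, not a tested procedure: `add_kernel_param` is a made-up helper name, it assumes a standard Debian-style /etc/default/grub with a double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line, and it uses GNU sed's `-i` option. Back up the file before trying it.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file,
# unless that line already contains it.
add_kernel_param() {
    file="$1"
    param="$2"
    # Already present on the active (uncommented) line? Then do nothing.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter before the closing double quote (GNU sed).
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"/" "$file"
}

# Example call as root (commented out so that nothing is changed by accident):
#   add_kernel_param /etc/default/grub intel_idle.max_cstate=0 && update-grub
```

After the next reboot, `cat /proc/cmdline` should list the parameter. On kernels where the intel_idle driver is present, /sys/module/intel_idle/parameters/max_cstate should also show the value.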

There is a huge bug report on the Ubuntu bug tracker for this kind of problem, which reportedly affects all HP hardware running a recent kernel. It offers different solutions for different bugs (including the one I had):

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580

Is anyone experiencing similar problems?
 
We are currently doing burn-in tests with several DL360 G9 servers (all firmware up to date) before we go into production with PVE 4. All servers are currently under full load with KVM VMs. Uptime is approx. 1 month and they are rock solid.
What kind of storage controller do you use?
 
On this specific machine I used a SAS2008-based controller with 2x MSA60 shelves, plus currently unused Emulex 4 Gbit FC HBAs.

Have you applied the recent kernel fixes and restarted? I did not encounter the problem on an older incarnation of the PVE 4 kernel. Maybe HP fixed the C-state issues on newer machines; the Ubuntu report is some months old.
 
We have not applied the kernel fix yet; we are waiting for clearer advice. We are running two E5 v3 Xeon CPUs in each server. The bug report mentions the Xeon® Processor E7 v2 in particular as causing issues. I might be wrong, but it seems that not all Xeon CPU types are affected by this problem.
 
