Hi,
I am writing this post for future reference. I found similar problems on the forums, but none exactly like this one. I debugged a strange recurring crash on one of my DL360 G6 servers running kernel 4.2.6-1-pve. The same crash happens with the stock Debian Jessie kernel, but never with older kernels such as those in PVE 3.4 or Debian Wheezy.
The server had been running for months. After I rebooted it (it had not been rebooted through several kernel updates), the system no longer booted correctly and crashed almost immediately: if not during boot, then at most 2 minutes after the login prompt appeared. I use crash dumps on all my systems, so I had 6 crash dumps from yesterday, all showing the same NMI, with corresponding entries in the iLO log:
Code:
[ 139.828501] NMI: PCI system error (SERR) for reason b1 on CPU 0.
[ 139.828600] Kernel panic - not syncing: NMI: Not continuing
[ 139.828686] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P IO 4.2.6-1-pve #1
[ 139.828839] Hardware name: HP ProLiant DL360 G6, BIOS P64 01/22/2015
[ 139.828932] 0000000000000000 f9fd2d75cfd761c9 ffff88040fa05df8 ffffffff818013d8
[ 139.829066] 0000000000000000 ffffffff81c8f9ab ffff88040fa05e78 ffffffff817fed2d
[ 139.829200] ffff880400000008 ffff88040fa05e88 ffff88040fa05e28 f9fd2d75cfd761c9
[ 139.829333] Call Trace:
[ 139.829373] <NMI> [<ffffffff818013d8>] dump_stack+0x45/0x57
[ 139.829470] [<ffffffff817fed2d>] panic+0xd0/0x20d
[ 139.829544] [<ffffffff81018cce>] pci_serr_error+0x7e/0x80
[ 139.829626] [<ffffffff81018eee>] default_do_nmi+0xfe/0x100
[ 139.829709] [<ffffffff81018fda>] do_nmi+0xea/0x140
[ 139.829783] [<ffffffff8180a651>] end_repeat_nmi+0x1a/0x1e
[ 139.829905] [<ffffffff814581ff>] ? intel_idle+0xcf/0x140
[ 139.830009] [<ffffffff814581ff>] ? intel_idle+0xcf/0x140
[ 139.830089] [<ffffffff814581ff>] ? intel_idle+0xcf/0x140
[ 139.830169] <<EOE>> [<ffffffff8168b785>] cpuidle_enter_state+0xb5/0x220
[ 139.830277] [<ffffffff8168b927>] cpuidle_enter+0x17/0x20
[ 139.830361] [<ffffffff810bdbeb>] call_cpuidle+0x3b/0x70
[ 139.830440] [<ffffffff8168b903>] ? cpuidle_select+0x13/0x20
[ 139.830524] [<ffffffff810bdeb7>] cpu_startup_entry+0x297/0x360
[ 139.830615] [<ffffffff817f58bc>] rest_init+0x7c/0x80
[ 139.830693] [<ffffffff81f66029>] start_kernel+0x49a/0x4bb
[ 139.830775] [<ffffffff81f65120>] ? early_idt_handler_array+0x120/0x120
[ 139.830873] [<ffffffff81f654d7>] x86_64_start_reservations+0x2a/0x2c
[ 139.831000] [<ffffffff81f65623>] x86_64_start_kernel+0x14a/0x16d
After hours and hours of work I got it up and stable again by adding the following kernel parameter (set it in /etc/default/grub, then run update-grub):
Code:
intel_idle.max_cstate=0
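For anyone who wants to script the change, here is a minimal sketch of how the parameter could be appended. It assumes the parameter belongs in the GRUB_CMDLINE_LINUX_DEFAULT line, that the value is double-quoted (as in the Debian default file), and that GNU sed is available. The helper name add_kernel_param is made up for this example and is not part of GRUB.

```shell
#!/bin/sh
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# add_kernel_param is a hypothetical helper, not a GRUB tool.
add_kernel_param() {
    file="$1"    # path to the GRUB defaults file, e.g. /etc/default/grub
    param="$2"   # parameter to add, e.g. intel_idle.max_cstate=0

    # Do nothing if the parameter is already on that line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi

    # Insert the parameter just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
}

# Intended use, as root on the affected host (left commented out on purpose):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub intel_idle.max_cstate=0
#   update-grub
#   reboot
```

After the reboot, `cat /proc/cmdline` should show the parameter if it was picked up.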
There is a huge bug report on the Ubuntu bug tracker for this kind of problem, which is supposed to affect all HP hardware running a recent kernel. It offers different solutions for different bugs (including the one I had):
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580
Is anyone else experiencing similar problems?