VE 4.0 Kernel Panic on HP Proliant servers

mensinck · Oct 19, 2015

We have 2 labs set up with Proxmox VE 4.0 from the latest ISO download.

In one lab we have HP ProLiant servers with massive kernel panics in the hpwdt.ko module.

Unfortunately we do not have the trace due to HP's damned iLO :-( but I will give more info once I have captured it.

We have a Ceph cluster with 3 hosts and 3 monitors up and running in this lab, and everything seems to be quite good.

We can start VMs and also migrate them, but as soon as you activate HA for any VM we get a kernel panic in the hpwdt.ko module.

We have DL 360 G6 (latest BIOS patches) and a DL380 G( running in this lab.

These are the versions we are running:

proxmox-ve: 4.0-16 (running kernel: 4.2.2-1-pve)
pve-manager: 4.0-50 (running version: 4.0-50/d3a6b7e5)
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-23
qemu-server: 4.0-31
pve-firmware: 1.1-7
libpve-common-perl: 4.0-32
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-27
pve-libspice-server1: 0.12.5-1
vncterm: 1.2-1
pve-qemu-kvm: 2.4-10
pve-container: 1.0-10
pve-firewall: 2.0-12
pve-ha-manager: 1.0-10
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.3-1
lxcfs: 0.9-pve2
cgmanager: 0.37-pve2
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie

Is anything known about these kernel panics?

I found some hints googling around.

- Blacklisting hpwdt was suggested, but that is not a solution for VE, since we need the watchdog interfaces.
- I also tried these GRUB kernel parameters:
-- noautogroup and
-- intel_idle.max_cstate=0

with no success.

Since we have no debug symbols for the kernel (I did not find any package for this), I could not use kdump to capture the panic.

Any advice that could help? Or is anyone else having a problem like this?

mensinck · Oct 21, 2015

Hi all.

I investigated a bit more now and found the following:

Kernel modules loaded are:

iTCO_wdt 16384 0
iTCO_vendor_support 16384 1 iTCO_wdt
hpwdt 16384 1

The watchdog-mux service is using this:

Main PID: 1439 (watchdog-mux)
CGroup: /system.slice/watchdog-mux.service
└─1439 /usr/sbin/watchdog-mux

Oct 21 09:25:10 pmx72 watchdog-mux[1439]: Watchdog driver 'HP iLO2+ HW Watchdog Timer', version 0

and a

echo "A" | socat - UNIX-CONNECT:/var/run/watchdog-mux.sock

will instantly generate the kernel panic.

iLO2 firmware is upgraded to 2.29 (07/16/2015)

Maybe this helps someone to assist.
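
For anyone comparing hosts, here is a minimal sketch for pulling the watchdog-related modules out of lsmod output like the listing above. The function name list_watchdog_modules and the name pattern it matches (wdt or softdog) are assumptions made for illustration; they are not part of Proxmox or of any HP tooling.

```shell
# List loaded kernel modules whose name suggests a watchdog driver.
# Reads lsmod-style text on stdin, so it also works on saved output.
# ASSUMPTION: matching on "wdt" or "softdog" is only a naming heuristic.
list_watchdog_modules() {
    # Column 1 is the module name; the "Module Size Used by" header
    # line does not match the pattern and is skipped automatically.
    awk '$1 ~ /(wdt|softdog)/ { print $1 }'
}

# Usage on a live host:
#   lsmod | list_watchdog_modules
```

If more than one hardware watchdog driver shows up (here iTCO_wdt and hpwdt), it is worth checking which one actually sits behind /dev/watchdog; the watchdog-mux log line above reports the driver it opened.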

t.lamprecht · Oct 21, 2015

Does watchdog-mux start successfully and open the watchdog device (/dev/watchdog)?

Can you test whether the watchdog works when watchdog-mux is not running?

Code:

echo "A" > /dev/watchdog

This should reset the machine after a bit.

A kernel panic in the hpwdt.ko module, which is the HP iLO2+ watchdog, sounds more like a bug in the firmware/module; we do nothing special in watchdog-mux besides accessing the kernel's watchdog API.
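
Some background on why the echo test above is expected to reset the machine: with the kernel's watchdog API, opening the device arms the timer, every write counts as a keepalive, and closing the device without first writing the magic character 'V' leaves the timer running. A minimal sketch of the opposite, harmless case (pet a few times, then disarm) is below. The function name wd_pet_then_disarm and its parameters are made up for illustration. It assumes watchdog-mux is stopped, since the device can only be opened by one process, and that the driver supports magic close and was not loaded with nowayout. On the affected HP hosts even this may trigger the panic discussed in this thread.

```shell
# Pet a watchdog device a few times, then disarm it before closing.
# $1 = device path, $2 = number of keepalive writes, $3 = seconds between them
# (hypothetical helper; the defaults are only examples)
wd_pet_then_disarm() {
    wd_dev=${1:-/dev/watchdog}
    wd_count=${2:-3}
    wd_interval=${3:-1}
    {
        # The redirection below opens the device once for the whole group;
        # on a real watchdog device that open is what arms the timer.
        wd_i=0
        while [ "$wd_i" -lt "$wd_count" ]; do
            printf 'A'              # any written byte is a keepalive
            sleep "$wd_interval"
            wd_i=$((wd_i + 1))
        done
        printf 'V'                  # magic close: ask the driver to stop the timer
    } > "$wd_dev"
}
```

Pointing it at a scratch file first, for example wd_pet_then_disarm /tmp/wd-test 3 0, shows exactly which bytes would be written without touching real hardware.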

t.lamprecht · Oct 21, 2015

Does the kernel panic look something like this:

Code:

Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
[...]

After a bit of investigating I found some bug reports regarding your machines, e.g.:
https://bugzilla.redhat.com/show_bug.cgi?id=438741
(very old bug, but still)
Because your firmware is up to date it could be a hardware failure.

Deactivating the module, and thereby falling back to the softdog, would help. This issue is not a Proxmox VE one.

mensinck · Oct 21, 2015

Hi t.lamprecht

t.lamprecht said:
Does the kernel panic look something like this:

Code:

Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details. [...]

This is exactly what I got.

t.lamprecht said:
After a bit of investigating I found some bug reports regarding your machines, e.g.:
https://bugzilla.redhat.com/show_bug.cgi?id=438741
(very old bug, but still)
Because your firmware is up to date it could be a hardware failure.

Deactivating the module, and thereby falling back to the softdog, would help. This issue is not a Proxmox VE one.

You are right, I had already found this as well. I thought it was only related to 3.x kernels.

And blacklisting hpwdt.ko will solve the kernel panic.

Doing

Code:

echo "A" > /dev/watchdog

with watchdog-service off (kernel module hpwdt.ko blacklisted), as well as

Code:

echo "A" | socat - UNIX-CONNECT:/var/run/watchdog-mux.sock

with the service activated, will now reboot the server.

So we can conclude that this is related to the kernel bug with hpwdt, since

Code:

echo "A" > /dev/watchdog

will produce the kernel panic with hpwdt.ko loaded.

Thanks a lot for investigating.

Regards Lukas

mcbarlo · Oct 21, 2015

I have the same problem with an HP DL320e Gen8 v2. If you blacklist the watchdog module, the server does not panic but resets immediately. In the iLO log you will probably find an NMI exception with an error code ending in 2B.

This issue occurs when your server runs out of memory and has a heavy I/O load at the same time. If you use ZFS storage you should have 16 GB of RAM; 8 GB is the absolute minimum.

adamb · Oct 21, 2015

I set up a fresh 3.4 cluster just for testing the upgrade procedure. 3.4 has been running on the cluster for a week now with no issues (before this it was running 3.4 for a few months, so I know it's solid hardware). I followed the steps to upgrade to 4.0 and overall it went well. The only issue is that one of my HP servers now throws an NMI and panics as soon as it boots into the OS. I have an identical server which is not having the issue at all.

They are both HP DL380 Gen9s.

I even tried updating all the firmware/iLO on the node having issues. I find it hard to believe this could be a hardware issue if there are so many of us seeing the issue. We are an HP shop so I have plenty of brand new boxed 380 shells sitting in the warehouse I can test with. Here is what I am seeing in the iLO logs.

An Unrecoverable System Error (NMI) has occurred (iLO application watchdog timeout NMI, Service Information: 0x0000002B, 0x00000000)

I will try to do a shell replacement in the AM and see how it goes.

sigxcpu · Oct 21, 2015

I don't know if it helps with NMI, but you should try kdump to get more information on what is going wrong.

adamb · Oct 21, 2015

sigxcpu said:
I don't know if it helps with NMI, but you should try kdump to get more information on what is going wrong.

I agree, I will dig into that too.

adamb · Oct 22, 2015

I wanted to provide an update. After replacing the shell the issue still persisted. However, I found that the cause is my VM and the large amount of RAM I have assigned. I don't feel the issue I am seeing is the same one as others in this thread.

sigxcpu · Oct 22, 2015

Try to limit memory to a single NUMA node (run numactl -H to see how much memory is allocated per node).

pipomambo · Nov 11, 2015

Hello, we have exactly the same issue. We have a cluster on Proxmox V4.0-48 with two Dell R900 and one HP DL380 G9. This occurs only on the HP server. With the hpwdt module loaded, a kernel panic happens randomly. Without the module the server reboots. This happens at random, but mostly when we use live migration. Did you find a workaround?

adamb · Nov 11, 2015

pipomambo said:
Hello, we have exactly the same issue. We have a cluster on Proxmox V4.0-48 with two Dell R900 and one HP DL380 G9. This occurs only on the HP server. With the hpwdt module loaded, a kernel panic happens randomly. Without the module the server reboots. This happens at random, but mostly when we use live migration. Did you find a workaround?

Doesn't sound quite like the same issue. The kernel panic I see only happens while the VM is starting and CPU load skyrockets. Maybe they are related, but they sound a bit different. If you go back to the 4.1 or 3.9 kernel on the HP, does the issue go away?

pipomambo · Nov 11, 2015

I just updated the kernel from 4.2.2-1 to 4.2.3-2 to test. The issue occurs most often when we use live migration. In a way, the VM stops and starts... but it's a bit different, you are right.

adamb · Nov 11, 2015

pipomambo said:
I just updated the kernel from 4.2.2-1 to 4.2.3-2 to test. The issue occurs most often when we use live migration. In a way, the VM stops and starts... but it's a bit different, you are right.

Still worth trying the older 4.1 or 3.9 kernels. My issue is resolved on the older kernels.

debi@n · Nov 12, 2015

Hello everybody! This is my first post on forum.proxmox. Thank you for this post and the help. I tested this on HP ProLiant servers: iLO + watchdog on Linux produces a kernel panic when you use HA on Proxmox. But you can solve it as follows. The module that causes this is hpwdt. On each HP node, first check that the module is loaded:

Code:

lsmod | grep hpwdt

Stop the watchdog-mux service:

Code:

service watchdog-mux stop

Add the module to the blacklist:

Code:

nano /etc/modprobe.d/pve-blacklist.conf

Add the following line to the file:

Code:

blacklist hpwdt

Save the file and reboot:

Code:

reboot

Check again that the module is no longer loaded:

Code:

lsmod | grep hpwdt

My configuration: 2 HP ProLiant servers + 1 other machine with Proxmox 4. HA is working now.
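
The blacklist step above can also be scripted so that it is safe to run repeatedly on every HP node. This is only a sketch: the function name blacklist_hpwdt is invented here, and the default path is simply the file named in the post above.

```shell
# Append "blacklist hpwdt" to a modprobe config file unless it is already there.
# $1 = config file (default: the file used in the steps above)
# Hypothetical helper, not a Proxmox tool.
blacklist_hpwdt() {
    bl_conf=${1:-/etc/modprobe.d/pve-blacklist.conf}
    # Tolerate leading/trailing whitespace when looking for an existing entry.
    if ! grep -Eq '^[[:space:]]*blacklist[[:space:]]+hpwdt[[:space:]]*$' "$bl_conf" 2>/dev/null; then
        printf 'blacklist hpwdt\n' >> "$bl_conf"
    fi
}

# Usage (as root), followed by the reboot described above:
#   blacklist_hpwdt
```

Running it a second time leaves the file unchanged, which avoids duplicate entries when the same steps are repeated across nodes.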

tatyrza · Nov 16, 2015

Hello! I've got an HP DL320e Gen8 v2 and your solution works for me. Thanks for sharing!

aderumier · Nov 16, 2015

Ubuntu has also disabled it by default.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1432837

But it could be a problem with the iLO configuration when the watchdog is enabled by hpwdt.
Maybe there is an iLO timeout setting somewhere in iLO?

aderumier · Nov 16, 2015

I also found a note here:

https://lkml.org/lkml/2014/4/25/184

"hpwdt can not work as expected if hp-asrd is running simultaneously. Because both hpwdt and hp-asrd update same iLO watchdog timer."

Do you have an hp-asrd daemon running? (Maybe from some HP management packages?)

aderumier · Nov 20, 2015

Hi, another way could be to disable the motherboard watchdog, so that the HP iLO watchdog is used by default.

Code:

# edit /etc/default/grub and set:
GRUB_CMDLINE_LINUX_DEFAULT="nmi_watchdog=0"
# then, as root:
update-grub
reboot
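
Note that setting the whole GRUB_CMDLINE_LINUX_DEFAULT line as above replaces any parameters that were already there (a stock file often contains quiet). A minimal sketch that appends the parameter instead is below. The function name add_grub_param is invented for illustration, and it assumes the parameter contains no characters that are special to sed or grep (nmi_watchdog=0 is fine).

```shell
# Add one kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT, keeping existing ones.
# $1 = parameter, $2 = grub defaults file (default: /etc/default/grub)
# Hypothetical helper; ASSUMES $1 has no regex or sed metacharacters.
add_grub_param() {
    gp_param=$1
    gp_file=${2:-/etc/default/grub}

    # Nothing to do if the parameter is already on the line.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${gp_param}( [^\"]*)?\"" "$gp_file"; then
        return 0
    fi

    gp_tmp=$(mktemp) || return 1
    # Insert " <param>" just before the closing quote of the existing value,
    # then write the result back over the original file.
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 ${gp_param}\"/" "$gp_file" > "$gp_tmp" &&
        cat "$gp_tmp" > "$gp_file"
    gp_status=$?
    rm -f "$gp_tmp"
    return "$gp_status"
}

# Usage (as root), followed by update-grub and a reboot:
#   add_grub_param nmi_watchdog=0
```

As with the manual edit, nothing takes effect until update-grub has been run and the host has been rebooted.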
