VE 4.0 Kernel Panic on HP Proliant servers

mensinck

Member
Oct 19, 2015
14
1
23
Kiel, germany
We have 2 labs set up with Proxmox VE 4.0 from the latest ISO download.

In one lab we have HP ProLiant servers with massive kernel panics in the hpwdt.ko module.

Unfortunately we do not have the trace, due to HP's damned iLO :-( but I will give more info once I have captured it.

We have a Ceph cluster with 3 hosts and 3 monitors up and running in this lab, and everything seems to be quite good.

We can start VMs and also migrate them, but as soon as you activate HA for any VM we get a kernel panic in the hpwdt.ko module.

We have a DL360 G6 (latest BIOS patches) and a DL380 G( running in this lab.

These are the versions we are running:

proxmox-ve: 4.0-16 (running kernel: 4.2.2-1-pve)
pve-manager: 4.0-50 (running version: 4.0-50/d3a6b7e5)
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-23
qemu-server: 4.0-31
pve-firmware: 1.1-7
libpve-common-perl: 4.0-32
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-27
pve-libspice-server1: 0.12.5-1
vncterm: 1.2-1
pve-qemu-kvm: 2.4-10
pve-container: 1.0-10
pve-firewall: 2.0-12
pve-ha-manager: 1.0-10
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.3-1
lxcfs: 0.9-pve2
cgmanager: 0.37-pve2
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie

Is anything known about these kernel panics?

I found some hints googling around.

- Blacklisting hpwdt was suggested, but that is not a solution for VE, since we need the watchdog interfaces.
- I also tried these GRUB kernel parameters, with no success:
-- noautogroup
-- intel_idle.max_cstate=0

Since we have no debug symbols for the kernel (I did not find any package for this), I could not use kdump to capture the panic.
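
One way to capture panic output without debug symbols or kdump is the kernel's netconsole module, which sends kernel messages over UDP to another machine. The following is only a sketch: the helper name `netconsole_arg` is hypothetical, every address in the usage comment is a placeholder, and whether the final NMI panic lines actually make it onto the wire on this hardware is untested.

```shell
#!/bin/sh
# Hypothetical helper: build the value for the netconsole= module option.
# Documented format: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# Arguments: src-port src-ip dev tgt-port tgt-ip tgt-mac
netconsole_arg() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Usage on the panicking host (as root; all values are placeholders):
#   modprobe netconsole "netconsole=$(netconsole_arg 6665 192.0.2.10 eth0 6666 192.0.2.20 00:11:22:33:44:55)"
# On the receiving machine, listen for UDP on the target port (6666 here).
```

The receiving side only needs a UDP listener, so the log survives even if the panicking host resets immediately.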

Any advice that could help? Or is anyone else having a problem like this?
 

mensinck

Member
Oct 19, 2015
14
1
23
Kiel, germany
Hi all.

I investigated a bit more now and found the following:

Kernel modules loaded are:

iTCO_wdt 16384 0
iTCO_vendor_support 16384 1 iTCO_wdt
hpwdt 16384 1

Watchdog-mux service is using this:

Main PID: 1439 (watchdog-mux)
CGroup: /system.slice/watchdog-mux.service
└─1439 /usr/sbin/watchdog-mux

Oct 21 09:25:10 pmx72 watchdog-mux[1439]: Watchdog driver 'HP iLO2+ HW Watchdog Timer', version 0

and running
echo "A" | socat - UNIX-CONNECT:/var/run/watchdog-mux.sock

will instantly generate the kernel panic. :(

iLO2 firmware is upgraded to 2.29 (07/16/2015)


Maybe this helps someone to assist.
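
The module list above can be re-checked on any node with a small filter over `lsmod` output. This is a minimal sketch; the function name `list_wdt_modules` and the matching rule (module names ending in "wdt", plus softdog) are assumptions chosen for illustration, not anything Proxmox ships.

```shell
#!/bin/sh
# Hypothetical helper: print the names of loaded watchdog-related modules.
# Reads lsmod-style text on stdin, so it also works on saved output.
list_wdt_modules() {
    # Column 1 is the module name; keep names ending in "wdt" and softdog.
    awk '$1 ~ /wdt$/ || $1 == "softdog" { print $1 }'
}

# Usage on a live host:
#   lsmod | list_wdt_modules
```

If both a hardware driver such as hpwdt and softdog show up, it is worth checking which one actually owns /dev/watchdog.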
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
3,265
593
133
South Tyrol/Italy
shop.maurer-it.com
Does watchdog-mux start successfully and open the watchdog device (/dev/watchdog)?

Can you test whether the watchdog works when watchdog-mux is not running?

Code:
echo "A" > /dev/watchdog

This should reset the machine after a bit.
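
For context on why that single write leads to a reset: opening the watchdog device arms the timer, any write pings it, and closing the device without first writing the magic character 'V' leaves the timer armed, so it expires. A minimal sketch of both cases follows. The function names are made up for illustration, and the device path is a parameter so the helpers can be tried against a plain file before touching a real watchdog.

```shell
#!/bin/sh
# Hypothetical helpers around the kernel watchdog device API.

# Arm the watchdog and leave it armed: the write pings it once, and closing
# without the magic character lets the timer expire (expected: machine reset).
wdt_arm_and_abandon() {
    dev=${1:-/dev/watchdog}
    printf 'A' > "$dev"
}

# Ping once, then request a disarm with the magic close character 'V'.
# Both writes go through a single open of the device. Drivers running
# with nowayout ignore the disarm request.
wdt_ping_and_disarm() {
    dev=${1:-/dev/watchdog}
    { printf 'A'; printf 'V'; } > "$dev"
}

# Dry run against a scratch file instead of the real device:
#   f=$(mktemp); wdt_ping_and_disarm "$f"; cat "$f"
```

The second helper is useful for checking that the driver can be opened and written to at all without forcing a reset.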

A kernel panic in the hpwdt.ko module, which is the HP iLO2+ watchdog driver, sounds more like a bug in the firmware or the module. We do nothing special in watchdog-mux besides accessing the kernel's watchdog API.
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
3,265
593
133
South Tyrol/Italy
shop.maurer-it.com
Does the kernel panic look something like this:
Code:
Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
[...]

After a bit of investigating I found some bug reports regarding your machines, e.g.:
https://bugzilla.redhat.com/show_bug.cgi?id=438741
(a very old bug, but still)
Because your firmware is up to date, it could be a hardware failure.

Deactivating the module, and thereby falling back to softdog, would help. This issue is not a Proxmox VE one.
 

mensinck

Member
Oct 19, 2015
14
1
23
Kiel, germany
Hi t.lamprecht

Regarding the panic message you asked about:
Code:
Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
[...]

This is exactly what I got.

Regarding the old Red Hat bug report (https://bugzilla.redhat.com/show_bug.cgi?id=438741) and the suggestion to deactivate the module and fall back to softdog: you are right, I had already found this as well. I thought it was only related to 3.x kernels.

And blacklisting hpwdt.ko does solve the kernel panic.

Doing

Code:
echo "A" > /dev/watchdog
with the watchdog-mux service off (kernel module hpwdt.ko blacklisted), as well as
Code:
echo "A" | socat - UNIX-CONNECT:/var/run/watchdog-mux.sock
with the service activated will reboot the server now.

So we can conclude that this is related to the kernel bug with hpwdt, since
Code:
echo "A" > /dev/watchdog
will produce the kernel panic with hpwdt.ko loaded.

Thanks a lot for investigating.

Regards Lukas
 

mcbarlo

New Member
Oct 10, 2015
10
0
1
I have the same problem with an HP DL320e Gen8 v2. If you blacklist the watchdog module, the server does not panic but resets immediately. In the iLO log you will probably find an NMI exception with an error code ending in 2B.

This issue occurs when your server runs out of memory and has a heavy I/O load at the same time. If you use ZFS storage you should have 16 GB RAM; 8 GB is the absolute minimum.
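
To check a node against those figures, here is a rough sketch that classifies the MemTotal value from /proc/meminfo. The function name `ram_class` and the labels it prints are hypothetical. The cut-offs of 7 and 15 GiB are an assumption that leaves some slack, because MemTotal is always somewhat lower than the installed RAM.

```shell
#!/bin/sh
# Hypothetical helper: classify total RAM against the 8 GB / 16 GB figures.
# Reads /proc/meminfo-style text on stdin; MemTotal is reported in kB.
ram_class() {
    awk '/^MemTotal:/ {
        gib = $2 / 1024 / 1024
        if (gib < 7)       print "below-minimum"
        else if (gib < 15) print "minimum"
        else               print "recommended"
    }'
}

# Usage on a live host:
#   ram_class < /proc/meminfo
```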
 

adamb

Renowned Member
Mar 1, 2012
1,165
44
68
I set up a fresh 3.4 cluster just for testing the upgrade procedure. 3.4 has been running on the cluster for a week now with no issues (before this the hardware ran 3.4 for a few months, so I know it's solid). I followed the steps to upgrade to 4.0 and overall it went well. The only issue is that now 1 of my HP servers throws an NMI and panics as soon as it boots into the OS. I have an identical server which is not having the issue at all.

They are both HP DL380 Gen9's.

I even tried updating all the firmware/iLO on the node having issues. I find it hard to believe this could be a hardware issue if there are so many of us seeing the issue. We are an HP shop so I have plenty of brand new boxed 380 shells sitting in the warehouse I can test with. Here is what I am seeing in the iLO logs.

An Unrecoverable System Error (NMI) has occurred (iLO application watchdog timeout NMI, Service Information: 0x0000002B, 0x00000000)

I will try to do a shell replacement in the AM and see how it goes.
 

adamb

Renowned Member
Mar 1, 2012
1,165
44
68
I wanted to provide an update. After replacing the shell the issue still persisted. However, I found that the cause is my VM and the large amount of RAM I have assigned. I don't feel the issue I am seeing is the same one as others in this thread.
 

pipomambo

New Member
Nov 11, 2015
2
0
1
Hello, we have exactly the same issue. We have a cluster on Proxmox V4.0-48 with two Dell R900s and one HP DL380 G9. This occurs only on the HP server. With the hpwdt module loaded, a kernel panic happens; without the module, the server reboots. This happens at random, but mostly when we use live migration. Did you find a workaround?
 

adamb

Renowned Member
Mar 1, 2012
1,165
44
68
Doesn't sound quite like the same issue. The kernel panic I see only happens while the VM is starting and CPU load skyrockets. Maybe they are related, but they sound a bit different. If you go back to the 4.1 or 3.9 kernel on the HP, does the issue go away?
 

pipomambo

New Member
Nov 11, 2015
2
0
1
I just updated the kernel from 4.2.2-1 to 4.2.3-2 to test. The issue occurs most often when we use live migration. In a way, the VM stops and starts then too... but it's a bit different, you are right.
 

adamb

Renowned Member
Mar 1, 2012
1,165
44
68
Still worth trying the older 4.1 or 3.9 kernels. My issue is resolved on the older kernels.
 

debi@n

Member
Nov 12, 2015
113
0
16
Málaga,Spain
Hello everybody! This is my first post on the Proxmox forum. Thank you for this thread and the help. I tested this on HP ProLiant servers: the iLO watchdog on Linux produces a kernel panic when you use HA on Proxmox. The module that causes this is hpwdt, and you can solve it by doing the following on each HP node.
Check that the module is loaded:
Code:
lsmod | grep hpwdt
Stop the watchdog-mux service:
Code:
service watchdog-mux stop
Add the module to the blacklist:
Code:
nano /etc/modprobe.d/pve-blacklist.conf
Add the following line to the file:
Code:
blacklist hpwdt
Save the file and reboot:
Code:
reboot
Check again that the module is no longer loaded:
Code:
lsmod | grep hpwdt
My configuration: 2 HP ProLiant servers + 1 other machine with Proxmox 4. HA is working now :)
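
The blacklist step above can also be sketched as a small idempotent helper, so that running it twice does not add the line twice. The function name `blacklist_module` is hypothetical; the default file path is the one used in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: append "blacklist <module>" to a modprobe config
# file unless that exact line is already present.
blacklist_module() {
    module=$1
    conf=${2:-/etc/modprobe.d/pve-blacklist.conf}
    touch "$conf" || return 1
    if ! grep -qxF "blacklist $module" "$conf"; then
        printf 'blacklist %s\n' "$module" >> "$conf"
    fi
}

# Usage on each HP node (as root), followed by a reboot:
#   blacklist_module hpwdt
#   reboot
#   lsmod | grep hpwdt    # should print nothing after the reboot
```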
 

aderumier

Member
May 14, 2013
203
18
18
I also found a note here:

https://lkml.org/lkml/2014/4/25/184

"hpwdt can not work as expected if hp-asrd is running simultaneously. Because both hpwdt and hp-asrd update same iLO watchdog timer."


Do you have an hp-asrd daemon running? (Maybe from some HP management packages?)
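
A quick way to answer that is to look for the exact process name in the process list. This is a minimal sketch; the helper name `proc_listed` is hypothetical, and it assumes the daemon's command name is literally hp-asrd, as in the note quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a process named exactly $1 appears in the
# list of command names on stdin (one name per line).
proc_listed() {
    grep -qxF "$1"
}

# Usage on a live host:
#   ps -e -o comm= | proc_listed hp-asrd && echo "hp-asrd is running"
```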


 

aderumier

Member
May 14, 2013
203
18
18
Hi, another way could be to disable the motherboard watchdog, so that the HP iLO watchdog is used by default.
Code:
# edit /etc/default/grub and set:
GRUB_CMDLINE_LINUX_DEFAULT="nmi_watchdog=0"
# then, as root:
update-grub
reboot
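
After the reboot it is worth confirming that the parameter really reached the kernel. This is a small sketch that checks a kernel command line for an exact parameter; the helper name `cmdline_has` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exact parameter $1 appears in the
# kernel command line given on stdin.
cmdline_has() {
    # Put each space-separated parameter on its own line, then match whole lines.
    tr ' ' '\n' | grep -qxF "$1"
}

# Usage after the reboot:
#   cmdline_has nmi_watchdog=0 < /proc/cmdline && echo "parameter is active"
```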
 