PVE Kernel crash on HP Bladeserver after upgrade to PVE 2.2

philister

Member
Jun 18, 2012
Hello,

we're running PVE on 4 HP BL460c blade servers: two are G1, one is G6 and one is G7. They're all running in a cluster, and everything worked fine until about two weeks ago, when I updated to PVE 2.2. Since then, the PVE kernel has crashed regularly on the G6 blade, shortly after booting up and a bit of clicking around in the web GUI. The G1 and G7 blades are running fine with the same kernel version as the G6.

At first I thought it was a hardware problem, but HP have now replaced basically every part of the server and the problem still persists. Today I reinstalled PVE 2.1 on it, and it seems to be OK again - the web GUI is much more responsive and the server doesn't crash any more.

I've attached syslog excerpts showing the stack traces of three crashes. It seems like I get a different error every time.

It'd be great if someone had an idea.

Thanks very much.
 

Attachments

  • syslog_20121120-2115.txt (7.2 KB)
  • syslog_20121120-2230.txt (19.5 KB)
  • syslog_20121128-1630.txt (6.1 KB)
Can you please test with the latest kernel from the pvetest repository?

Yes, sure.

How can I install just the kernel from the test repo? Or should I just add the test repo to my sources.list and do a dist-upgrade (that would install all the packages from the test repo, wouldn't it?)?
 
Whatever you like. You can also just download the kernel with wget and install it with dpkg -i ...
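
For example, a minimal sketch of the wget/dpkg route (the URL and package filename below are placeholders - substitute the actual pve-kernel .deb published in the pvetest repository):

Code:
# download the kernel package from the pvetest repository (placeholder URL and filename)
wget http://download.proxmox.com/<path-to-pvetest>/pve-kernel-2.6.32-17-pve_<version>_amd64.deb
# install only this package; everything else stays on the stable repository
dpkg -i pve-kernel-2.6.32-17-pve_<version>_amd64.deb
# reboot afterwards to boot into the newly installed kernel

After the reboot, `uname -r` should report the new kernel version.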
 
Whatever you like. You can also just download the kernel with wget and install it with dpkg -i ...

I installed the pve 2.6.32-17 kernel with dpkg -i. Unfortunately, it didn't solve the problem. I rebooted the server and just refreshed the web GUI (F5), and it crashed again - see the attached screenshot.

Any ideas?
pve-2.6.32-17-kernel-crash.png
 
Any more ideas? I hope I won't have to throw the server away ...

It runs just fine with PVE 2.1, but I guess it's not recommended to use different PVE versions within one cluster, is it?
 
Using a different kernel within the same cluster is most times no problem.
 
Hello Tom,

"most times no problem" is the same as "sometimes big problem", right? :)

Is there any chance of getting to the root of the problem? I mean the fact that the latest pve kernel regularly crashes on a major vendor's hardware. I'm not an expert, but it looks like a driver issue to me.

What do you suggest? Would you continue with the 2.1 kernel in a 2.2 cluster if it were your environment?

Thanks a lot for any advice.
 
Hello Tom,

"most times no problem" is the same as "sometimes big problem", right? :)

No, it means that we do not test all possible combinations here in our test lab, and therefore we cannot guarantee that they work. We test the latest kernels, same version on all nodes - the recommended setup.

Is there any chance of getting to the root of the problem? I mean the fact that the latest pve kernel regularly crashes on a major vendor's hardware. I'm not an expert, but it looks like a driver issue to me.

What do you suggest? Would you continue with the 2.1 kernel in a 2.2 cluster if it were your environment?

Thanks a lot for any advice.

As HP does not test their servers against Proxmox VE, I have no idea whether it's a driver issue or faulty hardware triggered by the latest kernel. I have no such box here in our lab, so I cannot test this.
 
I can imagine that this is a very tough one to solve. HP has now replaced all components of the server, so I doubt it's faulty hardware. What I'd like to ask you is:

1. Can I give you any more information so you can maybe fix the problem in the next kernel version? I'd be glad to test dev kernels on our server.
2. In what respect does the pve kernel differ from the standard Debian one?
3. Can you say anything more specific about the error from the logs I sent you?

Thank you very much.
 
Hi, I've got exactly the same problem :) I already posted in the topic http://forum.proxmox.com/threads/11870-Updates-for-Proxmox-VE-2-2:

"The problem server is an HP ProLiant BL460c G6. The other (working) servers in the cluster are:
1) HP ProLiant BL620c G7
2) HP ProLiant BL460c G7"

I hope that I really won't have any problems with my two remaining G1 blades (they are now running VMware ESX). I'm currently migrating my VMs from VMware to the existing cluster with the two G7 blades. I had to remove one G6 blade from the cluster because of kernel panics happening shortly after boot. I also hope that this will be fixed in the next major upgrade - I don't want to create a separate cluster running version 2.1 on the two remaining G6 blades :)
 
Hi, I've got exactly the same problem :)

Well, congratulations! ;-)

I hope that I really won't have any problems with my two remaining G1 blades (they are now running VMware ESX). I'm currently migrating my VMs from VMware to the existing cluster with the two G7 blades.

Looks like you're about to do exactly the same job as we did. In our case, the G1 and G7 blades run fine under PVE 2.2; it's only the G6 that has problems.
 
I experience the same problem with a Dell PowerEdge 620, while version 2.2 runs fine on the older Dell servers. Has anyone solved this issue or found its cause?
 
I have no solution so far, unfortunately. I have found out that it's not related to the kernel, though: the problem also occurs when I'm using a kernel from the 2.0 release together with the pve tools from the 2.2 release.

In our case, we just bought two more G7 blades, which are running fine, and removed the G6 from the cluster. I'll test every couple of months whether the problem is fixed in the latest pve release. Hopefully one day we'll be able to let the G6 join the cluster again.
 
I have no solution so far, unfortunately. I have found out that it's not related to the kernel, though: the problem also occurs when I'm using a kernel from the 2.0 release together with the pve tools from the 2.2 release.

In our case, we just bought two more G7 blades, which are running fine, and removed the G6 from the cluster. I'll test every couple of months whether the problem is fixed in the latest pve release. Hopefully one day we'll be able to let the G6 join the cluster again.

So will I :)
 
I believe the G6 blades have onboard 10GbE adapters, and there have been a few posts saying that the Broadcom drivers in the kernel are broken for 10GbE adapters - bnx2x, I think. Not sure what chipset the NICs in the G7s use. *After a quick search: the G7 has Emulex chipset NICs.*
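
If it helps to confirm what the G6 is actually running, the NIC driver and firmware can be checked directly on the node (eth0 is just an example interface name):

Code:
# list the Ethernet controllers on the blade
lspci | grep -i ethernet
# show the driver name, driver version and firmware bound to an interface (eth0 is an example)
ethtool -i eth0
# check whether the bnx2x module is loaded
lsmod | grep bnx2x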
 
I noticed that "bnx2x" is listed as a loaded module in the crash dump in your initial posting. It is entirely possible, even likely, that it is the culprit, as many others have had bnx2x-related kernel panics.

I would suggest trying a more recent version of that kernel module. Another user of this forum has built a patched driver ("patched" meaning it contains small changes to the original driver so that it compiles on pve systems). You can get it here: http://forum.proxmox.com/threads/12064-DRBD-Assistance?p=66041#post66041

Here's the procedure:

Code:
# unpack the patched driver sources
tar -zxf netxtreme2-7.2.20_fix_build_2.6.32-16-pve.tar.gz
cd netxtreme2-7.2.20
# install the matching kernel headers and the build toolchain
apt-get install pve-headers-`uname -r` build-essential
# build the driver against the running kernel
make
# swap the running module for the freshly built one
rmmod bnx2x
insmod ./bnx2x/src/bnx2x.ko
# install the module permanently and rebuild the initramfs so it survives reboots
make install
update-initramfs -u

If, however, the apt-get step causes the kernel to panic (as it did for me), you need a workaround like this one: http://forum.proxmox.com/threads/12530-Kernel-Panic?p=68134#post68134
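
After a reboot it's also worth verifying that the patched module is really the one in use; a quick check could look like this (eth0 is just an example interface name):

Code:
# version of the installed bnx2x module file on disk
modinfo bnx2x | grep -i ^version
# driver and version actually bound to the running interface (eth0 is an example)
ethtool -i eth0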
 
I did all the steps described in the previous post, and it helped... partly.
The G6 blade server is now functioning (I haven't tried to add it to my cluster yet; I want to test its stability first). While I was testing the server, I noticed that syslog shows the same output as it did during the kernel panics:


------------[ cut here ]------------
WARNING: at net/core/dev.c:1700 skb_gso_segment+0x1e2/0x2c0() (Tainted: G W --------------- )
Hardware name: ProLiant BL460c G6
tun: caps=(0x801b0049, 0x0) len=2462 data_len=1012 ip_summed=1
Modules linked in: vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nfs lockd fscache nfs_acl auth_rpcgss sunrpc nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle vhost_net iptable_filter xt_multiport macvtap xt_limit xt_dscp macvlan tun ipt_REJECT kvm_intel ip_tables kvm vzevent ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp fuse libiscsi_tcp libiscsi scsi_transport_iscsi radeon ttm drm_kms_helper drm snd_pcsp snd_pcm ipmi_si i7core_edac tpm_tis ipmi_msghandler tpm snd_timer serio_raw edac_core tpm_bios snd i2c_algo_bit hpilo i2c_core soundcore video snd_page_alloc power_meter shpchp hpwdt output ext3 jbd mbcache hpsa cciss bnx2x mdio [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper veid: 0 Tainted: G W --------------- 2.6.32-17-pve #1
Call Trace:
<IRQ> [<ffffffff8106cf88>] ? warn_slowpath_common+0x88/0xc0
[<ffffffff8106d076>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8145d022>] ? skb_gso_segment+0x1e2/0x2c0
[<ffffffff81461fde>] ? dev_hard_start_xmit+0x19e/0x510
[<ffffffff8149604c>] ? ip_local_deliver_finish+0x11c/0x310
[<ffffffff8147d15a>] ? sch_direct_xmit+0x15a/0x1d0
[<ffffffff81462878>] ? dev_queue_xmit+0x528/0x740
[<ffffffff814f95d0>] ? br_dev_queue_push_xmit+0x60/0xc0
[<ffffffff814f9688>] ? br_forward_finish+0x58/0x60
[<ffffffff814f972a>] ? __br_forward+0x9a/0xc0
[<ffffffff814f97b5>] ? br_forward+0x65/0x70
[<ffffffff814fa7e1>] ? br_handle_frame_finish+0x221/0x300
[<ffffffff814faa82>] ? br_handle_frame+0x1c2/0x270
[<ffffffff8145c9fe>] ? __netif_receive_skb+0x45e/0x750
[<ffffffff8145ef18>] ? netif_receive_skb+0x58/0x60
[<ffffffff8145f030>] ? napi_skb_finish+0x50/0x70
[<ffffffff81461729>] ? napi_gro_receive+0x39/0x50
[<ffffffffa0042c6e>] ? bnx2x_rx_int+0xcce/0x1720 [bnx2x]
[<ffffffff8111a42b>] ? perf_pmu_enable+0x2b/0x40
[<ffffffff8111fbc8>] ? perf_event_task_tick+0xa8/0x2f0
[<ffffffff8105f102>] ? select_task_rq_fair+0x9f2/0xaf0
[<ffffffff810f098e>] ? rcu_start_gp+0x1be/0x230
[<ffffffffa004377c>] ? bnx2x_poll+0xbc/0x300 [bnx2x]
[<ffffffff81461843>] ? net_rx_action+0x103/0x2e0
[<ffffffff81075dc3>] ? __do_softirq+0x103/0x260
[<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
[<ffffffff8100df35>] ? do_softirq+0x65/0xa0
[<ffffffff81075bed>] ? irq_exit+0xcd/0xd0
[<ffffffff81530595>] ? do_IRQ+0x75/0xf0
[<ffffffff8100bb13>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff812d311e>] ? intel_idle+0xde/0x170
[<ffffffff812d3101>] ? intel_idle+0xc1/0x170
[<ffffffff8109df6d>] ? sched_clock_cpu+0xcd/0x110
[<ffffffff814290f7>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e63>] ? cpu_idle+0xb3/0x110
[<ffffffff8150f385>] ? rest_init+0x85/0x90
[<ffffffff81c2df6e>] ? start_kernel+0x412/0x41e
[<ffffffff81c2d33a>] ? x86_64_start_reservations+0x125/0x129
[<ffffffff81c2d438>] ? x86_64_start_kernel+0xfa/0x109
---[ end trace 12a48083e16763d9 ]---
vmbr0: port 2(tap150i0) entering disabled state
vmbr0: port 2(tap150i0) entering disabled state



Despite this, the server continues working: I can create, run and delete VMs.
I'm not sure I can use such a node in my production cluster, though.
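
One experiment that might be worth trying, assuming the warning really is triggered by GSO handling in the bnx2x receive path (this is only a guess, not a confirmed fix): turn off GSO/TSO offload on the affected interface with ethtool and watch whether the warnings stop (eth0 is an example interface name):

Code:
# show the current offload settings of the interface
ethtool -k eth0
# experimentally disable generic segmentation offload and TCP segmentation offload
ethtool -K eth0 gso off tso off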
 
