2.6.32-19-pve lock up problem when using Infiniband

e100 · Mar 26, 2013

I have updated about 50% of my nodes to the latest version running 2.6.32-19-pve.

Since doing so I have had one machine lock up 3 times, and another has locked up twice, the first time was right after booting the new kernel for the first time.
All of the machines having the issues are the ones that have the very latest updates, the others are humming along perfectly like they have been for months.
Nothing in the logs, they just abruptly stop on the 'locked up' machine.
In logs on other servers I see they detect the failure and fence the locked up node.

The hardware in the two servers are very different, one is AMD the other is Intel.
Both do have an Areca 1880 and the same model Mellanox IB pcie card, these are the only common components between the two machines that have had issues.

I am going to get a serial port logger setup and configure the machine to send kernel messages to the serial port in hopes I can capture a some messages indicating what the issue is.

Having repeated downtime is not an option for some of our vms.
Would it be ok to revert to 2.6.32-18-pve temporarily on those machines?

Code:

# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-93
pve-kernel-2.6.32-13-pve: 2.6.32-72
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
pve-kernel-2.6.32-19-pve: 2.6.32-93
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-18
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-6
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-8
ksm-control-daemon: 1.1-1

e100 · Mar 28, 2013

Re: 2.6.32-19-pve lock up problem

I was able to capture some data over the serial port!

Code:

------------[ cut here ]------------

WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Not tainted)
Hardware name: X9DRL-3F/iF

list_del corruption. next->prev should be ffff880141d2b4d0, but was dead000000200200

Modules linked in: dm_snapshot ext4 jbd2 sha256_generic aesni_intel cryptd aes_x86_64 aes_generic cbc vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nfs lockd fscache nfs_acl auth_rpcgss sunrpc nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS vhost_net macvtap macvlan iptable_mangle iptable_filter tun xt_multiport xt_limit kvm_intel xt_dscp ipt_REJECT ip_tables kvm dlm configfs drbd acpi_cpufreq mperf cpufreq_powersave cpufreq_conservative cpufreq_ondemand cpufreq_stats freq_table vzevent ib_iser rdma_cm iw_cm ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse bonding 8021q garp ib_ipoib ib_cm ib_sa ipv6 coretemp dm_crypt sb_edac snd_pcsp snd_pcm snd_timer snd tpm_tis soundcore tpm ib_mthca i2c_i801 tpm_bios ioatdma snd_page_alloc edac_core ib_mad dca i2c_core ib_core ext3 jbd mbcache sg isci libsas arcmsr ahci e1000e scsi_transport_sas [last un
loaded: scsi_wait_scan]

Pid: 1386, comm: ipoib veid: 0 Not tainted 2.6.32-19-pve #1

Call Trace:
 [<ffffffff8106d6c8>] ? warn_slowpath_common+0x88/0xc0
 [<ffffffff8106d7b6>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff812837ed>] ? list_del+0x8d/0xa0
 [<ffffffffa0295699>] ? ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
 [<ffffffffa02955d0>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
 [<ffffffff81090660>] ? worker_thread+0x190/0x2d0
 [<ffffffff81096c26>] ? kthread+0x96/0xa0
 [<ffffffff8100c1aa>] ? child_rip+0xa/0x20
 [<ffffffff81096b90>] ? kthread+0x0/0xa0
 [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
---[ end trace c873cc51fda5e760 ]---

Pid: 1386, comm: ipoib veid: 0 Tainted: G        W  ---------------    2.6.32-19-pve #1 042stab075_2 Supermicro X9DRL-3F/iF/X9DRL-3F/iF
RIP: 0010:[<ffffffff8128377b>]  [<ffffffff8128377b>] list_del+0x1b/0xa0
RSP: 0018:ffff88106e987db0  EFLAGS: 00010046
RAX: dead000000100100 RBX: ffff880fecd3e750 RCX: 00000000000052b7
RDX: 0000000000000246 RSI: ffff881079590a50 RDI: ffff880fecd3e750
RBP: ffff88106e987dc0 R08: ffff880fecd3e750 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88106e988020
R13: 0000000000000246 R14: ffff8810789285c0 R15: ffff88106e9886e0
FS:  0000000000000000(0000) GS:ffff880069c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000b77b9000 CR3: 0000000ffdc24000 CR4: 00000000000426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ipoib (pid: 1386, veid: 0, threadinfo ffff88106e986000, task ffff8810733d93a0)
Stack:
 000000010a4a3e19 ffff880fecd3e740 ffff88106e987e30 ffffffffa0295699

<d> 000000000000fd88 ffff8810733d9968 ffff88106e9892e8 ffff88106e988340
<d> 0000000000000000 0000000000000008 ffff88106e987e10 ffffe8ffffa0b9c0

Call Trace:
 [<ffffffffa0295699>] ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
 [<ffffffffa02955d0>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
 [<ffffffff81090660>] worker_thread+0x190/0x2d0
 [<ffffffff81097200>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff810904d0>] ? worker_thread+0x0/0x2d0
 [<ffffffff81096c26>] kthread+0x96/0xa0
 [<ffffffff8100c1aa>] child_rip+0xa/0x20
 [<ffffffff81096b90>] ? kthread+0x0/0xa0
 [<ffffffff8100c1a0>] ? child_rip+0x0/0x20

Code: 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48

RIP  [<ffffffff8128377b>] list_del+0x1b/0xa0
 RSP <ffff88106e987db0>

---[ end trace c873cc51fda5e761 ]---

Kernel panic - not syncing: Fatal exception in interrupt

Pid: 1386, comm: ipoib veid: 0 Tainted: G      D W  ---------------    2.6.32-19-pve #1

Call Trace:
 [<ffffffff8151d432>] ? panic+0xa0/0x168
 [<ffffffff81521d32>] ? oops_end+0xf2/0x100
 [<ffffffff8100f28b>] ? die+0x5b/0x90
 [<ffffffff81521862>] ? do_general_protection+0x152/0x160
 [<ffffffff81521025>] ? general_protection+0x25/0x30
 [<ffffffff8128377b>] ? list_del+0x1b/0xa0
 [<ffffffffa0295699>] ? ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
 [<ffffffffa02955d0>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
 [<ffffffff81090660>] ? worker_thread+0x190/0x2d0
 [<ffffffff81097200>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff810904d0>] ? worker_thread+0x0/0x2d0
 [<ffffffff81096c26>] ? kthread+0x96/0xa0
 [<ffffffff8100c1aa>] ? child_rip+0xa/0x20
 [<ffffffff81096b90>] ? kthread+0x0/0xa0
 [<ffffffff8100c1a0>] ? child_rip+0x0/0x20

Found this:
https://jira.hpdd.intel.com/browse/LU-2967

That bug sends me to here, where the issue is confirmed:
https://bugzilla.redhat.com/show_bug.cgi?id=913645

Do we have to wait for an upstream release to get this fixed?

dietmar · Mar 29, 2013

Re: 2.6.32-19-pve lock up problem

e100 said:
Do we have to wait for an upstream release to get this fixed?

Please can you test with: ftp://download.proxmox.com/debian/d.../pve-kernel-2.6.32-19-pve_2.6.32-94_amd64.deb

e100 · Mar 29, 2013

Re: 2.6.32-19-pve lock up problem

dietmar said:
Please can you test with: ftp://download.proxmox.com/debian/d.../pve-kernel-2.6.32-19-pve_2.6.32-94_amd64.deb

This is installed on the same node that generated the stack trace above.
Will be monitoring to see if the problem is resolved.

udo · Apr 5, 2013

Re: 2.6.32-19-pve lock up problem

dietmar said:
Please can you test with: ftp://download.proxmox.com/debian/d.../pve-kernel-2.6.32-19-pve_2.6.32-94_amd64.deb

Hi,
I will also start to test on two nodes with infiniband today.

Udo

e100 · Apr 6, 2013

dietmar said:
Please can you test with: ftp://download.proxmox.com/debian/d.../pve-kernel-2.6.32-19-pve_2.6.32-94_amd64.deb

Running fine for a week now using this kernel.

e100 · Aug 25, 2013

Re: 2.6.32-19-pve lock up problem

dietmar,

Is it possible this problem is back in 2.6.32-23-pve?
Updated servers from 2.6.32-22-pve to 2.6.32-23-pve today and one has locked up already and got fenced after running about 8 hours.

No logs of why it locked up, I will need to hook up my serial logger to catch something.

e100 · Aug 29, 2013

Re: 2.6.32-19-pve lock up problem

If anyone is curious, there is no problem with 2.6.32-23-pve.
My problem has been traced to a bad 16GB ECC Ram module according to mcelog output.

JensDoe · Aug 29, 2013

Re: 2.6.32-19-pve lock up problem

Thanks for providing feedback. I was afraid to upgrade.

Search

Search

2.6.32-19-pve lock up problem when using Infiniband

e100

Famous Member

e100

Famous Member

dietmar

Proxmox Staff Member

e100

Famous Member

udo

Distinguished Member

e100

Famous Member

e100

Famous Member

e100

Famous Member

JensDoe

Renowned Member

We value your privacy