2.6.32-19-pve lock up problem when using Infiniband

e100

I have updated about 50% of my nodes to the latest version running 2.6.32-19-pve.

Since doing so, one machine has locked up 3 times and another has locked up twice; its first lock-up was right after booting the new kernel for the first time.
All of the machines having issues are the ones with the very latest updates; the others are humming along perfectly, as they have been for months.
There is nothing in the logs on the locked-up machine; they just stop abruptly.
In the logs on the other servers I can see that they detect the failure and fence the locked-up node.

The hardware in the two servers is very different: one is AMD, the other Intel.
Both do have an Areca 1880 and the same model of Mellanox IB PCIe card; these are the only components common to the two machines that have had issues.

I am going to set up a serial port logger and configure the machine to send kernel messages to the serial port, in hopes of capturing some messages that indicate what the issue is.
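Roughly what I have in mind is sketched below; the console device (ttyS0) and baud rate (115200) are assumptions based on my hardware, so adjust for yours.

Code:
# /etc/default/grub on the affected node
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"

# apply the new config and reboot into it
update-grub && reboot

# on the logging machine, capture everything coming over the serial line
# (screen -L writes a screenlog.0 file in the current directory)
screen -L /dev/ttyUSB0 115200

The two console= entries keep kernel messages on the local VGA console as well as the serial line.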

Having repeated downtime is not an option for some of our VMs.
Would it be ok to revert to 2.6.32-18-pve temporarily on those machines?
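If it is, this is roughly how I would pin the old kernel; the GRUB entry index below is a placeholder, and I would check /boot/grub/grub.cfg for the real one.

Code:
# assumption: pve-kernel-2.6.32-18-pve may need to be (re)installed first
apt-get install pve-kernel-2.6.32-18-pve

# list the boot entries and note the position (0-based) of the -18 kernel
grep "^menuentry" /boot/grub/grub.cfg

# set GRUB_DEFAULT in /etc/default/grub to that index, e.g. GRUB_DEFAULT=2,
# then regenerate the config and reboot into the old kernel
update-grub
reboot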

Code:
# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-93
pve-kernel-2.6.32-13-pve: 2.6.32-72
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
pve-kernel-2.6.32-19-pve: 2.6.32-93
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-18
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-6
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-8
ksm-control-daemon: 1.1-1
 
Re: 2.6.32-19-pve lock up problem

I was able to capture some data over the serial port!

Code:
------------[ cut here ]------------

WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Not tainted)
Hardware name: X9DRL-3F/iF

list_del corruption. next->prev should be ffff880141d2b4d0, but was dead000000200200

Modules linked in: dm_snapshot ext4 jbd2 sha256_generic aesni_intel cryptd aes_x86_64 aes_generic cbc vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nfs lockd fscache nfs_acl auth_rpcgss sunrpc nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS vhost_net macvtap macvlan iptable_mangle iptable_filter tun xt_multiport xt_limit kvm_intel xt_dscp ipt_REJECT ip_tables kvm dlm configfs drbd acpi_cpufreq mperf cpufreq_powersave cpufreq_conservative cpufreq_ondemand cpufreq_stats freq_table vzevent ib_iser rdma_cm iw_cm ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse bonding 8021q garp ib_ipoib ib_cm ib_sa ipv6 coretemp dm_crypt sb_edac snd_pcsp snd_pcm snd_timer snd tpm_tis soundcore tpm ib_mthca i2c_i801 tpm_bios ioatdma snd_page_alloc edac_core ib_mad dca i2c_core ib_core ext3 jbd mbcache sg isci libsas arcmsr ahci e1000e scsi_transport_sas [last un
loaded: scsi_wait_scan]

Pid: 1386, comm: ipoib veid: 0 Not tainted 2.6.32-19-pve #1

Call Trace:
 [<ffffffff8106d6c8>] ? warn_slowpath_common+0x88/0xc0
 [<ffffffff8106d7b6>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff812837ed>] ? list_del+0x8d/0xa0
 [<ffffffffa0295699>] ? ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
 [<ffffffffa02955d0>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
 [<ffffffff81090660>] ? worker_thread+0x190/0x2d0
 [<ffffffff81096c26>] ? kthread+0x96/0xa0
 [<ffffffff8100c1aa>] ? child_rip+0xa/0x20
 [<ffffffff81096b90>] ? kthread+0x0/0xa0
 [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
---[ end trace c873cc51fda5e760 ]---

Pid: 1386, comm: ipoib veid: 0 Tainted: G        W  ---------------    2.6.32-19-pve #1 042stab075_2 Supermicro X9DRL-3F/iF/X9DRL-3F/iF
RIP: 0010:[<ffffffff8128377b>]  [<ffffffff8128377b>] list_del+0x1b/0xa0
RSP: 0018:ffff88106e987db0  EFLAGS: 00010046
RAX: dead000000100100 RBX: ffff880fecd3e750 RCX: 00000000000052b7
RDX: 0000000000000246 RSI: ffff881079590a50 RDI: ffff880fecd3e750
RBP: ffff88106e987dc0 R08: ffff880fecd3e750 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88106e988020
R13: 0000000000000246 R14: ffff8810789285c0 R15: ffff88106e9886e0
FS:  0000000000000000(0000) GS:ffff880069c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000b77b9000 CR3: 0000000ffdc24000 CR4: 00000000000426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ipoib (pid: 1386, veid: 0, threadinfo ffff88106e986000, task ffff8810733d93a0)
Stack:
 000000010a4a3e19 ffff880fecd3e740 ffff88106e987e30 ffffffffa0295699

<d> 000000000000fd88 ffff8810733d9968 ffff88106e9892e8 ffff88106e988340
<d> 0000000000000000 0000000000000008 ffff88106e987e10 ffffe8ffffa0b9c0

Call Trace:
 [<ffffffffa0295699>] ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
 [<ffffffffa02955d0>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
 [<ffffffff81090660>] worker_thread+0x190/0x2d0
 [<ffffffff81097200>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff810904d0>] ? worker_thread+0x0/0x2d0
 [<ffffffff81096c26>] kthread+0x96/0xa0
 [<ffffffff8100c1aa>] child_rip+0xa/0x20
 [<ffffffff81096b90>] ? kthread+0x0/0xa0
 [<ffffffff8100c1a0>] ? child_rip+0x0/0x20

Code: 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48

RIP  [<ffffffff8128377b>] list_del+0x1b/0xa0
 RSP <ffff88106e987db0>

---[ end trace c873cc51fda5e761 ]---

Kernel panic - not syncing: Fatal exception in interrupt

Pid: 1386, comm: ipoib veid: 0 Tainted: G      D W  ---------------    2.6.32-19-pve #1

Call Trace:
 [<ffffffff8151d432>] ? panic+0xa0/0x168
 [<ffffffff81521d32>] ? oops_end+0xf2/0x100
 [<ffffffff8100f28b>] ? die+0x5b/0x90
 [<ffffffff81521862>] ? do_general_protection+0x152/0x160
 [<ffffffff81521025>] ? general_protection+0x25/0x30
 [<ffffffff8128377b>] ? list_del+0x1b/0xa0
 [<ffffffffa0295699>] ? ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
 [<ffffffffa02955d0>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
 [<ffffffff81090660>] ? worker_thread+0x190/0x2d0
 [<ffffffff81097200>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff810904d0>] ? worker_thread+0x0/0x2d0
 [<ffffffff81096c26>] ? kthread+0x96/0xa0
 [<ffffffff8100c1aa>] ? child_rip+0xa/0x20
 [<ffffffff81096b90>] ? kthread+0x0/0xa0
 [<ffffffff8100c1a0>] ? child_rip+0x0/0x20

Found this:
https://jira.hpdd.intel.com/browse/LU-2967

That bug led me here, where the issue is confirmed:
https://bugzilla.redhat.com/show_bug.cgi?id=913645

Do we have to wait for an upstream release to get this fixed?
 
Re: 2.6.32-19-pve lock up problem

dietmar,

Is it possible this problem is back in 2.6.32-23-pve?
I updated servers from 2.6.32-22-pve to 2.6.32-23-pve today, and one has already locked up and been fenced after running for about 8 hours.

There are no logs showing why it locked up; I will need to hook up my serial logger again to catch something.
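As a fallback if I cannot attach the serial cable quickly, I may try netconsole instead; the IPs, interface, port, and MAC address below are placeholders for my setup, not values from the crashed node.

Code:
# on the node under test: stream kernel messages over UDP
# netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6666@192.168.1.10/eth0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff

# on the receiving machine: log everything arriving on that port
# (netcat flavours differ; this is the traditional Debian nc syntax)
nc -u -l -p 6666 | tee node-console.log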
 
Re: 2.6.32-19-pve lock up problem

Thanks for providing feedback. I was afraid to upgrade. :-)
 
