This just started recently. Nothing has changed with the cluster in months, and each host server had an uptime of 80+ days before this took place. On node #2 we saw the message below, and because of it node #2 fenced node #1. Everything comes back up clean with no issues, but I can't come up with a reason for the hung task warning. Hoping someone can point me in the right direction. I honestly think it's hardware, since I have seven other clusters just like this one without the issue; I wonder if something flaky is going on with the 10Gb adapter on node #1.
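In case it's relevant, here is roughly what I've been checking on node #1 for signs of adapter trouble (eth2 is just a placeholder for whatever the 10Gb interface is actually named on that box):

Code:
# eth2 is a placeholder for the 10Gb interface name
ip -s link show eth2                           # RX/TX error and drop counters
ethtool -S eth2 | grep -iE 'err|drop|discard'  # NIC-level error statistics
dmesg | grep -i 'link is'                      # look for link up/down flaps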
Code:
root@lanprox1:~# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-93
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-19-pve: 2.6.32-93
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-18
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-6
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-8
ksm-control-daemon: 1.1-1
Code:
INFO: task rgmanager:185909 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rgmanager D ffff88186d8e6e60 0 185909 2810 0 0x00000000
ffff880c36b67c80 0000000000000086 ffffffff810b4900 ffff880c36b67c08
ffffffff00000000 ffffea0018924000 ffff88186f4f6ec0 ffff881858a146c0
ffffea0018924000 ffff880c36b67ca8 ffff88186d8e7428 000000000001e9c0
Call Trace:
[<ffffffff810b4900>] ? get_futex_key+0x180/0x2b0
[<ffffffff8127a4de>] ? number+0x2ee/0x320
[<ffffffff815207c5>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff81520956>] rwsem_down_read_failed+0x26/0x30
[<ffffffff8127e794>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff8151fe44>] ? down_read+0x24/0x30
[<ffffffff8100bcae>] ? invalidate_interrupt1+0xe/0x20
[<ffffffffa047a077>] dlm_user_request+0x47/0x1b0 [dlm]
[<ffffffff8118477b>] ? __kmalloc+0xfb/0x270
[<ffffffff81184acf>] ? kmem_cache_alloc_trace+0x1df/0x200
[<ffffffffa04870e0>] device_write+0x5e0/0x730 [dlm]
[<ffffffff81198b48>] vfs_write+0xb8/0x1a0
[<ffffffff81199441>] sys_write+0x51/0x90
[<ffffffff8152137e>] ? do_device_not_available+0xe/0x10
[<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
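If I'm reading the trace right, rgmanager is blocked in dlm_user_request waiting on a semaphore, which is part of why I suspect the cluster interconnect rather than storage. If it happens again, this is what I plan to capture right away on the surviving node (assuming the standard redhat-cluster tools that ship with PVE 2.3):

Code:
# snapshot of cluster / DLM state to grab next time a node hangs
cman_tool status     # quorum and membership summary
cman_tool nodes      # per-node state as cman sees it
fence_tool ls        # fence domain membership
dlm_tool ls          # DLM lockspaces and their current state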