This just started recently. Nothing has changed with the cluster in months, and each host server had an uptime of 80+ days before this took place. On node #2 we saw the message below, and because of it node #2 fenced node #1. Everything comes back up clean with no issues, but I can't come up with a reason for the hung task warning. Hoping someone can point me in the right direction. I honestly think it's hardware, since I have seven other clusters just like this one without the issue; I wonder if something flaky is going on with the 10Gb adapter on node #1.
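In case it's relevant, here is roughly what I've been checking on node #1 for signs of adapter trouble (eth2 is just a placeholder for whatever the 10Gb interface is actually named on that box):

Code:
# eth2 is a placeholder for the 10Gb interface name
ip -s link show eth2                           # RX/TX error and drop counters
ethtool -S eth2 | grep -iE 'err|drop|discard'  # NIC-level error statistics
dmesg | grep -i 'link is'                      # look for link up/down flaps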
Code:
root@lanprox1:~# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-93
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-19-pve: 2.6.32-93
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-18
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-6
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-8
ksm-control-daemon: 1.1-1
Code:
INFO: task rgmanager:185909 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rgmanager D ffff88186d8e6e60 0 185909 2810 0 0x00000000
ffff880c36b67c80 0000000000000086 ffffffff810b4900 ffff880c36b67c08
ffffffff00000000 ffffea0018924000 ffff88186f4f6ec0 ffff881858a146c0
ffffea0018924000 ffff880c36b67ca8 ffff88186d8e7428 000000000001e9c0
Call Trace:
[<ffffffff810b4900>] ? get_futex_key+0x180/0x2b0
[<ffffffff8127a4de>] ? number+0x2ee/0x320
[<ffffffff815207c5>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff81520956>] rwsem_down_read_failed+0x26/0x30
[<ffffffff8127e794>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff8151fe44>] ? down_read+0x24/0x30
[<ffffffff8100bcae>] ? invalidate_interrupt1+0xe/0x20
[<ffffffffa047a077>] dlm_user_request+0x47/0x1b0 [dlm]
[<ffffffff8118477b>] ? __kmalloc+0xfb/0x270
[<ffffffff81184acf>] ? kmem_cache_alloc_trace+0x1df/0x200
[<ffffffffa04870e0>] device_write+0x5e0/0x730 [dlm]
[<ffffffff81198b48>] vfs_write+0xb8/0x1a0
[<ffffffff81199441>] sys_write+0x51/0x90
[<ffffffff8152137e>] ? do_device_not_available+0xe/0x10
[<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
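If I'm reading the trace right, rgmanager is blocked in dlm_user_request waiting on a semaphore, which is part of why I suspect the cluster interconnect rather than storage. If it happens again, this is what I plan to capture right away on the surviving node (assuming the standard redhat-cluster tools that ship with PVE 2.3):

Code:
# snapshot of cluster / DLM state to grab next time a node hangs
cman_tool status     # quorum and membership summary
cman_tool nodes      # per-node state as cman sees it
fence_tool ls        # fence domain membership
dlm_tool ls          # DLM lockspaces and their current state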