Rgmanager Kernel Oops

adamb

This just started recently. Nothing on the cluster has changed in months, and each host server had an uptime of 80+ days before this took place. On node #2 we saw the message below, and because of it node #2 fenced node #1. Everything comes back up clean with no issues, but I can't come up with a reason for the kernel oops. Hoping someone can point me in the right direction. I'm honestly thinking it's hardware, since I have seven other clusters just like this one without the issue. I wonder if something flaky is going on with the 10Gb adapter in node #1.

Code:
INFO: task rgmanager:185909 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rgmanager     D ffff88186d8e6e60     0 185909   2810    0 0x00000000
 ffff880c36b67c80 0000000000000086 ffffffff810b4900 ffff880c36b67c08
 ffffffff00000000 ffffea0018924000 ffff88186f4f6ec0 ffff881858a146c0
 ffffea0018924000 ffff880c36b67ca8 ffff88186d8e7428 000000000001e9c0
Call Trace:
 [<ffffffff810b4900>] ? get_futex_key+0x180/0x2b0
 [<ffffffff8127a4de>] ? number+0x2ee/0x320
 [<ffffffff815207c5>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff81520956>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8127e794>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8151fe44>] ? down_read+0x24/0x30
 [<ffffffff8100bcae>] ? invalidate_interrupt1+0xe/0x20
 [<ffffffffa047a077>] dlm_user_request+0x47/0x1b0 [dlm]
 [<ffffffff8118477b>] ? __kmalloc+0xfb/0x270
 [<ffffffff81184acf>] ? kmem_cache_alloc_trace+0x1df/0x200
 [<ffffffffa04870e0>] device_write+0x5e0/0x730 [dlm]
 [<ffffffff81198b48>] vfs_write+0xb8/0x1a0
 [<ffffffff81199441>] sys_write+0x51/0x90
 [<ffffffff8152137e>] ? do_device_not_available+0xe/0x10
 [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
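
Reading the trace, rgmanager is blocked in dlm_user_request waiting on a read-write semaphore, so its write to the DLM character device never returns. If it happens again I plan to dump the DLM state before the fence fires. A rough sketch of what I'd run (dlm_tool from redhat-cluster-pve; the "rgmanager" lockspace name is just my guess, I'd take the real one from the ls output):

Code:
# List active DLM lockspaces and their current state
dlm_tool ls

# Dump the locks held in a given lockspace ("rgmanager" is a guess;
# use whatever name 'dlm_tool ls' actually reports)
dlm_tool lockdump rgmanager

# Quorum/membership view, to confirm the cluster itself was healthy
cman_tool status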


Code:
root@lanprox1:~# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-93
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-19-pve: 2.6.32-93
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-18
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-6
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-8
ksm-control-daemon: 1.1-1
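
The only trail I could find afterwards was in the cluster logs. In case it helps, this is roughly how I pulled the fence event out (log paths are the usual redhat-cluster locations; adjust if yours land elsewhere):

Code:
# Fence daemon log - shows which node was chosen as victim and why
grep -i fence /var/log/cluster/fenced.log

# rgmanager's own log around the time of the hang
grep -i rgmanager /var/log/cluster/rgmanager.log | tail -50

# Kernel ring buffer, for the hung-task message itself
dmesg | grep -A 20 'blocked for more than 120 seconds'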
 
Node #1
Code:
root@lanprox1:~# fence_tool ls
fence domain
member count 2
victim count 0
victim now 0
master nodeid 2
wait state none
members 1 2

Node #2
Code:
root@lanprox2:~# fence_tool ls
fence domain
member count 2
victim count 0
victim now 0
master nodeid 2
wait state none
members 1 2
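
To rule the 10Gb NIC in or out, I'm going to start watching its error counters. A minimal sketch of what I'd check (eth2 is a placeholder for whatever interface the adapter shows up as):

Code:
# Per-driver stats; rising CRC/drop/error counters point at cabling,
# optics, or the card itself
ethtool -S eth2 | grep -iE 'err|drop|crc'

# Link flaps usually leave a trail in the kernel log
dmesg | grep -i eth2

# Sanity-check negotiated speed/duplex and link state
ethtool eth2 | grep -E 'Speed|Duplex|Link'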
 
