rgmanager running per cli but not pve

We've not seen the old issue.

rgmanager is not installed as a package on our cluster:
Code:
Package: rgmanager                       
New: yes
State: not installed
Version: 3.0.12-2
Priority: optional
Section: admin
Maintainer: Debian HA Maintainers <debian-ha-maintainers@lists.alioth.debian.org>
Uncompressed Size: 975 k
Depends: libc6 (>= 2.3.2), libccs3 (>= 3.0.12), libcman3 (>= 3.0.12), libdlm3 (>= 3.0.12), libldap-2.4-2 (>= 2.4.7),
         liblogthread3 (>= 3.0.12), libncurses5 (>= 5.7+20100313), libslang2 (>= 2.0.7-1), libxml2 (>= 2.7.4), cman (=
         3.0.12-2), iproute, iputils-arping, iputils-ping, nfs-kernel-server, nfs-common, perl, gawk, net-tools
Conflicts: nfs-user-server
Description: Red Hat cluster suite - clustered resource group manager
 This package is part of the Red Hat Cluster Suite, a complete high-availability solution. 
 
 Resource Group Manager provides high availability of critical server applications in the event of planned or
 unplanned system downtime.

yet it is running:
Code:
fbc241  ~ # ps -ef|grep rgmanager
root        2800       1  0 Aug18 ?        00:00:00 rgmanager
root        2802    2800  0 Aug18 ?        00:23:10 rgmanager
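
For reference: aptitude above is only describing the stock Debian rgmanager package; the binary that is actually running is presumably shipped by one of the Proxmox cluster packages (redhat-cluster-pve appears in the pveversion output later in this thread). A quick sketch of how to confirm which installed package owns it:
Code:
# pid 2800 is the rgmanager parent process from the ps output above
readlink -f /proc/2800/exe
dpkg -S "$(readlink -f /proc/2800/exe)"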
 
Maybe you can test again?

Bug 105 is not fixed; the issue persists.

Before breaking the network:
Code:
# clustat
Cluster Status for Inhouse @ Wed Sep 19 09:46:34 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vm1                                                                 1 Online, Local, rgmanager
 vm2                                                                 2 Online, rgmanager
 vm3                                                                 3 Online, rgmanager

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 service:masterIP                                                 vm1                                                              started

After all nodes have been removed from the network:
Code:
# clustat
Timed out waiting for a response from Resource Group Manager
Cluster Status for Inhouse @ Wed Sep 19 09:50:15 2012
Member Status: Inquorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vm1                                                                 1 Online, Local
 vm2                                                                 2 Offline
 vm3                                                                 3 Offline

Restore network communications; note that rgmanager is no longer shown:
Code:
# clustat
Timed out waiting for a response from Resource Group Manager
Cluster Status for Inhouse @ Wed Sep 19 09:52:07 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vm1                                                                 1 Online, Local
 vm2                                                                 2 Online
 vm3                                                                 3 Online

rgmanager is still running:
Code:
# ps ax|grep rgmanager
   1965 ?        S<Ls   0:00 rgmanager
   1967 ?        S<l    0:00 rgmanager

Some nodes log a kernel message after a period of time:
Code:
INFO: task rgmanager:3552 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rgmanager     D ffff88043602d380     0  3552   1778    0 0x00000000
 ffff880433ea7c70 0000000000000086 00000000006315a0 ffff880433ea7bf8
 ffffffff8104976e ffff880433ea7c18 ffff8804335d73a0 00000000006315a0
 0000000000000001 ffff88043602d920 ffff880433ea7fd8 ffff880433ea7fd8
Call Trace:
 [<ffffffff8104976e>] ? flush_tlb_page+0x5e/0xa0
 [<ffffffff81155eae>] ? do_wp_page+0x4fe/0x9c0
 [<ffffffff81527cf5>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff81527e86>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8104f02d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff8127c4f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff81527374>] ? down_read+0x24/0x30
 [<ffffffff8100bd0e>] ? invalidate_interrupt0+0xe/0x20
 [<ffffffffa04775f7>] dlm_user_request+0x47/0x240 [dlm]
 [<ffffffff81180a19>] ? __kmalloc+0xf9/0x270
 [<ffffffff81180b4f>] ? __kmalloc+0x22f/0x270
 [<ffffffffa0484ed9>] device_write+0x5f9/0x7d0 [dlm]
 [<ffffffff81194b78>] vfs_write+0xb8/0x1a0
 [<ffffffff81195591>] sys_write+0x51/0x90
 [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
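
The backtrace ends in dlm_user_request, i.e. rgmanager is blocked in a write to the DLM user device. A rough sketch of what could be inspected at that point (standard cluster3 tools; these are suggestions, not output from this thread):
Code:
dlm_tool ls              # list DLM lockspaces (rgmanager uses a lockspace named "rgmanager")
fence_tool ls            # fence domain state; a pending fence operation blocks DLM recovery
cat /proc/3552/stack     # kernel stack of the blocked task (pid taken from the message above)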

The only way to get rgmanager working again is to restart each node.
rgmanager cannot be stopped, restarted, or killed once it gets into this state.

Very simple to reproduce: just turn off the network switch that carries the cluster traffic, wait to lose quorum, and turn it back on.
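
A rough command-level sketch of the reproduction, assuming the 3-node setup from the clustat output above and the standard init script names (the symptoms are the ones described in this post):
Code:
clustat                        # all members Online with the rgmanager flag
# power off the switch carrying the cluster traffic and wait for quorum loss:
clustat                        # Member Status: Inquorate
# power the switch back on and wait for membership to reform:
cman_tool status               # cluster is quorate again
clustat                        # times out, rgmanager flag missing on every member
# attempts to recover rgmanager on an affected node hang:
/etc/init.d/rgmanager stop     # never returns
kill -9 <rgmanager-pid>        # process stays blocked in the kernel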
 
Very simple to reproduce: just turn off the network switch that carries the cluster traffic, wait to lose quorum, and turn it back on.

Besides, the error scenario is strange. An HA cluster works as long as at least one partition has quorum. If you lose quorum on all nodes you are likely to run into many problems, and most of the time you need to manually recover the cluster. That being said, you should avoid losing quorum on all nodes (use a redundant network).
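
For reference, a minimal sketch of a redundant cluster link on a PVE 2.x node with two NICs going to two switches (interface names and addresses are examples; see the Proxmox wiki for the full procedure):
Code:
# /etc/network/interfaces (fragment)
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode active-backup

auto vmbr0
iface vmbr0 inet static
        address  192.168.10.11
        netmask  255.255.255.0
        gateway  192.168.10.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0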
 
I agree that a redundant network is desirable.

If you lose quorum on all nodes it should not cause a problem. Nothing should change; no quorum = no changes.

So if nothing changes when losing quorum everywhere, why is it impossible to regain quorum and continue working?

When communications are restored, quorum is regained.
The only issue is that rgmanager deadlocks; it is even impossible to restart rgmanager manually.
How can I manually recover when the rgmanager daemon will not let me stop it?

Having to reboot EVERY node in the cluster to recover from this issue is very disruptive and not much of a solution.
 
Quorum is needed to decide which nodes get fenced.
With no nodes having quorum, nothing gets fenced.

Are you talking about the very special case where you lose quorum at 'exactly' the same time (within a few ms) on all nodes?
Besides, rgmanager stops all services when it loses quorum (see 'man rgmanager').
 
Are you talking about the very special case where you lose quorum at 'exactly' the same time (within a few ms) on all nodes?
Besides, rgmanager stops all services when it loses quorum (see 'man rgmanager').

In this instance rgmanager does not seem to stop anything; it locks up instead.
If it is supposed to stop services when quorum is lost, then that needs to be fixed; right now it just locks up and does not stop anything.

Also, it is not necessary to turn off the switch; that is just a simple way to trigger the issue.
Start pulling network cables one by one until you lose quorum.

Pulling two network cables out of a three-node cluster will result in the same lock-up condition; I know because I have done this.
I do not know if the same issue happens with two cables pulled from a 4-node cluster, or three from a 5-node cluster. Someone else will need to test this.
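
For reference, assuming one vote per node and the default cman quorum calculation, every one of those scenarios ends with all partitions inquorate (a rough sketch, not from this thread):
Code:
# quorum threshold = floor(expected_votes / 2) + 1
#   3 nodes: quorum = 2 -> two nodes isolated: largest partition has 1 vote,  all inquorate
#   4 nodes: quorum = 3 -> two nodes isolated: largest partition has 2 votes, all inquorate
#   5 nodes: quorum = 3 -> three nodes isolated: largest partition has 2 votes, all inquorate
# the live values can be checked with:
cman_tool status | grep -iE 'votes|quorum'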

I agree that redundant network switches/connections *should* be used; I already have multiple clusters set up this way.
But with small clusters and small budgets, people will use a single switch.
Someone will accidentally unplug the switch some day, or bring the network down while editing VLANs; people make mistakes and crap happens.

It would be beneficial to gracefully recover from this situation rather than having to restart every single node, disrupting every VM running in the entire cluster.
 
It would be beneficial to gracefully recover from this situation rather than having to restart every single node, disrupting every VM running in the entire cluster.

I will try to debug that when I have some spare time.
 
Hi all,

Exactly the same problem here.

Switches turned off.

rgmanager not killable, not stoppable.

root@px1:~# pveversion -v
pve-manager: 2.2-26 (pve-manager/2.2/c1614c8c)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-80
pve-kernel-2.6.32-16-pve: 2.6.32-80
pve-kernel-2.6.32-14-pve: 2.6.32-74
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-1
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-28
qemu-server: 2.0-64
pve-firmware: 1.0-21
libpve-common-perl: 1.0-37
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-34
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1


Christophe.
 
Of course.
Each and every device is redundant: bonding, RAID 5 & BBU + spare, switches, redundant power supplies on the servers and the SAN, and so on.
But both switches are behind the same UPS: bad luck!
And rgmanager is dead: I hope it is not by design!

Christophe.
 
But both switches are behind the same UPS: bad luck!
And rgmanager is dead: I hope it is not by design!

Again, you need to make your network redundant.

If you lose quorum on all nodes, you need to manually restart the cluster.
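
For what it is worth, a sketch of what "manually restart the cluster" usually amounts to on a PVE 2.x node, assuming the standard init scripts (as reported above, the rgmanager stop may hang in this particular state, in which case rebooting the node remains the only option):
Code:
/etc/init.d/rgmanager stop     # stop the resource group manager first
/etc/init.d/cman stop          # then the cluster membership layer
/etc/init.d/cman start
/etc/init.d/rgmanager start
clustat                        # verify members and the rgmanager flag again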
 
