Has bug 105 been tested to confirm the issue is fixed?
Bug 105 is not closed and the only comments on it are mine.
Package: rgmanager
New: yes
State: not installed
Version: 3.0.12-2
Priority: optional
Section: admin
Maintainer: Debian HA Maintainers <debian-ha-maintainers@lists.alioth.debian.org>
Uncompressed Size: 975 k
Depends: libc6 (>= 2.3.2), libccs3 (>= 3.0.12), libcman3 (>= 3.0.12), libdlm3 (>= 3.0.12), libldap-2.4-2 (>= 2.4.7),
         liblogthread3 (>= 3.0.12), libncurses5 (>= 5.7+20100313), libslang2 (>= 2.0.7-1), libxml2 (>= 2.7.4),
         cman (= 3.0.12-2), iproute, iputils-arping, iputils-ping, nfs-kernel-server, nfs-common, perl, gawk, net-tools
Conflicts: nfs-user-server
Description: Red Hat cluster suite - clustered resource group manager
This package is part of the Red Hat Cluster Suite, a complete high-availability solution.
Resource Group Manager provides high availability of critical server applications in the event of planned or
unplanned system downtime.
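For completeness, a quick way to double-check which rgmanager version is actually installed on a given node (just a sketch using a standard dpkg query):

dpkg -s rgmanager | grep -E '^(Status|Version):'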
fbc241 ~ # ps -ef|grep rgmanager
root 2800 1 0 Aug18 ? 00:00:00 rgmanager
root 2802 2800 0 Aug18 ? 00:23:10 rgmanager
Maybe you can test again?
# clustat
Cluster Status for Inhouse @ Wed Sep 19 09:46:34 2012
Member Status: Quorate
 Member Name                      ID   Status
 ------ ----                      ---- ------
 vm1                                 1 Online, Local, rgmanager
 vm2                                 2 Online, rgmanager
 vm3                                 3 Online, rgmanager

 Service Name             Owner (Last)             State
 ------- ----             ----- ------             -----
 service:masterIP         vm1                      started
# clustat
Timed out waiting for a response from Resource Group Manager
Cluster Status for Inhouse @ Wed Sep 19 09:50:15 2012
Member Status: Inquorate
 Member Name                      ID   Status
 ------ ----                      ---- ------
 vm1                                 1 Online, Local
 vm2                                 2 Offline
 vm3                                 3 Offline
# clustat
Timed out waiting for a response from Resource Group Manager
Cluster Status for Inhouse @ Wed Sep 19 09:52:07 2012
Member Status: Quorate
 Member Name                      ID   Status
 ------ ----                      ---- ------
 vm1                                 1 Online, Local
 vm2                                 2 Online
 vm3                                 3 Online
# ps ax|grep rgmanager
1965 ? S<Ls 0:00 rgmanager
1967 ? S<l 0:00 rgmanager
INFO: task rgmanager:3552 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rgmanager D ffff88043602d380 0 3552 1778 0 0x00000000
ffff880433ea7c70 0000000000000086 00000000006315a0 ffff880433ea7bf8
ffffffff8104976e ffff880433ea7c18 ffff8804335d73a0 00000000006315a0
0000000000000001 ffff88043602d920 ffff880433ea7fd8 ffff880433ea7fd8
Call Trace:
[<ffffffff8104976e>] ? flush_tlb_page+0x5e/0xa0
[<ffffffff81155eae>] ? do_wp_page+0x4fe/0x9c0
[<ffffffff81527cf5>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff81527e86>] rwsem_down_read_failed+0x26/0x30
[<ffffffff8104f02d>] ? check_preempt_curr+0x6d/0x90
[<ffffffff8127c4f4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff81527374>] ? down_read+0x24/0x30
[<ffffffff8100bd0e>] ? invalidate_interrupt0+0xe/0x20
[<ffffffffa04775f7>] dlm_user_request+0x47/0x240 [dlm]
[<ffffffff81180a19>] ? __kmalloc+0xf9/0x270
[<ffffffff81180b4f>] ? __kmalloc+0x22f/0x270
[<ffffffffa0484ed9>] device_write+0x5f9/0x7d0 [dlm]
[<ffffffff81194b78>] vfs_write+0xb8/0x1a0
[<ffffffff81195591>] sys_write+0x51/0x90
[<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
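For what it's worth, a minimal sketch of how one might confirm on a live node that rgmanager is stuck in uninterruptible sleep and trigger the same kind of blocked-task dump on demand (the commands are my own suggestion, not part of the original report, and assume the magic SysRq interface is enabled):

ps -eo pid,stat,wchan:30,cmd | grep '[r]gmanager'   # STAT containing 'D' = uninterruptible sleep
echo w > /proc/sysrq-trigger                         # ask the kernel to dump blocked tasks
dmesg | tail -n 60                                   # the rgmanager call trace shows up here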
Very simple to reproduce: just turn off the network switch that carries the cluster traffic, wait until quorum is lost, and then turn it back on.
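For anyone who cannot physically power off a switch, a rough equivalent (my own sketch, not what was originally done) is to drop the cluster heartbeat traffic with iptables on every node at roughly the same time; this assumes the default corosync/cman UDP ports 5404-5405:

iptables -I INPUT -p udp --dport 5404:5405 -j DROP
iptables -I OUTPUT -p udp --dport 5404:5405 -j DROP
until clustat | grep -q Inquorate; do sleep 5; done   # wait for quorum loss
iptables -D INPUT -p udp --dport 5404:5405 -j DROP
iptables -D OUTPUT -p udp --dport 5404:5405 -j DROP
clustat   # on an affected node this now times out waiting for rgmanager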
If you lose quorum on all nodes, it should not cause a problem. Nothing should change: no quorum = no changes.
Quorum is needed to decide which nodes get fenced. No quorum means fencing is starting, so shouldn't all services stop to be on the safe side?
Quorum is needed to decide which nodes get fenced.
With no nodes having quorum, nothing gets fenced.
Are you talking about the very special case where you lose quorum at 'exactly' the same time (within a few ms) on all nodes?
Besides, rgmanager stops all services when it loses quorum (see 'man rgmanager').
It would be beneficial to recover gracefully from this situation rather than having to restart every single node, which disrupts every single VM running in the entire cluster.
Switches turned off.
rgmanager not killable, not stoppable.
But both switches are behind the same UPS: bad luck!
And rgmanager is dead: I hope it is not by design!
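To illustrate the 'not killable' part, this is roughly what it looks like (a sketch; even SIGKILL has no effect while the processes are blocked inside the kernel in state D):

kill -9 $(pidof rgmanager)
ps -o pid,stat,cmd -p "$(pidof rgmanager)"   # the processes are still there, state unchanged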