rgmanager deadlock

e100

If you do not have redundant switches and network ports for your cluster communications this is an issue you will want to be aware of.

My cluster communication runs over bonded interfaces across redundant switches, each with redundant power supplies connected to redundant power feeds.
So hopefully this issue never happens to me in real life.

But we all know that sometimes things go wrong, so I wanted to know what would happen if both of my switches failed at the same time.

So I disconnected all the network cables.
The first obvious thing is that quorum is lost; that was expected.
But then some unexpected things happened when I reconnected all the cables.

Quorum came back, but rgmanager is missing!:
Code:
# clustat
Timed out waiting for a response from Resource Group Manager
Cluster Status for kmitestcluster @ Mon Feb 27 12:28:35 2012

Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vm5                                                                 1 Online
 vm6                                                                 2 Online, Local
 disaster                                                            3 Offline
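
Quorum itself is fine, it is only rgmanager that is not answering. You can double-check the quorum side with cman_tool (just a sketch, your output will differ):
Code:
# membership and quorum look healthy according to cman, only rgmanager is stuck
cman_tool status | grep -i quorum
cman_tool nodes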

Then after a few minutes I get kernel messages!
Code:
INFO: task rgmanager:3378 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rgmanager     D ffff8806310f2c80     0  3378      1    0 0x00000000
 ffff88062d7db9d0 0000000000000046 0000000000000000 ffff88062d7db948
 000000018109b3dd ffff880000000000 ffff88063fc0c300 ffff88004179e200
 0000000000000069 ffff8806310f3220 ffff88062d7dbfd8 ffff88062d7dbfd8
Call Trace:
 [<ffffffff8104d31d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff814ff235>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ff3c6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8126b8c4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff814fe8b4>] ? down_read+0x24/0x30
 [<ffffffffa052872d>] dlm_clear_proc_locks+0x3d/0x2a0 [dlm]
 [<ffffffff811a63ef>] ? destroy_inode+0x4f/0x60
 [<ffffffff811a26a5>] ? __d_free+0x45/0x60
 [<ffffffffa0533c66>] device_close+0x66/0xc0 [dlm]
 [<ffffffff8118cea5>] __fput+0xf5/0x280
 [<ffffffff8118d055>] fput+0x25/0x30
 [<ffffffff811885dd>] filp_close+0x5d/0x90
 [<ffffffff8106dbbf>] put_files_struct+0x7f/0xf0
 [<ffffffff8106dc83>] exit_files+0x53/0x70
 [<ffffffff8106f86d>] do_exit+0x1ad/0x920
 [<ffffffff81070038>] do_group_exit+0x58/0xd0
 [<ffffffff81086606>] get_signal_to_deliver+0x1f6/0x470
 [<ffffffff8100a335>] do_signal+0x75/0x800
 [<ffffffff8125e131>] ? cpumask_any_but+0x31/0x50
 [<ffffffff810b2ddb>] ? sys_futex+0x7b/0x170
 [<ffffffff8100ab50>] do_notify_resume+0x90/0xc0
 [<ffffffff8100b451>] int_signal+0x12/0x17
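
The "D" in that trace means rgmanager is in uninterruptible sleep inside the DLM, so it cannot be killed, not even with kill -9. A quick way to confirm, assuming standard tools:
Code:
# rgmanager (pid 3378 in the trace) is blocked in the kernel on a dlm rwsem
ps -eo pid,stat,wchan:30,cmd | grep '[r]gmanager'
kill -9 3378          # has no effect on a task stuck in D state
dlm_tool ls           # list the dlm lockspaces, if dlm_tool is installed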


I attempted all sorts of things to recover from this, but the bottom line is that rgmanager cannot be stopped, so there is no way to recover other than rebooting every single node one by one.
Turn off as many services as possible.
Then run reboot twice; running it once will not work since it will hang trying to stop rgmanager.
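
Roughly, the per-node sequence looks something like this (just a sketch; the second reboot may end up being a hard reset):
Code:
# on each node, one at a time:
# 1. stop or migrate whatever you can cleanly first (VMs, other services)
# 2. then:
reboot    # first attempt hangs while init tries to stop rgmanager
reboot    # run reboot a second time (from another shell) to actually go down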
 
And fencing is working? Any hints in /var/log/cluster/*

Fencing works; we are using APC PDUs.
In this case, since quorum is lost, I would not expect fencing to kick in.

I looked through the logs and did not see any clues.
It is very easy to reproduce: just disrupt the cluster communication between all nodes until quorum is lost, then restore the communication.
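
If you would rather not pull cables, blocking the corosync traffic on every node should have the same effect. A rough sketch, assuming the default corosync UDP ports (5404/5405) and iptables:
Code:
# drop cluster traffic until quorum is lost (watch clustat / cman_tool status)
iptables -A INPUT  -p udp --dport 5404:5405 -j DROP
iptables -A OUTPUT -p udp --dport 5404:5405 -j DROP
# ...wait for quorum to drop, then restore communication
iptables -D INPUT  -p udp --dport 5404:5405 -j DROP
iptables -D OUTPUT -p udp --dport 5404:5405 -j DROP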
 
I've come across the same issue whilst testing with a 2-node cluster; it seems if you lose the network on either node, you end up having to reboot? :S
 
Yeah... our production setup has 22 nodes. I have just been trying out 2.0 with some spare servers in the office - suppose I'll have to find an extra one.
 
Hi,
I have the same problem, the "rgmanager deadlock", if all nodes lose connection between themselves.
When will there be a patch for Debian?
Greetings
Giuseppe
 
Any progress on this thread?
I have the exact same problem.
I don't think it is expected behaviour when you have to restart your whole cluster in order to get rgmanager working again.

I'd be grateful for any hints.
 
I verified with the developers; this is 'expected' behavior.

See https://bugzilla.proxmox.com/show_bug.cgi?id=105

Dietmar, it is correct that this is the 'expected' behavior, but I think rgmanager needs an automatic recovery subsystem with a set of logical steps to achieve a successful recovery in this case (be solid as a rock).

On the other hand, I don't know if I will run into problems, but soon I will have to move all my PVE servers that are in a PVE cluster from one location to another, and obviously the PVE servers will be turned off for this move.

Any recommendations for when I turn them back on?
Notes:
- My PVE servers in the PVE cluster run versions 2.3, 3.1 and 3.2
- The VM backups go over NFS between these same servers (crossed backups)

Best regards
Cesar
 
Proxmox uses corosync, not openais.


The change in RHEL 7 is corosync + pacemaker, instead of corosync + rgmanager.

Spirit, thanks for the answer, but I have these packages on my PVE nodes:
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3

What is the purpose of each one?
 

OH yes, sorry, openais is used too. I found an explanation on the Internet:


"Clustering includes two parts: 1. cluster resource management, and 2. an infrastructure/messaging layer.
Legacy heartbeat was broken into the heartbeat messaging layer and pacemaker, so pacemaker is the CRM.
For the messaging layer there are two options, heartbeat and openais/corosync; openais/corosync is preferred, as per http://comments.gmane.org/gmane.linux.highavailability.user/32355:
There are, however, features in Pacemaker that require OpenAIS which will work only with Corosync, not Heartbeat. Those features are concerned with the distributed lock managers used by cLVM (but not regular LVM), GFS/GFS2, and OCFS2. If you need that functionality, you must select OpenAIS/Corosync. If you do not, you're free to choose.
And as per http://www.clusterlabs.org/wiki/FAQ:
Originally Corosync and OpenAIS were the same thing. Then they split into two parts... the core messaging and membership capabilities are now called Corosync, and OpenAIS retained the layer containing the implementation of the AIS standard.
Pacemaker itself only needs the Corosync piece in order to function, however some of the applications it can manage (such as OCFS2 and GFS2) require the OpenAIS layer as well.
So I went with openais/corosync and integrated it with pacemaker."
 
