Cluster broken after netsplit - rgmanager fails to stop, dlm_controld too

nmmn · Feb 10, 2015

Hello,

We have been running a 2-node cluster privately and recently moved to a 4-node cluster. The initial setup went smoothly; DRBD and the fencing devices are all fine (and working!). Last week we decided to change the VLAN for the eth1 interface on all machines simultaneously, i.e. we caused a very, very short netsplit across the whole cluster.

Corosync noticed this but recovered very quickly. The toxic duo of rgmanager and/or dlm_controld, however, seems unforgiving. rgmanager has been down on all machines since then:

Code:

clustat
Cluster Status for Proxmox @ Tue Feb 10 12:10:17 2015
Member Status: Quorate


 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 server-01                                                      1 Online
 server-02                                                      2 Online
 server-03                                                      3 Online, Local
 server-04                                                      4 Online

Stopping rgmanager via the init script fails every time and hangs; it can only be stopped with killall -9. The same goes for dlm_controld:

Code:

service cman stop
Stopping cluster: 
   Leaving fence domain... [  OK  ]
   Stopping dlm_controld... 
[FAILED]

On startup we got the following message:

Code:

dlm: rgmanager: group join failed -512 0

And

Code:

dlm_controld process_uevent online@ error -17 errno 11

We've started rgmanager in the foreground with the debugging switch on. It gives no output at all.

Rebooting the node doesn't fix this, and we're wondering why a reboot is supposed to be a solution. We run several clusters with Corosync + Pacemaker, and a reboot has never been the solution for cluster issues there.

Any hints are greatly appreciated.

nmmn · Feb 13, 2015

The problem still exists. Does anybody have any ideas?

starnetwork · Dec 29, 2015

Hi, did you find a solution?
