Cluster broken after netsplit - rgmanager fails to stop, dlm_controld too

nmmn

Feb 10, 2015
Hello,

We have a 2-node cluster running privately and recently set up a 4-node cluster. The initial setup went smoothly; DRBD and the fencing devices are all fine (and working!). Last week we decided to change the VLAN for the eth1 interface on all machines simultaneously, i.e. we caused a very, very short netsplit across the whole cluster.

Corosync noticed this but recovered very quickly. The toxic duo of rgmanager and/or dlm_controld, however, seems unforgiving. rgmanager has been down on all machines since then:

Code:
clustat
Cluster Status for Proxmox @ Tue Feb 10 12:10:17 2015
Member Status: Quorate


 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 server-01                                                      1 Online
 server-02                                                      2 Online
 server-03                                                      3 Online, Local
 server-04                                                      4 Online

Stopping rgmanager via the init script fails every time and hangs; it can only be stopped with killall -9. The same goes for dlm_controld:

Code:
service cman stop
Stopping cluster: 
   Leaving fence domain... [  OK  ]
   Stopping dlm_controld... 
[FAILED]

On startup we got the following message:

Code:
dlm: rgmanager: group join failed -512 0

And

Code:
dlm_controld process_uevent online@ error -17 errno 11
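
The numbers in those two messages look like negated Linux error codes. Here is a minimal sketch for decoding them, assuming a Linux host. The helper name `decode` and the `KERNEL_INTERNAL` table are made up for illustration; 512 is not in Python's errno module because ERESTARTSYS is a kernel-internal code, so it is added by hand:

```python
import errno

# Kernel-internal codes that are not meant to reach userspace and are therefore
# missing from Python's errno module (ERESTARTSYS is defined as 512 in the
# kernel's include/linux/errno.h). This table is an illustrative assumption,
# not part of any library.
KERNEL_INTERNAL = {512: "ERESTARTSYS"}


def decode(code):
    """Return the symbolic name for a (possibly negated) error number."""
    n = abs(code)
    return errno.errorcode.get(n) or KERNEL_INTERNAL.get(n, "unknown")


# The three values seen in the dlm and dlm_controld messages above.
for value in (-512, -17, 11):
    print(value, decode(value))
```

On Linux this gives ERESTARTSYS, EEXIST and EAGAIN. That would suggest the group join was interrupted and that something (possibly the lockspace) is considered to exist already, but that reading is a guess and not confirmed.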

We've started rgmanager in the foreground with the debugging switch on. It gives no output at all.

Rebooting a node doesn't fix this, and we're wondering why a reboot is supposed to be a solution in the first place. We run several clusters with Corosync + Pacemaker, and a reboot has never been a solution for cluster issues there.

Any hints are greatly appreciated.