Hello,
we have been running a 2-node cluster privately and recently moved to a 4-node cluster. The initial setup went smoothly; DRBD and the fencing devices are all fine (and working!). Last week we decided to change the VLAN for the eth1 interface on all machines simultaneously, i.e. we caused a very, very short netsplit across the whole cluster.
Corosync noticed this but recovered very quickly. The toxic duo of rgmanager and/or dlm_controld, however, seems unforgiving. rgmanager has been off on all machines since then:
Code:
clustat
Cluster Status for Proxmox @ Tue Feb 10 12:10:17 2015
Member Status: Quorate
Member Name    ID    Status
------ ----    ----  ------
server-01      1     Online
server-02      2     Online
server-03      3     Online, Local
server-04      4     Online
Stopping rgmanager via the init script fails every time and hangs; it can only be stopped with killall -9. The same goes for dlm_controld:
Code:
service cman stop
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping dlm_controld...
[FAILED]
On startup we got the following message:
Code:
dlm: rgmanager: group join failed -512 0
And
Code:
dlm_controld process_uevent online@ error -17 errno 11
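In case it helps with reading the two messages above, here is a small sketch that decodes the numeric codes. It assumes they are plain Linux error numbers, which is our guess and not something the daemons' output confirms. The table `LINUX_ERRNO` and the helper `decode` are made up for this post; the values are the usual Linux ones, and 512 (ERESTARTSYS) is a kernel-internal code that normally never reaches userspace.

```python
# Assumption: the codes in the log lines are Linux errno values.
# ERESTARTSYS (512) is kernel-internal and not part of the userspace errno list.
LINUX_ERRNO = {
    11: ("EAGAIN", "resource temporarily unavailable, try again"),
    17: ("EEXIST", "object already exists"),
    512: ("ERESTARTSYS", "kernel-internal: call interrupted by a signal"),
}


def decode(code: int) -> str:
    """Return a readable description for a (possibly negated) error code."""
    name, meaning = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({meaning})"


# The three numbers that appear in the messages above.
for code in (-512, -17, 11):
    print(decode(code))
```

If that reading is right, the join of the rgmanager lockspace was interrupted (-512), and dlm_controld was told about a lockspace that already existed (-17) and asked to retry (11). We cannot say whether that is how the daemons actually use these numbers.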
We've started rgmanager in the foreground with the debugging switch on. It gives no output at all.
Rebooting the nodes doesn't fix this, and we're wondering why a reboot is supposed to be a solution. We run several clusters with Corosync + Pacemaker, and a reboot has never been a solution for cluster issues.
Any hints are greatly appreciated.