Repeatedly losing quorum

Dmitry Panoff

New Member
Nov 7, 2012
Donetsk, DPR
Hi all,

I have 3 servers in a cluster, connected by a Nortel BayStack 5510 switch. All servers have 2 NICs, joined in a trunk on the switch. Multicast is allowed and works. The hardware (servers, switch, connections, power, etc.) is OK.
Regularly, about once a week or maybe once every two weeks, one of the cluster nodes loses quorum. Sometimes all nodes lose quorum. When this happens, the first thing I do is stop the cluster services:

Code:
service pvestatd stop
service pvedaemon stop
service cman stop
service pve-cluster stop
sleep 3

and start them again:

Code:
service pve-cluster start
service cman start
service pvestatd start
service pvedaemon start
Usually this helps: the nodes find each other, regain quorum, and everything is OK.
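
For reference, the stop/start sequence above can be wrapped in one script. This is only a minimal sketch: the service names and their order are taken from the commands above, while `DRY_RUN`, `run` and `restart_cluster_stack` are hypothetical names introduced here, not Proxmox options. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: the stop/start sequence from this post as one function.
# DRY_RUN is a hypothetical switch (not a Proxmox option). It defaults to 1,
# so the commands are printed, not executed, unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_cluster_stack() {
    # Stop in the order used in the post.
    for svc in pvestatd pvedaemon cman pve-cluster; do
        run service "$svc" stop
    done
    run sleep 3
    # Start again in the order used in the post.
    for svc in pve-cluster cman pvestatd pvedaemon; do
        run service "$svc" start
    done
}

restart_cluster_stack
```

Saved under a hypothetical name such as `restart-cluster.sh`, running `sh restart-cluster.sh` prints the nine commands; `DRY_RUN=0 sh restart-cluster.sh` (as root) actually executes them.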

But sometimes one of the nodes (not always the same one) doesn't regain quorum this way. All I can do then is completely reboot the "problem" node. During boot it gets quorum and rejoins the cluster, and everything is OK until the next quorum loss.
Quorum loss happens during nightly backups, during daytime work, and when the cluster sits idle on weekends; there is no pattern to it. There are no errors in the logs; corosync just reports a sudden loss of quorum.
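
Since the logs say so little, one option is to capture the cluster state as each node sees it at the moment quorum is lost, before restarting anything. The following is a rough sketch, assuming `pvecm` and `cman_tool` are installed (as they normally are on a cman-based Proxmox node); the function name `snapshot` is made up for this example. Tools that are not found are reported and skipped.

```shell
#!/bin/sh
# Sketch: print the cluster state as seen from this node.
# Assumes pvecm and cman_tool are on the PATH; a missing tool is skipped.
snapshot() {
    for cmd in "pvecm status" "pvecm nodes" "cman_tool status" "cman_tool nodes"; do
        echo "== $cmd =="
        tool=${cmd%% *}          # first word of the command line
        if command -v "$tool" >/dev/null 2>&1; then
            # Keep going even if the command fails (e.g. cman is not running).
            $cmd 2>&1 || echo "(exit status $?)"
        else
            echo "($tool not found, skipped)"
        fi
    done
}

snapshot
```

Running it on every node while the problem is present and comparing the outputs shows which nodes each member still considers part of the cluster.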

I tried stopping the services, waiting several minutes (approximately as long as a server reboot takes), and starting them again. It doesn't help.

So, the questions are: why doesn't a simple restart of the services work when a complete reboot does? Do I need to restart some additional services, or do something else, to get quorum back on the problem node without rebooting it? Rebooting a node is a bad idea...