Hi all,
I have a 3-node cluster connected through a Nortel BayStack 5510 switch. Each server has 2 NICs, joined in a trunk on the switch. Multicast is allowed and works. The hardware (servers, switch, cabling, power, etc.) is OK.
Regularly, about once a week or once every two weeks, one of the cluster nodes loses quorum. Sometimes all nodes lose quorum. When this happens, the first thing I do is stop the cluster services:
Code:
service pvestatd stop
service pvedaemon stop
service cman stop
service pve-cluster stop
sleep 3
and start them again:
Code:
service pve-cluster start
service cman start
service pvestatd start
service pvedaemon start
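For reference, the two sequences above could be wrapped in one small script so that every attempt runs the same steps in the same order. This is only a sketch: the service names and their order are taken from the commands above, while the `run` helper, the `restart_cluster_stack` function, and the `DRY_RUN` variable are names made up for this example. It ends with `pvecm status` to show membership and votes after the restart.

```shell
#!/bin/sh
# Sketch: restart the cluster stack in the order listed in the post.
# DRY_RUN, run and restart_cluster_stack are hypothetical names for this example.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_cluster_stack() {
    # Stop order from the post: PVE daemons first, then cman, then pve-cluster.
    for svc in pvestatd pvedaemon cman pve-cluster; do
        run service "$svc" stop
    done
    run sleep 3
    # Start order from the post: pve-cluster and cman first, then the PVE daemons.
    for svc in pve-cluster cman pvestatd pvedaemon; do
        run service "$svc" start
    done
    # Print cluster membership and vote counts after the restart.
    run pvecm status
}
```

Calling `restart_cluster_stack` with `DRY_RUN=1` set only prints the commands, which is a safe way to check the order before running it on a live node.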
Usually this helps: the nodes find each other, regain quorum, and everything works again.
But sometimes one node (not always the same one) does not regain quorum this way. The only thing that works then is a full reboot of the "problem" node. During boot it gets quorum and rejoins the cluster, and everything is fine until the next quorum loss.
The quorum loss happens during nightly backups, during the working day, and while the cluster sits idle on weekends; there is no pattern to it. There are no errors in the logs; corosync only reports the sudden loss of quorum.
I tried stopping the services, waiting several minutes (roughly as long as a server reboot takes), and starting them again. It doesn't help.
So, the questions are: why does a simple restart of the services not work when a complete reboot does? Do I need to restart some additional services, or do something else, to regain quorum on the problem node without rebooting it? Rebooting a node is a bad idea...