Hi all,
I have 3 servers joined in a cluster. Each server has 2 NICs bonded as bond0. The servers are connected to a Nortel BayStack 5510 switch, and the switch ports of each server are grouped into a MultiLink Trunk. The switch allows multicast.
Once or twice a week, one or two nodes lose quorum. It can happen at any time: during daytime work, during night backups, or on an idle weekend. I see no pattern at all in when quorum is lost.
The hardware is OK (2 Intel servers, 1 Dell PowerEdge server), the network is OK (no errors in the switch log), and there have been no power failures. Date and time are set from a local NTP server and are the same on all nodes.
When quorum is lost, the first thing I do is restart the services in this order:
Code:
service pvestatd stop
service pvedaemon stop
service cman stop
service pve-cluster stop
sleep 2
service pve-cluster start
service cman start
service pvestatd start
service pvedaemon start
Sometimes this helps and the nodes regain quorum, and the cluster works until the next quorum failure.
But usually the only thing that works is rebooting the problem nodes. During boot they get quorum and the cluster becomes ready again.
:~# pvecm status
Code:
Version: 6.2.0
Config Version: 6
Cluster Name: sdpi
Cluster Id: 1649
Cluster Member: Yes
Cluster Generation: 17280
Membership state: Cluster-Member
Nodes: 1
Expected votes: 2
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: virt3
Node ID: 3
Multicast addresses: 239.192.6.119
Node addresses: 192.168.0.213
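To check the quorum state from a script rather than by eye, the status output can be parsed. This is only a minimal sketch: the function name quorum_state is hypothetical, and it assumes the "Total votes" and "Quorum" fields appear exactly as in the output above.

```shell
#!/bin/sh
# quorum_state (hypothetical helper): reads `pvecm status` output on stdin
# and prints "quorate", "inquorate" or "unknown".
# Assumes the field names "Total votes" and "Quorum" shown in the output above.
quorum_state() {
    awk -F': *' '
        $1 == "Total votes" { total = $2 + 0 }
        $1 == "Quorum"      { needed = $2 + 0 }   # "2 Activity blocked" -> 2
        END {
            if (needed == 0)     { print "unknown";   exit 2 }
            if (total >= needed) { print "quorate";   exit 0 }
            print "inquorate"; exit 1
        }'
}

# Usage on a node:
#   pvecm status | quorum_state
```

For the output above (Total votes: 1, Quorum: 2) this prints "inquorate" and returns exit status 1, so it can be used in a loop after a service restart to see whether the node rejoined.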
There are no errors in the logs, just a sudden loss of quorum and the cluster being "re-created" with a single node:
corosync.log:
Code:
...
Nov 08 08:41:58 corosync [TOTEM ] Retransmit List: 2f713 2f715 2f716 2f717 2f718 2f719 2f6f8 2f70a 2f70b 2f6
Nov 08 08:42:08 corosync [TOTEM ] A processor failed, forming new configuration.
Nov 08 08:42:20 corosync [CLM ] CLM CONFIGURATION CHANGE
Nov 08 08:42:20 corosync [CLM ] New Configuration:
Nov 08 08:42:20 corosync [CLM ]     r(0) ip(192.168.0.213).
Nov 08 08:42:20 corosync [CLM ] Members Left:
Nov 08 08:42:20 corosync [CLM ]     r(0) ip(192.168.0.211).
Nov 08 08:42:20 corosync [CLM ]     r(0) ip(192.168.0.212).
Nov 08 08:42:20 corosync [CLM ] Members Joined:
Nov 08 08:42:20 corosync [QUORUM] Members[2]: 2 3
Nov 08 08:42:20 corosync [CMAN ] quorum lost, blocking activity
Nov 08 08:42:20 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 08 08:42:20 corosync [QUORUM] Members[1]: 3
Nov 08 08:42:20 corosync [CLM ] CLM CONFIGURATION CHANGE
Nov 08 08:42:20 corosync [CLM ] New Configuration:
Nov 08 08:42:20 corosync [CLM ]     r(0) ip(192.168.0.213).
Nov 08 08:42:20 corosync [CLM ] Members Left:
Nov 08 08:42:20 corosync [CLM ] Members Joined:
Nov 08 08:42:20 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 08:42:20 corosync [CPG ] chosen downlist: sender r(0) ip(192.168.0.213) ; members(old:3 left:2)
Nov 08 08:42:20 corosync [MAIN ] Completed service synchronization, ready to provide service.
Nov 08 09:35:24 corosync [SERV ] Unloading all Corosync service engines.
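To compare the failure times against the switch log or the backup schedule, the membership and quorum events can be filtered out of corosync.log. This is a rough sketch: quorum_events is a hypothetical helper name, the patterns are taken from the messages in the log above, and the "quorum regained" pattern is an assumption about what cman logs on recovery.

```shell
#!/bin/sh
# quorum_events (hypothetical helper): reads a corosync log on stdin and
# prints only the lines that mark a membership or quorum change.
# "quorum regained" is an assumed recovery message; the other patterns
# come from the log excerpt above.
quorum_events() {
    grep -E 'A processor failed|quorum lost|quorum regained|Members\[[0-9]+\]'
}

# Usage on a node (the log path may differ on your installation):
#   quorum_events < /var/log/cluster/corosync.log
```

Running it over the log of each node gives a short timeline per node, which makes it easier to see whether all nodes noticed the failure at the same moment.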
This thread didn't help: http://forum.proxmox.com/threads/10376-Interesting-Observations-and-solution-Cluster-issues-(quorum)
So, the questions are: if a reboot helps and a service restart does not, which additional service do I need to restart to get quorum back? Or what else can I do to regain quorum without rebooting the node? Rebooting nodes is a really bad idea...
P.S. Proxmox-2.2-26/c1614c8c.