cluster not ready - no quorum?

hotwired007

I have a 3-node cluster with shared storage.

After an HA test it all went pear-shaped and now I can't do anything with the cluster.

On all of the servers I get this: cluster not ready - no quorum?

rgmanager will not start and fence_tool ls gives me nothing...

Yet clustat shows all 3 servers online...

Any ideas as to what I need to do to fix this?
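
For reference, these are the commands I'm using to check the cluster state on each node (the standard PVE 2.x / redhat-cluster tools, so adjust if your setup differs):

cman_tool status      # quorum flag and vote counts as cman sees them
clustat               # rgmanager's view of the members
fence_tool ls         # fence domain members (currently empty for me)
pvecm status          # Proxmox's own view of the cluster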
 
I have rebooted all the servers and checked the network cards, and something is still going wrong... after they boot up they all appear online and then they all appear offline again - could it be that they are being blocked somehow? I have checked and multicast is working. I'm running out of ideas on how to fix this!
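
In case it helps, this is roughly how I checked multicast - omping needs to be installed on all three nodes and the same command run on each node at the same time (node names are mine, adjust for your cluster):

omping -c 600 -i 1 -q voyager bellerophon challenger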
 
I have updated all the servers again using aptitude update and aptitude full-upgrade.

I now have 2 nodes that are quorate (voyager/bellerophon), but my 3rd node (challenger) will not see the other nodes - it almost looks like it has been fenced (see syslog extract) -

*****(From Voyager):
Member Status: Quorate

Member Name          ID   Status
------ ----          ---- ------
voyager              1    Online, Local
bellerophon          2    Online
challenger           3    Offline

This is in the syslog:
May 21 11:11:21 voyager pmxcfs[1510]: [status] notice: cpg_send_message retry 30
May 21 11:11:22 voyager pmxcfs[1510]: [status] notice: cpg_send_message retry 40
May 21 11:11:23 voyager dlm_controld[2760]: daemon cpg_join error retrying
May 21 11:11:23 voyager fenced[2747]: daemon cpg_join error retrying
May 21 11:11:23 voyager pmxcfs[1510]: [status] notice: cpg_send_message retry 50
May 21 11:11:24 voyager pmxcfs[1510]: [status] notice: cpg_send_message retry 60

*****(From Challenger)
Member Status: Inquorate

Member Name          ID   Status
------ ----          ---- ------
voyager              1    Offline
bellerophon          2    Offline
challenger           3    Online, Local
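
Since the two sides of the cluster disagree, I've also been comparing what corosync itself reports on voyager and challenger (standard corosync 1.4 tools, so treat this as a sketch):

corosync-cfgtool -s              # ring status / faults on the totem ring
corosync-objctl | grep -i mcast  # multicast address corosync is actually using
cman_tool nodes                  # which nodes this box currently thinks are members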
 
It hung whilst doing the command, so I rebooted the box - getting this now:

root@voyager:~# /etc/init.d/cman status
cman is not running
root@voyager:~# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... Corosync Cluster Engine is already running
[FAILED]
root@voyager:~# /etc/init.d/cman restart
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... Corosync Cluster Engine is already running
[FAILED]
root@voyager:~#
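
From the "Corosync Cluster Engine is already running" message it looks like a corosync process was left behind by the earlier failed start, so the cman init script can't bring up its own. What I'm going to try next time instead of a full reboot (a guess at the right order, so take it as a sketch):

/etc/init.d/rgmanager stop
/etc/init.d/cman stop
pgrep corosync && killall corosync   # clear any corosync left over from the failed start
/etc/init.d/cman start
/etc/init.d/rgmanager start
/etc/init.d/pve-cluster restart      # so pmxcfs reconnects; possibly not needed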
 
I've just rebooted all the nodes again and they've come back online - :S

I've just done a full upgrade on all 3 nodes this morning.

root@voyager:~# pveversion -v
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1

root@bellerophon:~# pveversion -v
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
root@bellerophon:~#

root@challenger:~# pveversion -v
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
root@challenger:~#
 
So everything works now?

So far, although I'm unsure as to why it all went wrong - all I did was power off the voyager node via an IPMI command to see if the test VM would migrate over. The VM moved perfectly, but when I brought the voyager node back online it couldn't see the cluster again, so I rebooted all the nodes and that's where it all went wrong.

What's the best way to test fencing?
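
For reference, this is what I understand the manual way to be with redhat-cluster - happy to be corrected. Run from a node that still has quorum; fence_node is the cluster-level command, and fence_ipmilan below is only because my boxes have IPMI - the agent, address and login are placeholders, check cluster.conf for what your cluster actually uses:

fence_node challenger                                        # ask the cluster to fence a specific node
fence_ipmilan -a 192.168.0.50 -l ADMIN -p secret -o status   # talk to the fence device directly, bypassing the cluster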
 
I used the command to fence a node - fence_tool bellerophon - and the node powered down and rebooted, but the web management will not show it as online, and now the three nodes have lost quorum again.

Although clustat shows the three nodes online.

- scrap that, they've just rejoined and everything is happy again... :S
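
If the web GUI and clustat disagree again I'll try restarting the Proxmox status services before resorting to reboots - as far as I understand it, the GUI gets its node state from pmxcfs/pvestatd rather than from clustat, so something like this might be enough (untested on my side):

/etc/init.d/pve-cluster restart
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart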
 
hotwired007 said:
I used the command to fence a node - fence_tool bellerophon - and the node powered down and rebooted, but the web management will not show it as online, and now the three nodes have lost quorum again.

What? The remaining 2 nodes lose quorum?