cluster not ready - no quorum?

hotwired007 · May 14, 2012

i have a 3 node cluster with shared storage.

after a HA test it all went pearshaped and now i cant do anything with the server cluster.

on all of the servers i get this: cluster not ready - no quorum?

rgmanager will not start and fence_tool ls gives me nothing...

yet on clustat i get all 3 servers online...

any ideas as to what i need to do to fix this?

hotwired007 · May 15, 2012

i have rebooted all the servers, checked the network cards and something is going wrong... after they boot up they all appear online and then they all appear ofline again - could this be that they are being blocked somehow? i have checked and multicast is working, im running out of ideas on how to fix!

hotwired007 · May 21, 2012

bump - any help?

hotwired007 · May 21, 2012

I have updated all the servers again using aptitude update and aptitude full-upgrade

I now have 2 nodes that are quorate (voyager/bellerophon) but my 3rd node (challenger) will not see the other nodes - it almost looks like it has been fenced (See syslog extract) -

*****(From Voyager):
Member Status: Quorate

Member Name ID Status
------ ---- ---- ------
voyager 1 Online, Local
bellerophon 2 Online
challenger 3 Offline

This is in the syslog:
May 21 11:11:21 voyager pmxcfs[1510]: [status] notice: cpg_send_message retry 30
May 21 11:11:22 voyager pmxcfs[1510]: [status] notice: cpg_send_message retry 40
May 21 11:11:23 voyager dlm_controld[2760]: daemon cpg_join error retrying
May 21 11:11:23 voyager fenced[2747]: daemon cpg_join error retrying
May 21 11:11:23 voyager pmxcfs[1510]: [status] notice: cpg_send_message retry 50
May 21 11:11:24 voyager pmxcfs[1510]: [status] notice: cpg_send_message retry 60

*****(From Challenger)
Member Status: Inquorate

Member Name ID Status
------ ---- ---- ------
voyager 1 Offline
bellerophon 2 Offline
challenger 3 Online, Local

dietmar · May 21, 2012

What if you restart cman on voyager

# /etc/init.d/cman restart

hotwired007 · May 21, 2012

Ity hung whilst doing the command so i rebooted the box - getting this now:

root@voyager:~# /etc/init.d/cman status
cman is not running
root@voyager:~# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... Corosync Cluster Engine is already running
[FAILED]
root@voyager:~# /etc/init.d/cman restart
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... Corosync Cluster Engine is already running
[FAILED]
root@voyager:~#

dietmar · May 21, 2012

What is the output of

# pveversion -v

hotwired007 · May 21, 2012

ive just rebooted all the nodes again and they've came back online - :S

i've just done a full upgrade on all 3 nodes this morning.

root@voyager:~# pveversion -v
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1

root@bellerophon:~# pveversion -v
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
root@bellerophon:~#

root@challenger:~# pveversion -v
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
root@challenger:~#

dietmar · May 21, 2012

hotwired007 said:
ive just rebooted all the nodes again and they've came back online - :S

So everything works now?

hotwired007 · May 21, 2012

dietmar said:
So everything works now?

So far although i'm unsure as to why it all went wrong - all i did was power off the voyager node via an IPMI command to see if the test VM would migrate over - the VM moved perfectly but when i brought the voyager node back online it couldnt see the cluster again, so irebooted all the nodes and thats where it all went wrong.

Whats the best way to test fencing?

dietmar · May 22, 2012

hotwired007 said:
Whats the best way to test fencing?

You can use the 'fence_node' command.

hotwired007 · May 23, 2012

i used the command to fence a node - fence_tool bellerophon and the node powered down and rebooted but it will not show in the web management that its online now the three nodes have lost quorum again.

although the clustat shows the three nodes online.

- scrap that they've just rejoined and everything is happy again... :S

dietmar · May 24, 2012

hotwired007 said:
i used the command to fence a node - fence_tool bellerophon and the node powered down and rebooted but it will not show in the web management that its online now the three nodes have lost quorum again.

What? The remaining 2 nodes loose quorum?

hotwired007 · May 24, 2012

i've run a few more tests and its not happened again- not sure if it was a glitch

Search

Search

cluster not ready - no quorum?

hotwired007

Member

hotwired007

Member

hotwired007

Member

hotwired007

Member

dietmar

Proxmox Staff Member

hotwired007

Member

dietmar

Proxmox Staff Member

hotwired007

Member

dietmar

Proxmox Staff Member

hotwired007

Member

dietmar

Proxmox Staff Member

hotwired007

Member

dietmar

Proxmox Staff Member

hotwired007

Member

We value your privacy