Quorum timeout when adding a node to a cluster (unicast)

canelli

New Member
Jul 23, 2013
I have a two-node cluster (node names pm0 and pm1) with unicast enabled (transport="udpu"):

Code:
<?xml version="1.0"?>
<cluster name="alidays-cluster" config_version="15">

  <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu">
  </cman>

  <clusternodes>
    <clusternode name="pm1" votes="1" nodeid="1"/>
    <clusternode name="pm0" votes="1" nodeid="2"/>
  </clusternodes>

</cluster>

pveversion:
Code:
root@pm1:~# pveversion
pve-manager/2.3/7946f1f1

I set up a new host (pm2) with a fresh installation of PVE. When adding this node to the cluster I got the error "Waiting for quorum... Timed-out waiting for cluster":
Code:
root@pm2:/var/log# pvecm add pm1
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'
.
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]
waiting for quorum...

The new cluster configuration on pm1:
Code:
<?xml version="1.0"?>
<cluster name="alidays-cluster" config_version="16">

  <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu">
  </cman>

  <clusternodes>
    <clusternode name="pm1" votes="1" nodeid="1"/>
    <clusternode name="pm0" votes="1" nodeid="2"/>
    <clusternode name="pm2" votes="1" nodeid="3"/>
  </clusternodes>

</cluster>
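(As a side note, a quick way to sanity-check the edited file is something like the sketch below; xmllint is part of libxml2, ccs_config_validate is cman's own checker and may or may not be installed, and the path is the one used elsewhere in this thread; on PVE the authoritative copy may live under /etc/pve instead.)
Code:
# run on the node where the config was edited
xmllint --noout /etc/cluster/cluster.conf        # complains if the XML is not well-formed
grep config_version /etc/cluster/cluster.conf    # confirm the version was bumped (16 here)
ccs_config_validate                              # cman's schema check, if installed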

The new node seems to have been added to the cluster, but when cman starts it fails to join:
Code:
Jul 23 15:23:08 corosync [MAIN  ] Successfully configured openais services to load
Jul 23 15:23:08 corosync [TOTEM ] Initializing transport (UDP/IP Unicast).
Jul 23 15:23:08 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul 23 15:23:08 corosync [TOTEM ] The network interface [192.168.169.15] is now up.
Jul 23 15:23:08 corosync [QUORUM] Using quorum provider quorum_cman
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jul 23 15:23:08 corosync [CMAN  ] CMAN 1352871249 (built Nov 14 2012 06:34:12) started
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: openais cluster membership service B.01.01
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: openais event service B.01.01
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: openais message service B.03.01
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: openais distributed locking service B.03.01
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: openais timer service A.01.01
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: corosync configuration service
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: corosync profile loading service
Jul 23 15:23:08 corosync [QUORUM] Using quorum provider quorum_cman
Jul 23 15:23:08 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jul 23 15:23:08 corosync [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Jul 23 15:23:08 corosync [TOTEM ] adding new UDPU member {192.168.169.10}
Jul 23 15:23:08 corosync [TOTEM ] adding new UDPU member {192.168.169.5}
Jul 23 15:23:08 corosync [TOTEM ] adding new UDPU member {192.168.169.15}
Jul 23 15:23:08 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jul 23 15:23:08 corosync [CLM   ] New Configuration:
Jul 23 15:23:08 corosync [CLM   ] Members Left:
Jul 23 15:23:08 corosync [CLM   ] Members Joined:
Jul 23 15:23:08 corosync [CLM   ] CLM CONFIGURATION CHANGE
Jul 23 15:23:08 corosync [CLM   ] New Configuration:
Jul 23 15:23:08 corosync [CLM   ]       r(0) ip(192.168.169.15)
Jul 23 15:23:08 corosync [CLM   ] Members Left:
Jul 23 15:23:08 corosync [CLM   ] Members Joined:
Jul 23 15:23:08 corosync [CLM   ]       r(0) ip(192.168.169.15)
Jul 23 15:23:08 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 23 15:23:08 corosync [QUORUM] Members[1]: 3
Jul 23 15:23:08 corosync [QUORUM] Members[1]: 3
Jul 23 15:23:08 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.169.15) ; members(old:0 left:0)
Jul 23 15:23:08 corosync [MAIN  ] Completed service synchronization, ready to provide service.
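A useful sanity check at this point (just a sketch, using the standard cman/PVE tools) is to look at the membership from one of the original nodes, to see whether pm2 ever shows up there or whether it only forms its own single-member ring:
Code:
# run on pm0 or pm1 while pm2 is stuck waiting for quorum
pvecm status      # expected votes, quorum state and member count as the old nodes see them
pvecm nodes       # node list; check whether pm2 appears here at all
cman_tool nodes   # the same membership information straight from cman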
 
Hi Tom

I added the entries in /etc/hosts on all nodes (the two existing cluster members and the new host).
I rebooted the new host, but not the entire cluster (it's a production environment!)

Claudio
 
Hi Tom
I solved the error.
First of all: I couldn't reboot the two live hosts, because this is a production environment with more than 10 VMs running, so I only restarted the cluster manager on the existing cluster nodes (pm0 and pm1), using puttycs to execute the same commands on both at the same time.
Then I added the new node to the cluster and everything worked fine.

For further documentation, these are the steps I used:

First, reset the cluster to its original state:
a) remove the previously added new node from the cluster, on one of the original nodes
Code:
root@pm0:~#pvecm delnode pm2
b) stop the cluster manager on the new node
Code:
root@pm2:~#service cman stop
root@pm2:~#service pve-cluster stop
c) delete the cluster definition on the new node (pm2)
Code:
root@pm2:~#rm /etc/cluster/cluster.conf
root@pm2:~#rm -r /var/lib/pve-cluster/*
d) reboot the new node
e) restart the cluster manager on both original nodes, launching the commands at the same time with puttycs
Code:
root@pm0:~#service cman stop
root@pm0:~#service pve-cluster stop
root@pm0:~#service pve-cluster start
root@pm0:~#service cman start
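To confirm that the two remaining nodes re-formed a quorate cluster after the simultaneous restart, something like this can be run on either of them (a sketch; the exact output will differ):
Code:
# run on pm0 or pm1 once cman is back up
cman_tool status   # look at the Nodes / Expected votes / Quorum lines
pvecm status       # membership and votes from the PVE side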


At this point I have a running cluster with two nodes (pm0 and pm1) and a standalone node pm2.
Now:
a) check that on all nodes /etc/hosts has an entry for each node (the original two and the new one)
on pm0
Code:
root@pm0:~#cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.168.169.5    pm0.local           pm0  pvelocalhost
10.168.169.10   pm1.local           pm1
10.168.169.15   pm2.local           pm2
on pm2
Code:
root@pm2:~#cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.168.169.5    pm0.local           pm0  
10.168.169.10   pm1.local           pm1
10.168.169.15   pm2.local           pm2 pvelocalhost

b) if you modified /etc/hosts on the cluster nodes, restart the cluster manager on the cluster (see the previous step)
c) if you modified /etc/hosts on the new node, reboot the node
d) from the new node, join the cluster
Code:
root@pm2:~#pvecm  add  pm1
The join completed successfully.
e) to check the system, reboot the new node (pm2). Everything works fine.
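A few checks that can be run on pm2 after the reboot to confirm the join (a sketch; only the node names are specific to this thread):
Code:
# run on pm2 after the reboot
pvecm status            # pm2 should now see a quorate three-node cluster
pvecm nodes             # membership list including pm0, pm1 and pm2
ls /etc/pve/nodes/      # the shared pmxcfs mount should show one directory per node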

Claudio
 
Thanks for the feedback!
 
Hi All,

I know this is an old post, but I wanted to add a comment here, as I'd been struggling for about a week before I figured this out.

When adding a new Proxmox host to an existing cluster I kept getting "waiting for quorum". I knew multicast worked, as I had tested it with omping, but I didn't realize that using jumbo frames on the interface used to contact the cluster node somehow prevents this from working. I removed jumbo frames from the interface in question and set it back to the default MTU of 1500. After that I could join the cluster without issue.
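In case it helps, this is roughly how the MTU can be checked and put back to 1500 (a sketch; vmbr0 is just a placeholder for whatever interface carries the cluster traffic):
Code:
# check the current MTU of the cluster-facing interface
ip link show vmbr0 | grep mtu

# set it back to the default for the running system
ip link set dev vmbr0 mtu 1500

# to make it persistent, add or adjust an "mtu 1500" line for that
# interface in /etc/network/interfaces and reload networking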

Hope this helps someone.
Garith Dugmore
 
Enough people have quorum issues, corosync issues, joining issues, etc. that I think these troubleshooting steps should be documented on the wiki.

Oh, wait, I have write access to the wiki. Um. Yeah. OK, I'll start writing something up :-(.

FWIW, I just discovered that on a 1 Gbit/s network I'm unable to add a 5th cluster node when using UDP unicast. Deleting all the cluster data (per above) and then re-joining after disabling IGMP on my switch works completely.
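For anyone else debugging this kind of join failure: multicast between all nodes can be tested with omping before and after changing the switch settings (run the same command simultaneously on every node; the node names below are just the ones from this thread):
Code:
# run at the same time on every cluster node; each instance reports
# unicast and multicast loss towards the listed peers
omping pm0 pm1 pm2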