Timed-out waiting for cluster

decibel83 · Mar 13, 2015

Hi.
I'm trying to create a 4-node cluster with Proxmox 3.4.
I successfully created the new cluster on node1 with "pvecm create", but I cannot add the additional nodes to it:

Code:

root@node2:~# /etc/init.d/pve-cluster start
Starting pve cluster filesystem : pve-cluster.
root@node2:~# pvecm add 192.168.60.1
node node2 already defined
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-cluster.
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]
waiting for quorum...

All nodes are connected through a 3Com Baseline Switch 2916-SFP Plus smart-managed switch.

All nodes have every node listed in /etc/hosts and can ping each other:

Code:

root@node1:~# cat /etc/hosts
127.0.0.1    localhost
192.168.60.1    node1.mydomain    node1    pvelocalhost
192.168.60.2    node2.mydomain    node2
192.168.60.3    node3.mydomain    node3
192.168.60.4    node4.mydomain    node4

Could you help me please?
Thank you very much!
Bye

alax · Mar 13, 2015

Getting stuck at that point usually indicates a multicast issue. Check this page: http://pve.proxmox.com/wiki/Multicast_notes#Testing_multicast and ensure that multicast is working.

decibel83 · Mar 13, 2015

Ok, it seems I'm having some multicast problems:

Code:

root@node2:~# asmping 239.192.53.58 192.168.60.1
asmping joined (S,G) = (*,239.192.53.234)
pinging 192.168.60.1 from 192.168.60.2
  unicast from 192.168.60.1, seq=1 dist=0 time=0.214 ms
  unicast from 192.168.60.1, seq=2 dist=0 time=0.184 ms
  unicast from 192.168.60.1, seq=3 dist=0 time=0.180 ms
  unicast from 192.168.60.1, seq=4 dist=0 time=0.168 ms
  unicast from 192.168.60.1, seq=5 dist=0 time=0.177 ms
  unicast from 192.168.60.1, seq=6 dist=0 time=0.185 ms
  unicast from 192.168.60.1, seq=7 dist=0 time=0.177 ms
  unicast from 192.168.60.1, seq=8 dist=0 time=0.179 ms
  unicast from 192.168.60.1, seq=9 dist=0 time=0.173 ms
  unicast from 192.168.60.1, seq=10 dist=0 time=0.186 ms
  unicast from 192.168.60.1, seq=11 dist=0 time=0.145 ms
  unicast from 192.168.60.1, seq=12 dist=0 time=0.171 ms
  unicast from 192.168.60.1, seq=13 dist=0 time=0.175 ms
  unicast from 192.168.60.1, seq=14 dist=0 time=0.148 ms
  unicast from 192.168.60.1, seq=15 dist=0 time=0.176 ms
  unicast from 192.168.60.1, seq=16 dist=0 time=0.175 ms
  unicast from 192.168.60.1, seq=17 dist=0 time=0.174 ms
  unicast from 192.168.60.1, seq=18 dist=0 time=0.177 ms
  unicast from 192.168.60.1, seq=19 dist=0 time=0.176 ms
  unicast from 192.168.60.1, seq=20 dist=0 time=0.174 ms
  unicast from 192.168.60.1, seq=21 dist=0 time=0.181 ms
  unicast from 192.168.60.1, seq=22 dist=0 time=0.170 ms
  unicast from 192.168.60.1, seq=23 dist=0 time=0.175 ms
  unicast from 192.168.60.1, seq=24 dist=0 time=0.177 ms
^C
--- 192.168.60.1 statistics ---
24 packets transmitted, time 23488 ms
unicast:
   24 packets received, 0% packet loss
   rtt min/avg/max/std-dev = 0.145/0.175/0.214/0.020 ms
multicast:
   0 packets received, 100% packet loss

decibel83 · Mar 13, 2015

I checked the switch datasheet, and it seems to support multicast.
Do you have any other ideas?

mir · Mar 13, 2015

IGMP snooping needs to be enabled on the vlan you use for cluster communication. By default your switch has IGMP snooping disabled on all vlans. How to enable -> http://www.manualslib.com/manual/232373/3com-2924-Sfp.html?page=124

decibel83 · Mar 14, 2015

I've already checked that: IGMP Snooping Status is enabled, both globally and on the VLAN I use for the cluster (VLAN10).
I also tried to recreate the cluster from scratch, but I still cannot get quorum.
Could you help me?

mir · Mar 14, 2015

Try this: http://pve.proxmox.com/wiki/Multicast_notes#Linux:_Enabling_Multicast_querier_on_bridges

decibel83 · Mar 14, 2015

I enabled that, and now I managed to add all nodes to the cluster:

Code:

root@node3:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M    788   2015-03-14 11:51:34  node2
   2   M    788   2015-03-14 11:51:34  node1
   3   M    784   2015-03-14 11:51:34  node3
   4   M    792   2015-03-14 11:52:06  node4

But some nodes still have problems: on node3, for example, I cannot use the cluster because of "Broken pipe (596)" and "Error: cluster not ready - no quorum?" errors, while on node4 everything seems to work fine.

mir · Mar 14, 2015

You have to add the option to all nodes.

decibel83 · Mar 14, 2015

mir said:
You have to add the option to all nodes.

Yes, I added the option to all nodes (sorry I omitted this in my last post).

decibel83 · Mar 14, 2015

Now my cluster is completely dead:

Code:

root@node1:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X    752                        node2
   2   M    436   2015-03-14 11:51:03  node1
   3   X    788                        node3
   4   X    792                        node4


root@node2:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M    748   2015-03-14 11:50:38  node2
   2   X    752                        node1
   3   X    788                        node3
   4   X    792                        node4


root@node3:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        node2
   2   X      0                        node1
   3   M    932   2015-03-14 12:03:40  node3
   4   X      0                        node4


root@node4:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        node2
   2   X      0                        node1
   3   X      0                        node3
   4   M    940   2015-03-14 12:05:28  node4
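To compare the four views quickly, the non-member rows can be pulled out of each table. A minimal sketch, assuming the column layout shown above and that `M` in the Sts column marks a node currently seen as a member while `X` marks one that is not; the function name `offline_nodes` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read `pvecm nodes` output on stdin and print the names
# of nodes whose Sts column is not "M".
# Assumes the layout shown above: Node, Sts, Inc, [Joined], Name.
offline_nodes() {
    # Data rows start with a numeric node ID; the header row does not.
    # The node name is always the last field, with or without a Joined date.
    awk '$1 ~ /^[0-9]+$/ && $2 != "M" { print $NF }'
}
```

Run as `pvecm nodes | offline_nodes` on each node. Applied to the node1 table above it would list node2, node3 and node4.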

mir · Mar 14, 2015

After adding the option you need to reboot all nodes.

decibel83 · Mar 14, 2015

mir said:
After adding the option you need to reboot all nodes.

While testing, I added the option directly from the command line:

Code:

echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier

And it was activated:

Code:

root@node4:~# cat /sys/devices/virtual/net/vmbr0/bridge/multicast_querier

1

I have also rebooted all nodes (after adding the post-up option to the interfaces file), and the situation is still what I described in my post #11.
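Because the querier flag has to survive a reboot on every node, it can help to read it back for all bridges at once rather than one file at a time. A minimal sketch, assuming the sysfs path used above; the function name `querier_status` and its optional root argument are illustrative only:

```shell
#!/bin/sh
# Hypothetical helper: print "<bridge> <flag>" for every Linux bridge, where
# flag is the content of its multicast_querier file (1 = on, 0 = off).
# $1 is the sysfs root; it defaults to /sys and is a parameter only so the
# function can be tried against a scratch directory.
querier_status() {
    root=${1:-/sys}
    for f in "$root"/devices/virtual/net/*/bridge/multicast_querier; do
        [ -r "$f" ] || continue            # no bridges: the glob stays literal
        bridge=${f#"$root"/devices/virtual/net/}
        bridge=${bridge%%/*}               # keep only the interface name
        printf '%s %s\n' "$bridge" "$(cat "$f")"
    done
}
```

On a node where the option is active, `querier_status` should print a line such as `vmbr0 1`.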

mir · Mar 14, 2015

Does the multicast test succeed? (omping, ssmping)

mir · Mar 14, 2015

The manual for your switch reveals that this particular model does not include an IGMP querier, which is why you need to enable the querier on the Linux bridge.

decibel83 · Mar 14, 2015

mir said:
Does the multicast test succeed? (omping, ssmping)

My multicast address is 239.192.53.58:

Code:

root@node1:~# pvecm status|grep "Multicast addresses"
Multicast addresses: 239.192.53.58

My nodes are node1, node2, node3 and node4. All of them are listed in /etc/hosts on every node and sit on the 192.168.60.0/24 network, which is on vmbr0 bridged to bond0.10.

Multicast test:

Code:

root@node1:~# omping -m 239.192.53.58 node1 node2 node3 node4
node2 : waiting for response msg
node3 : waiting for response msg
node4 : waiting for response msg
node3 : joined (S,G) = (*, 239.192.53.58), pinging
node2 : joined (S,G) = (*, 239.192.53.58), pinging
node4 : joined (S,G) = (*, 239.192.53.58), pinging
node3 :   unicast, seq=1, size=69 bytes, dist=0, time=0.084ms
node3 : multicast, seq=1, size=69 bytes, dist=0, time=0.106ms
node4 :   unicast, seq=1, size=69 bytes, dist=0, time=0.099ms
node4 : multicast, seq=1, size=69 bytes, dist=0, time=0.108ms
node2 :   unicast, seq=1, size=69 bytes, dist=0, time=0.116ms
node2 :   unicast, seq=2, size=69 bytes, dist=0, time=0.183ms
node4 : multicast, seq=2, size=69 bytes, dist=0, time=0.184ms
node4 :   unicast, seq=2, size=69 bytes, dist=0, time=0.178ms
node3 : multicast, seq=2, size=69 bytes, dist=0, time=0.213ms
node3 :   unicast, seq=2, size=69 bytes, dist=0, time=0.201ms
node3 :   unicast, seq=3, size=69 bytes, dist=0, time=0.167ms
node3 : multicast, seq=3, size=69 bytes, dist=0, time=0.180ms
node2 :   unicast, seq=3, size=69 bytes, dist=0, time=0.192ms
node4 : multicast, seq=3, size=69 bytes, dist=0, time=0.182ms
node4 :   unicast, seq=3, size=69 bytes, dist=0, time=0.179ms
node2 :   unicast, seq=4, size=69 bytes, dist=0, time=0.191ms
node3 :   unicast, seq=4, size=69 bytes, dist=0, time=0.181ms
node3 : multicast, seq=4, size=69 bytes, dist=0, time=0.186ms
node4 : waiting for response msg
node2 :   unicast, seq=5, size=69 bytes, dist=0, time=0.185ms
node4 : server told us to stop
node3 : waiting for response msg
node3 : server told us to stop
^C
node2 :   unicast, xmt/rcv/%loss = 5/5/0%, min/avg/max/std-dev = 0.116/0.173/0.192/0.032
node2 : multicast, xmt/rcv/%loss = 5/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node3 :   unicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.084/0.158/0.201/0.051
node3 : multicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.106/0.171/0.213/0.046
node4 :   unicast, xmt/rcv/%loss = 3/3/0%, min/avg/max/std-dev = 0.099/0.152/0.179/0.046
node4 : multicast, xmt/rcv/%loss = 3/3/0%, min/avg/max/std-dev = 0.108/0.158/0.184/0.043


root@node2:~# omping -m 239.192.53.58 node1 node2 node3 node4
node1 : waiting for response msg
node3 : waiting for response msg
node4 : waiting for response msg
node3 : joined (S,G) = (*, 239.192.53.58), pinging
node4 : joined (S,G) = (*, 239.192.53.58), pinging
node3 :   unicast, seq=1, size=69 bytes, dist=0, time=0.096ms
node4 :   unicast, seq=1, size=69 bytes, dist=0, time=0.090ms
node4 : multicast, seq=1, size=69 bytes, dist=0, time=0.099ms
node3 : multicast, seq=1, size=69 bytes, dist=0, time=0.118ms
node3 :   unicast, seq=2, size=69 bytes, dist=0, time=0.193ms
node3 : multicast, seq=2, size=69 bytes, dist=0, time=0.201ms
node4 :   unicast, seq=2, size=69 bytes, dist=0, time=0.177ms
node4 : multicast, seq=2, size=69 bytes, dist=0, time=0.185ms
node1 : waiting for response msg
node1 : joined (S,G) = (*, 239.192.53.58), pinging
node3 :   unicast, seq=3, size=69 bytes, dist=0, time=0.180ms
node3 : multicast, seq=3, size=69 bytes, dist=0, time=0.187ms
node4 :   unicast, seq=3, size=69 bytes, dist=0, time=0.180ms
node4 : multicast, seq=3, size=69 bytes, dist=0, time=0.188ms
node1 :   unicast, seq=1, size=69 bytes, dist=0, time=0.099ms
node1 : multicast, seq=1, size=69 bytes, dist=0, time=0.126ms
node1 :   unicast, seq=2, size=69 bytes, dist=0, time=0.184ms
node1 : multicast, seq=2, size=69 bytes, dist=0, time=0.206ms
node3 :   unicast, seq=4, size=69 bytes, dist=0, time=0.177ms
node3 : multicast, seq=4, size=69 bytes, dist=0, time=0.193ms
node4 :   unicast, seq=4, size=69 bytes, dist=0, time=0.193ms
node4 : multicast, seq=4, size=69 bytes, dist=0, time=0.196ms
node1 :   unicast, seq=3, size=69 bytes, dist=0, time=0.191ms
node1 : multicast, seq=3, size=69 bytes, dist=0, time=0.208ms
node3 : multicast, seq=5, size=69 bytes, dist=0, time=0.195ms
node3 :   unicast, seq=5, size=69 bytes, dist=0, time=0.192ms
node4 : waiting for response msg
node1 :   unicast, seq=4, size=69 bytes, dist=0, time=0.207ms
node1 : multicast, seq=4, size=69 bytes, dist=0, time=0.234ms
node4 : server told us to stop
node3 : multicast, seq=6, size=69 bytes, dist=0, time=0.210ms
node3 :   unicast, seq=6, size=69 bytes, dist=0, time=0.208ms
^C
node1 :   unicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.099/0.170/0.207/0.048
node1 : multicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.126/0.194/0.234/0.047
node3 :   unicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.096/0.174/0.208/0.040
node3 : multicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.118/0.184/0.210/0.033
node4 :   unicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.090/0.160/0.193/0.047
node4 : multicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.099/0.167/0.196/0.046


root@node3:~# omping -m 239.192.53.58 node1 node2 node3 node4
node1 : waiting for response msg
node2 : waiting for response msg
node4 : waiting for response msg
node4 : joined (S,G) = (*, 239.192.53.58), pinging
node4 :   unicast, seq=1, size=69 bytes, dist=0, time=0.086ms
node4 : multicast, seq=1, size=69 bytes, dist=0, time=0.095ms
node4 :   unicast, seq=2, size=69 bytes, dist=0, time=0.176ms
node4 : multicast, seq=2, size=69 bytes, dist=0, time=0.184ms
node1 : waiting for response msg
node2 : waiting for response msg
node1 : joined (S,G) = (*, 239.192.53.58), pinging
node2 : joined (S,G) = (*, 239.192.53.58), pinging
node4 :   unicast, seq=3, size=69 bytes, dist=0, time=0.185ms
node4 : multicast, seq=3, size=69 bytes, dist=0, time=0.194ms
node1 :   unicast, seq=1, size=69 bytes, dist=0, time=0.080ms
node1 : multicast, seq=1, size=69 bytes, dist=0, time=0.088ms
node2 :   unicast, seq=1, size=69 bytes, dist=0, time=0.106ms
node2 :   unicast, seq=2, size=69 bytes, dist=0, time=0.173ms
node1 : multicast, seq=2, size=69 bytes, dist=0, time=0.215ms
node1 :   unicast, seq=2, size=69 bytes, dist=0, time=0.210ms
node4 : multicast, seq=4, size=69 bytes, dist=0, time=0.197ms
node4 :   unicast, seq=4, size=69 bytes, dist=0, time=0.193ms
node1 :   unicast, seq=3, size=69 bytes, dist=0, time=0.199ms
node1 : multicast, seq=3, size=69 bytes, dist=0, time=0.218ms
node2 :   unicast, seq=3, size=69 bytes, dist=0, time=0.191ms
node4 : waiting for response msg
node1 :   unicast, seq=4, size=69 bytes, dist=0, time=0.188ms
node1 : multicast, seq=4, size=69 bytes, dist=0, time=0.203ms
node2 :   unicast, seq=4, size=69 bytes, dist=0, time=0.191ms
node4 : server told us to stop
^C
node1 :   unicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.080/0.169/0.210/0.060
node1 : multicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.088/0.181/0.218/0.062
node2 :   unicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.106/0.165/0.191/0.040
node2 : multicast, xmt/rcv/%loss = 4/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node4 :   unicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.086/0.160/0.193/0.050
node4 : multicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.095/0.168/0.197/0.049

root@node4:~# omping -m 239.192.53.58 node1 node2 node3 node4
node1 : waiting for response msg
node2 : waiting for response msg
node3 : waiting for response msg
node1 : waiting for response msg
node2 : waiting for response msg
node3 : waiting for response msg
node2 : joined (S,G) = (*, 239.192.53.58), pinging
node3 : joined (S,G) = (*, 239.192.53.58), pinging
node2 :   unicast, seq=1, size=69 bytes, dist=0, time=0.097ms
node3 :   unicast, seq=1, size=69 bytes, dist=0, time=0.099ms
node3 : multicast, seq=1, size=69 bytes, dist=0, time=0.104ms
node2 :   unicast, seq=2, size=69 bytes, dist=0, time=0.196ms
node3 : multicast, seq=2, size=69 bytes, dist=0, time=0.200ms
node3 :   unicast, seq=2, size=69 bytes, dist=0, time=0.198ms
node1 : waiting for response msg
node1 : joined (S,G) = (*, 239.192.53.58), pinging
node3 :   unicast, seq=3, size=69 bytes, dist=0, time=0.181ms
node3 : multicast, seq=3, size=69 bytes, dist=0, time=0.185ms
node2 :   unicast, seq=3, size=69 bytes, dist=0, time=0.196ms
node1 :   unicast, seq=1, size=69 bytes, dist=0, time=0.092ms
node1 : multicast, seq=1, size=69 bytes, dist=0, time=0.098ms
^C
node1 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.092/0.092/0.092/0.000
node1 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.098/0.098/0.098/0.000
node2 :   unicast, xmt/rcv/%loss = 3/3/0%, min/avg/max/std-dev = 0.097/0.163/0.196/0.057
node2 : multicast, xmt/rcv/%loss = 3/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node3 :   unicast, xmt/rcv/%loss = 3/3/0%, min/avg/max/std-dev = 0.099/0.159/0.198/0.053
node3 : multicast, xmt/rcv/%loss = 3/3/0%, min/avg/max/std-dev = 0.104/0.163/0.200/0.052
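The per-peer summary lines at the end of each omping run are the part that singles out a node. A minimal sketch, assuming the summary format shown in the output above, that prints only the peers whose multicast packets were lost; the function name `multicast_losers` and the log file name in the usage note are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read omping output on stdin and print each peer whose
# multicast summary line shows fewer packets received than transmitted.
# Assumes summary lines look like:
#   node2 : multicast, xmt/rcv/%loss = 5/0/100%, min/avg/max/std-dev = ...
multicast_losers() {
    awk '$3 == "multicast," && $4 == "xmt/rcv/%loss" {
        split($6, n, "/")                 # n[1] = transmitted, n[2] = received
        if (n[2] + 0 < n[1] + 0)
            printf "%s sent=%d received=%d\n", $1, n[1], n[2]
    }'
}
```

With the output of a run saved to a file, `multicast_losers < omping-node1.log` would print `node2 sent=5 received=0` for the node1 run above and nothing for the node2 run.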

mir · Mar 14, 2015

Your problem node is node 2. You could try to remove node 2 from the cluster and see if quorum gets established. If it does I would reinstall node 2 and add it to the cluster once again.

decibel83 · Mar 16, 2015

mir said:
Your problem node is node 2. You could try to remove node 2 from the cluster and see if quorum gets established. If it does I would reinstall node 2 and add it to the cluster once again.

I cannot remove node 2 from the cluster because the whole cluster has no quorum:

Code:

root@node1:~# pvecm delnode node2
cluster not ready - no quorum?

I cannot understand why multicast seems to work but the PVE cluster does not.

I will now reinitialise the whole cluster and recreate it from scratch.
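The "no quorum" refusal is consistent with simple majority voting. Assuming the default of one vote per node, a 4-node cluster needs 3 members that can see each other, while each node in the tables from post #11 sees only itself. A minimal sketch of that arithmetic; the function names are illustrative and not part of any Proxmox tool:

```shell
#!/bin/sh
# Votes needed for quorum under a simple majority rule, assuming one vote
# per node: floor(n / 2) + 1.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Exit status 0 when the visible members reach quorum, non-zero otherwise.
# $1 = total nodes in the cluster, $2 = nodes currently seen as members.
has_quorum() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}
```

For this cluster, `quorum_needed 4` gives 3, so `has_quorum 4 1` fails while `has_quorum 4 3` succeeds.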

decibel83 · Mar 16, 2015

OK, I reinitialised the whole cluster, created a new cluster, added the other three nodes to it, and now it seems to work:

Code:

root@node1:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M   1100   2015-03-16 09:18:01  node1
   2   M  48772   2015-03-16 09:18:49  node2
   3   M  48776   2015-03-16 09:19:17  node3
   4   M  48780   2015-03-16 09:19:44  node4

Now let's wait and see whether it stays that way.

decibel83 · Mar 16, 2015

OK, now the cluster is dead again:

Code:

root@node1:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M   1100   2015-03-16 09:18:01  node1
   2   X  48772                        node2
   3   X  48776                        node3
   4   X  48780                        node4

I cannot understand why...
