[SOLVED] New cluster: Waiting for quorum...

hakim

Well-Known Member
Oct 4, 2010
Hi,

I installed 3 new Proxmox v5.1 servers, and I would like to set them up as a cluster.

on serv1 (10.1.0.1):
Code:
pvecm create MYCLUSTER

on serv2 (10.1.0.2):
Code:
pvecm add 10.1.0.2
=> I entered the root password, got the usual progress messages, and then it got stuck on:
waiting for quorum...

From serv1 I can ping serv2 and name resolution works, and vice-versa (average 0.200 ms).
I am using an OVH "vRack" to connect my servers, and I checked with OVH: it supports multicast.

After rebooting serv2, I can no longer access the serv2 UI.

From serv1 :
Code:
# pvecm status
Quorum information
------------------
Date:             Fri Dec 29 18:06:55 2017
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/4
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.1.0.1 (local)

From serv2 :
Code:
# pvecm status
Quorum information
------------------
Date:             Fri Dec 29 18:07:16 2017
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2/5804
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.1.0.2 (local)
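
(Side note, in case someone else lands in this state: with quorum lost, /etc/pve goes read-only and the GUI partly breaks. As a temporary workaround only, and not a fix for the join problem itself, the expected vote count can be lowered on the stuck node so it becomes quorate again:)
Code:
# temporary workaround only: tell votequorum to expect a single vote on this node
pvecm expected 1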


In syslog (on serv2):


When I try to join the cluster:
Code:
systemd[1]: Stopping The Proxmox VE cluster filesystem...
pmxcfs[7856]: [main] notice: exit proxmox configuration filesystem (0)
systemd[1]: Stopped The Proxmox VE cluster filesystem.
systemd[1]: Starting The Proxmox VE cluster filesystem...
pmxcfs[13905]: [quorum] crit: quorum_initialize failed: 2
pmxcfs[13905]: [quorum] crit: can't initialize service
pmxcfs[13905]: [confdb] crit: cmap_initialize failed: 2
pmxcfs[13905]: [confdb] crit: can't initialize service
pmxcfs[13905]: [dcdb] crit: cpg_initialize failed: 2
pmxcfs[13905]: [dcdb] crit: can't initialize service
pmxcfs[13905]: [status] crit: cpg_initialize failed: 2
pmxcfs[13905]: [status] crit: can't initialize service
pveproxy[8289]: ipcc_send_rec[1] failed: Transport endpoint is not connected
pveproxy[8289]: ipcc_send_rec[2] failed: Connection refused
pveproxy[8289]: ipcc_send_rec[3] failed: Connection refused
systemd[1]: Started The Proxmox VE cluster filesystem.
systemd[1]: Starting Corosync Cluster Engine...
corosync[13923]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
corosync[13923]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
corosync[13923]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
corosync[13923]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
corosync[13923]: notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
corosync[13923]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
corosync[13923]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
corosync[13923]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
corosync[13923]: notice  [TOTEM ] The network interface [10.1.0.2] is now up.
corosync[13923]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
corosync[13923]: info    [QB    ] server name: cmap
corosync[13923]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
corosync[13923]: info    [QB    ] server name: cfg
corosync[13923]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
corosync[13923]: info    [QB    ] server name: cpg
corosync[13923]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
corosync[13923]:  [TOTEM ] The network interface [10.1.0.2] is now up.
corosync[13923]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
corosync[13923]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
corosync[13923]: warning [WD    ] resource load_15min missing a recovery key.
corosync[13923]: warning [WD    ] resource memory_used missing a recovery key.
corosync[13923]: info    [WD    ] no resources configured.
corosync[13923]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
corosync[13923]: notice  [QUORUM] Using quorum provider corosync_votequorum
corosync[13923]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
corosync[13923]: info    [QB    ] server name: votequorum
corosync[13923]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
corosync[13923]: info    [QB    ] server name: quorum
corosync[13923]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
systemd[1]: Started Corosync Cluster Engine.
corosync[13923]: notice  [TOTEM ] A new membership (10.1.0.2:4) was formed. Members joined: 2
corosync[13923]: notice  [QUORUM] Members[1]: 2
corosync[13923]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[13923]:  [QB    ] server name: cmap
corosync[13923]:  [SERV  ] Service engine loaded: corosync configuration service [1]
corosync[13923]:  [QB    ] server name: cfg
corosync[13923]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
corosync[13923]:  [QB    ] server name: cpg
corosync[13923]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
corosync[13923]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
corosync[13923]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
corosync[13923]:  [WD    ] resource load_15min missing a recovery key.
corosync[13923]:  [WD    ] resource memory_used missing a recovery key.
corosync[13923]:  [WD    ] no resources configured.
corosync[13923]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
corosync[13923]:  [QUORUM] Using quorum provider corosync_votequorum
corosync[13923]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
corosync[13923]:  [QB    ] server name: votequorum
corosync[13923]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
corosync[13923]:  [QB    ] server name: quorum
corosync[13923]:  [TOTEM ] A new membership (10.1.0.2:4) was formed. Members joined: 2
corosync[13923]:  [QUORUM] Members[1]: 2
corosync[13923]:  [MAIN  ] Completed service synchronization, ready to provide service.
pvestatd[8085]: ipcc_send_rec[1] failed: Transport endpoint is not connected
pve-ha-crm[8225]: ipcc_send_rec[1] failed: Transport endpoint is not connected
pve-ha-lrm[8304]: ipcc_send_rec[1] failed: Transport endpoint is not connected
corosync[13923]: notice  [TOTEM ] A new membership (10.1.0.2:8) was formed. Members
corosync[13923]: notice  [QUORUM] Members[1]: 2
corosync[13923]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[13923]:  [TOTEM ] A new membership (10.1.0.2:8) was formed. Members
corosync[13923]:  [QUORUM] Members[1]: 2
corosync[13923]:  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[13923]: notice  [TOTEM ] A new membership (10.1.0.2:12) was formed. Members
corosync[13923]: notice  [QUORUM] Members[1]: 2
corosync[13923]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[13923]:  [TOTEM ] A new membership (10.1.0.2:12) was formed. Members
corosync[13923]:  [QUORUM] Members[1]: 2
corosync[13923]:  [MAIN  ] Completed service synchronization, ready to provide service.
pmxcfs[13905]: [status] notice: update cluster info (cluster name  IANDI, version = 2)
corosync[13923]: notice  [TOTEM ] A new membership (10.1.0.2:16) was formed. Members
corosync[13923]: notice  [QUORUM] Members[1]: 2
corosync[13923]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[13923]:  [TOTEM ] A new membership (10.1.0.2:16) was formed. Members
corosync[13923]:  [QUORUM] Members[1]: 2
corosync[13923]:  [MAIN  ] Completed service synchronization, ready to provide service.
pmxcfs[13905]: [dcdb] notice: members: 2/13905
pmxcfs[13905]: [dcdb] notice: all data is up to date
pmxcfs[13905]: [status] notice: members: 2/13905
pmxcfs[13905]: [status] notice: all data is up to date

And after I reboot:
Code:
systemd[1]: Started LSB: Ceph RBD Mapping.
systemd[1]: Starting The Proxmox VE cluster filesystem...
systemd[1]: Starting LXC Container Initialization and Autoboot Code...
pmxcfs[7529]: [quorum] crit: quorum_initialize failed: 2
pmxcfs[7529]: [quorum] crit: can't initialize service
pmxcfs[7529]: [confdb] crit: cmap_initialize failed: 2
pmxcfs[7529]: [confdb] crit: can't initialize service
pmxcfs[7529]: [dcdb] crit: cpg_initialize failed: 2
pmxcfs[7529]: [dcdb] crit: can't initialize service
pmxcfs[7529]: [status] crit: cpg_initialize failed: 2
pmxcfs[7529]: [status] crit: can't initialize service
systemd[1]: Started The Proxmox VE cluster filesystem.
systemd[1]: Starting Proxmox VE firewall...
systemd[1]: Starting Corosync Cluster Engine...
systemd[1]: Starting PVE Status Daemon...
corosync[7689]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
corosync[7689]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
corosync[7689]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
corosync[7689]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
corosync[7689]: notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
corosync[7689]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
corosync[7689]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
corosync[7689]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
corosync[7689]: notice  [TOTEM ] The network interface [10.1.0.2] is now up.
corosync[7689]:  [TOTEM ] The network interface [10.1.0.2] is now up.
corosync[7689]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
corosync[7689]: info    [QB    ] server name: cmap
corosync[7689]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
corosync[7689]: info    [QB    ] server name: cfg
corosync[7689]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
corosync[7689]: info    [QB    ] server name: cpg
corosync[7689]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
corosync[7689]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
corosync[7689]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
corosync[7689]: warning [WD    ] resource load_15min missing a recovery key.
corosync[7689]: warning [WD    ] resource memory_used missing a recovery key.
corosync[7689]: info    [WD    ] no resources configured.
corosync[7689]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
corosync[7689]: notice  [QUORUM] Using quorum provider corosync_votequorum
corosync[7689]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
systemd[1]: Started Corosync Cluster Engine.
corosync[7689]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
corosync[7689]: info    [QB    ] server name: votequorum
corosync[7689]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
corosync[7689]: info    [QB    ] server name: quorum
corosync[7689]: notice  [TOTEM ] A new membership (10.1.0.2:3860) was formed. Members joined: 2
corosync[7689]: notice  [QUORUM] Members[1]: 2
corosync[7689]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[7689]:  [QB    ] server name: cmap
corosync[7689]:  [SERV  ] Service engine loaded: corosync configuration service [1]
corosync[7689]:  [QB    ] server name: cfg
corosync[7689]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
corosync[7689]:  [QB    ] server name: cpg
corosync[7689]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
corosync[7689]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
corosync[7689]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
corosync[7689]:  [WD    ] resource load_15min missing a recovery key.
corosync[7689]:  [WD    ] resource memory_used missing a recovery key.
corosync[7689]:  [WD    ] no resources configured.
corosync[7689]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
corosync[7689]:  [QUORUM] Using quorum provider corosync_votequorum
corosync[7689]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
corosync[7689]:  [QB    ] server name: votequorum
corosync[7689]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
corosync[7689]:  [QB    ] server name: quorum
corosync[7689]:  [TOTEM ] A new membership (10.1.0.2:3860) was formed. Members joined: 2
corosync[7689]:  [QUORUM] Members[1]: 2
corosync[7689]:  [MAIN  ] Completed service synchronization, ready to provide service.
systemd[1]: Starting PVE API Daemon...
pve-firewall[7782]: starting server
systemd[1]: Started Proxmox VE firewall.
pvestatd[7798]: starting server
systemd[1]: Started PVE Status Daemon.
pvedaemon[7859]: starting server
pvedaemon[7859]: starting 3 worker(s)
pvedaemon[7859]: worker 7860 started
pvedaemon[7859]: worker 7861 started
pvedaemon[7859]: worker 7863 started
systemd[1]: Started PVE API Daemon.
systemd[1]: Starting PVE Cluster Ressource Manager Daemon...
systemd[1]: Starting PVE API Proxy Server...
pve-ha-crm[7891]: starting server
pve-ha-crm[7891]: status change startup => wait_for_quorum
systemd[1]: Started PVE Cluster Ressource Manager Daemon.
systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
pveproxy[7952]: starting server
pveproxy[7952]: starting 3 worker(s)
pveproxy[7952]: worker 7955 started
pveproxy[7952]: worker 7956 started
pveproxy[7952]: worker 7957 started

pveproxy[7956]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
pveproxy[7955]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
pveproxy[7957]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.

systemd[1]: Started PVE API Proxy Server.
systemd[1]: Starting PVE SPICE Proxy Server...
pve-ha-lrm[8048]: starting server
pve-ha-lrm[8048]: status change startup => wait_for_agent_lock
systemd[1]: Started PVE Local HA Ressource Manager Daemon.
spiceproxy[8055]: starting server
spiceproxy[8055]: starting 1 worker(s)
spiceproxy[8055]: worker 8059 started
systemd[1]: Started PVE SPICE Proxy Server.
systemd[1]: Starting PVE guests...

corosync[7689]: notice  [TOTEM ] A new membership (10.1.0.2:3864) was formed. Members
corosync[7689]:  [TOTEM ] A new membership (10.1.0.2:3864) was formed. Members
corosync[7689]: notice  [QUORUM] Members[1]: 2
corosync[7689]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[7689]:  [QUORUM] Members[1]: 2
corosync[7689]:  [MAIN  ] Completed service synchronization, ready to provide service.

pve-guests[8060]: <root@pam> starting task UPID:serv2:00001F92:00001749:5A46738E:startall::root@pam:

corosync[7689]: notice  [TOTEM ] A new membership (10.1.0.2:3868) was formed. Members
corosync[7689]: notice  [QUORUM] Members[1]: 2
corosync[7689]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[7689]:  [TOTEM ] A new membership (10.1.0.2:3868) was formed. Members
corosync[7689]:  [QUORUM] Members[1]: 2
corosync[7689]:  [MAIN  ] Completed service synchronization, ready to provide service.
 
I tried the command, but it does not work ("command not found").
Any clue on how to install it? :-/
 
Sorry, I looked around (and did not find anything) but forgot to simply: apt-get install omping
 
Hi,

I installed 3 new Proxmox v5.1 servers, and I would like to set them up as a cluster.

on serv1 (10.1.0.1):
Code:
pvecm create MYCLUSTER

on serv2 (10.1.0.2):
Code:
pvecm add 10.1.0.2

You used the wrong IP, you need to add the node to the existing cluster on 10.1.0.1.

> pvecm add 10.1.0.1
 
Hi,

Thanks for your reply. In fact it was a typo in my post:
I did run pvecm add 10.1.0.1, and I get the "waiting for quorum" error.

After testing with omping, I have a multicast problem that I am trying to troubleshoot.
Code:
  unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.048/0.148/0.359/0.044
multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
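
(For reference, the figures above come from a multicast test along the lines of what the Proxmox docs suggest; the exact invocation here is an assumption:)
Code:
# run at the same time on both nodes; serv1 and serv2 are the cluster hostnames
omping -c 10000 -i 0.001 -F -q serv1 serv2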

When I do a tcpdump, I get the same kind of result on both sides:
Code:
09:35:32.052348 IP serv1.5404 > 239.192.8.169.5405: UDP, length 136
09:35:32.308365 IP serv2.5404 > 239.192.8.169.5405: UDP, length 136
09:35:32.412496 IP serv1.5404 > 239.192.8.169.5405: UDP, length 136
09:35:32.668493 IP serv2.5404 > 239.192.8.169.5405: UDP, length 136
09:35:32.772609 IP serv1.5404 > 239.192.8.169.5405: UDP, length 136
09:35:33.028626 IP serv2.5404 > 239.192.8.169.5405: UDP, length 136

I am wondering if these are the packets that are being multicast, and if so, whether the problem could be a firewall issue rather than a switch issue.
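
(The capture above was taken with something along these lines; vmbr1 is assumed to be the vRack bridge:)
Code:
# watch corosync traffic (UDP ports 5404/5405) on the cluster bridge
tcpdump -n -i vmbr1 udp portrange 5404-5405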

Hakim
 
Well, it was a problem with the server firewall...
After using:
Code:
iptables -A INPUT -i vmbr1 -s 10.1.0.0/24 -m pkttype --pkt-type multicast -j ACCEPT

Multicast was working...
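
(Note that an iptables rule added like this is not persistent across reboots and needs to be present on every node; one option, just as a sketch, is to save the current rules with iptables-persistent:)
Code:
# persist the currently loaded rules (Debian stretch / PVE 5.x)
apt-get install iptables-persistent
netfilter-persistent save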

But... I had a weird problem with my servers: they could not reboot (either from the console or from the GUI), even though the server and the GUI were still alive...

Too many problems and too much weirdness; I don't feel safe continuing with clusters, so I give up.

All the best for New Year's Eve
 
For anyone it may help:

It appears that (probably for an obvious reason) when two nodes have problems communicating, the nodes cannot be rebooted normally (a hardware reboot is needed).

The communication problem was that the nodes not only communicate on TCP 22 and UDP 5404/5405 (which is described in the doc), they also need to communicate on TCP 8006 (which took some hours of testing to find out).
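
Put together, a minimal rule set for the vRack interface would look something like this (just a sketch, assuming vmbr1 and the 10.1.0.0/24 cluster network as above; adapt to your own firewall setup):
Code:
# allow intra-cluster traffic on the vRack interface (sketch)
iptables -A INPUT -i vmbr1 -s 10.1.0.0/24 -p tcp --dport 22        -j ACCEPT  # SSH (used by pvecm add)
iptables -A INPUT -i vmbr1 -s 10.1.0.0/24 -p udp --dport 5404:5405 -j ACCEPT  # corosync
iptables -A INPUT -i vmbr1 -s 10.1.0.0/24 -p tcp --dport 8006      -j ACCEPT  # Proxmox web GUI / API
iptables -A INPUT -i vmbr1 -s 10.1.0.0/24 -m pkttype --pkt-type multicast -j ACCEPT  # corosync multicast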

Hakim
 
