Failed to add node to new 4.0 cluster

stefws

I just created a new 4.0 test cluster from clean, freshly installed and patched boxes, with authorized keys in place and password-less root ssh access between them.
On the first node I created a new cluster with 'pvecm create clustername'; adding the second node then seemed to take forever, and eventually I interrupted it.
How do I recover my nodes from this 'Activity blocked' state?
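
For context, the sequence was essentially this:

Code:
# on the first node: create the cluster
root@n1:~# pvecm create clustername
# on the second node: join it -- this is the step that hung
root@n2:~# pvecm add n1

On n1, pvecm status now shows: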

Code:
root@n1:~# pvecm status
Quorum information
------------------
Date:             Wed Sep  2 09:49:13 2015
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          4
Quorate:          No


Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:            


Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.3.0.1 (local)

On node 2 I got this:

Code:
root@n2:~# pvecm add n1
copy corosync auth key
stopping pve-cluster service
backup old database
waiting for quorum...^C
root@n2:~# pvecm add 10.3.0.1
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
authentication key already exists
root@n2:~# ping n1
PING n1 (10.3.0.1) 56(84) bytes of data.
64 bytes from n1 (10.3.0.1): icmp_seq=1 ttl=64 time=0.171 ms
64 bytes from n1 (10.3.0.1): icmp_seq=2 ttl=64 time=0.158 ms
^C


This is found in /var/log/daemon.log:

Code:
Sep  2 09:42:22 n2 pmxcfs[1406]: [main] notice: teardown filesystem
Sep  2 09:42:24 n2 pmxcfs[1406]: [main] notice: exit proxmox configuration filesystem (0)
Sep  2 09:42:24 n2 pmxcfs[3204]: [quorum] crit: quorum_initialize failed: 2
Sep  2 09:42:24 n2 pmxcfs[3204]: [quorum] crit: can't initialize service
Sep  2 09:42:24 n2 pmxcfs[3204]: [confdb] crit: cmap_initialize failed: 2
Sep  2 09:42:24 n2 pmxcfs[3204]: [confdb] crit: can't initialize service
Sep  2 09:42:24 n2 pmxcfs[3204]: [dcdb] crit: cpg_initialize failed: 2
Sep  2 09:42:24 n2 pmxcfs[3204]: [dcdb] crit: can't initialize service
Sep  2 09:42:24 n2 pmxcfs[3204]: [status] crit: cpg_initialize failed: 2
Sep  2 09:42:24 n2 pmxcfs[3204]: [status] crit: can't initialize service
Sep  2 09:42:24 n2 pve-ha-crm[1800]: ipcc_send_rec failed: Transport endpoint is not connected
Sep  2 09:42:24 n2 pve-ha-crm[1800]: ipcc_send_rec failed: Connection refused
Sep  2 09:42:24 n2 pve-ha-crm[1800]: ipcc_send_rec failed: Connection refused
Sep  2 09:42:24 n2 pve-ha-lrm[1810]: ipcc_send_rec failed: Transport endpoint is not connected
Sep  2 09:42:24 n2 pve-ha-lrm[1810]: ipcc_send_rec failed: Connection refused
Sep  2 09:42:24 n2 pve-ha-lrm[1810]: ipcc_send_rec failed: Connection refused
Sep  2 09:42:25 n2 corosync[3220]:  [MAIN  ] Corosync Cluster Engine ('2.3.4.22-8252'): started and ready to provide service.
Sep  2 09:42:25 n2 corosync[3220]:  [MAIN  ] Corosync built-in features: augeas systemd pie relro bindnow
Sep  2 09:42:25 n2 corosync[3222]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
Sep  2 09:42:25 n2 corosync[3222]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Sep  2 09:42:25 n2 corosync[3222]:  [TOTEM ] The network interface [10.3.0.2] is now up.
Sep  2 09:42:25 n2 corosync[3222]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep  2 09:42:25 n2 corosync[3222]:  [QB    ] server name: cmap
Sep  2 09:42:25 n2 corosync[3222]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Sep  2 09:42:25 n2 corosync[3222]:  [QB    ] server name: cfg
Sep  2 09:42:25 n2 corosync[3222]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep  2 09:42:25 n2 corosync[3222]:  [QB    ] server name: cpg
Sep  2 09:42:25 n2 corosync[3222]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep  2 09:42:25 n2 corosync[3222]:  [QUORUM] Using quorum provider corosync_votequorum
Sep  2 09:42:25 n2 corosync[3222]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep  2 09:42:25 n2 corosync[3222]:  [QB    ] server name: votequorum
Sep  2 09:42:25 n2 corosync[3222]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep  2 09:42:25 n2 corosync[3222]:  [QB    ] server name: quorum
Sep  2 09:42:25 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:4) was formed. Members joined: 2
Sep  2 09:42:25 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:42:25 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep  2 09:42:25 n2 corosync[3214]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Sep  2 09:42:26 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:8) was formed. Members
Sep  2 09:42:26 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:42:26 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep  2 09:42:28 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:12) was formed. Members
Sep  2 09:42:28 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:42:28 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep  2 09:42:29 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:16) was formed. Members
Sep  2 09:42:29 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:42:29 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep  2 09:42:30 n2 pmxcfs[3204]: [status] notice: update cluster info (cluster name  test-pmx, version = 2)
Sep  2 09:42:31 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:20) was formed. Members
Sep  2 09:42:31 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:42:31 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep  2 09:42:31 n2 pmxcfs[3204]: [dcdb] notice: members: 2/3204
Sep  2 09:42:31 n2 pmxcfs[3204]: [dcdb] notice: all data is up to date
Sep  2 09:42:31 n2 pmxcfs[3204]: [status] notice: members: 2/3204
Sep  2 09:42:31 n2 pmxcfs[3204]: [status] notice: all data is up to date
Sep  2 09:42:32 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:24) was formed. Members
Sep  2 09:42:32 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:42:32 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep  2 09:42:34 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:28) was formed. Members
Sep  2 09:42:34 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:42:34 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep  2 09:42:35 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:32) was formed. Members
Sep  2 09:42:35 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:42:35 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep  2 09:42:36 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:36) was formed. Members
Sep  2 09:42:36 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:42:36 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.
...
Sep  2 09:46:46 n2 corosync[3222]:  [TOTEM ] A new membership (10.3.0.2:732) was formed. Members
Sep  2 09:46:46 n2 corosync[3222]:  [QUORUM] Members[1]: 2
Sep  2 09:46:46 n2 corosync[3222]:  [MAIN  ] Completed service synchronization, ready to provide service.



root@n2:~# grep -c 'Members\[1\]: 2' /var/log/daemon.log
183

After a reboot, n2 says:

Code:
from daemon.log:
...
Sep  2 10:27:31 n2 networking[1159]: done.
Sep  2 10:27:31 n2 pmxcfs[1416]: [quorum] crit: quorum_initialize failed: 2
Sep  2 10:27:31 n2 pmxcfs[1416]: [quorum] crit: can't initialize service
Sep  2 10:27:31 n2 pmxcfs[1416]: [confdb] crit: cmap_initialize failed: 2
Sep  2 10:27:31 n2 pmxcfs[1416]: [confdb] crit: can't initialize service
Sep  2 10:27:31 n2 pmxcfs[1416]: [dcdb] crit: cpg_initialize failed: 2
Sep  2 10:27:31 n2 pmxcfs[1416]: [dcdb] crit: can't initialize service
Sep  2 10:27:31 n2 pmxcfs[1416]: [status] crit: cpg_initialize failed: 2
Sep  2 10:27:31 n2 pmxcfs[1416]: [status] crit: can't initialize service
...


root@n2:~# pvecm status
Quorum information
------------------
Date:             Wed Sep  2 10:28:39 2015
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          892
Quorate:          No


Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:            


Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.3.0.2 (local)


Any hints appreciated, TIA
 
Okay, I did this on n1:

Code:
pvecm expected 1
pvecm delnode n2
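# 'pvecm expected 1' lowers the expected-votes count so this lone node
# regains quorum; 'pvecm delnode' then removes the stale n2 entry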
root@n1:~# pvecm status
Quorum information
------------------
Date:             Wed Sep  2 11:21:28 2015
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          4
Quorate:          Yes


Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1  
Flags:            Quorate 


Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.3.0.1 (local)

Will try to reinstall n2 and start over...
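
(As an aside: instead of a full reinstall, the PVE docs describe resetting a node's cluster state locally; a sketch assuming PVE 4.x paths, to be run only on the node being removed:)

Code:
# stop the cluster stack on the node being reset
systemctl stop pve-cluster corosync
# start pmxcfs in local mode so /etc/pve is writable without quorum
pmxcfs -l
# remove the corosync configuration
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*
# stop the local-mode pmxcfs and restart the service standalone
killall pmxcfs
systemctl start pve-cluster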
 
Nope, no luck :/

What might be the issue when adding a node to a cluster?

Each node has, besides a public IP, an address on a private VLAN (10.3.0.0/24) which I intended to use for cluster communication. Shouldn't this work?
 
Is your sshd listening on this private VLAN? What's the point of a private VLAN anyhow? It seems like you'd probably want an actual LAN to keep the traffic from congesting the main NIC.
 
I initially had several issues when adding 4.0 cluster members. I found the following to help in my situation.

- Make sure name resolution works on all hosts for all hosts
- Make sure time is synchronized on all hosts
- Disable IGMP snooping on your switch(es) if enabled

Due to PVE 4's reliance on multicast, IGMP snooping tripped me up several times before I disabled it altogether on the switches my hosts were connected to (a way to verify multicast end-to-end is sketched below).

Of course, YMMV.
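
(The verification referenced above: the PVE docs suggest omping, run simultaneously on all nodes; a minimal sketch, assuming omping is installed and the hosts are n1 and n2:)

Code:
# quick burst test; each node should report ~0% loss
omping -c 10000 -i 0.001 -F -q n1 n2
# ~10 minute test; also catches IGMP querier/snooping timeouts
omping -c 600 -i 1 -q n1 n2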
 
Yes, sshd is listening and I can hop password-less via ssh between all nodes on this VLAN. The purpose is to keep cluster communication local within the cluster, i.e. off the public WAN, and to keep it from being congested by public traffic.
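
(If the goal is to pin cluster traffic to this private VLAN, pvecm can be told explicitly which network corosync should bind to; a minimal sketch, assuming the PVE 4.x -bindnet0_addr/-ring0_addr options and the addresses from this thread:)

Code:
# on n1: create the cluster bound to the private VLAN
pvecm create clustername -bindnet0_addr 10.3.0.1 -ring0_addr 10.3.0.1
# on n2: join via n1's private address, announcing n2's own ring address
pvecm add 10.3.0.1 -ring0_addr 10.3.0.2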
 
It seems to be an issue with multicast for corosync through my switches. Playing around with the switch config, I managed to get it partly working.

- multicast of course needs to be routed onto the VLAN used; usually this is the default route/gw.
- secondly, my switches need to allow multicast packets to reach receivers, which, as I understand it, they didn't in trunk mode.

I'll fool around more to figure out how to configure the LACP ports in my switches and ensure multicast works...

While it was partly working, I saw a multicast group like this:

Code:
JunOS# show igmp-snooping membership
VLAN: default
239.192.107.125*
Interfaces: ae2.0, ae1.0, ae0.0

I'll try disabling snooping, but I'm not sure whether my uplinks from the router supply an IGMP querier, or whether I'll need to make the Juniper switches take over this job or use a static IGMP group...
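
(If no querier is present upstream, one option is the Linux kernel's built-in IGMP querier on the bridge carrying the cluster VLAN; a sketch, where the bridge name vmbr1 is an assumption -- adjust to your setup:)

Code:
# enable the bridge's own IGMP querier (bridge name is an assumption)
echo 1 > /sys/class/net/vmbr1/bridge/multicast_querier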
 
I made sure that the time is synchronized, and I'm seeing packets from both sides, so there is no IGMP snooping problem... but it still does not work:

Code:
16:20:20.453224 IP 10.xxx.xxx.148.5404 > 239.192.192.141.5405: UDP, length 136
16:20:20.453494 IP 10.xxx.xxx.149.5404 > 239.192.192.141.5405: UDP, length 136
16:20:20.813544 IP 10.xxx.xxx.148.5404 > 239.192.192.141.5405: UDP, length 136
16:20:20.813950 IP 10.xxx.xxx.149.5404 > 239.192.192.141.5405: UDP, length 136
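
(Packets on the wire don't prove corosync itself accepted them; standard corosync tooling shows the membership corosync has actually formed:)

Code:
# list the members corosync has accepted into the ring
corosync-cmapctl | grep members
# quorum summary as corosync sees it
corosync-quorumtool -s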

Here is the join:

Code:
root@proxmox02:~# pvecm add 10.xxx.xxx.148
The authenticity of host '10.xxx.xxx.148 (10.10.193.148)' can't be established.
ECDSA key fingerprint is 9f:15:99:ff:d2:c7:be:da:40:43:72:2f:40:20:69:30.
Are you sure you want to continue connecting (yes/no)? yes
root@10.xxx.xxx.148's password: 
copy corosync auth key
stopping pve-cluster service
backup old database
waiting for quorum...

Here is the output of corosync on the first node:

Code:
root@proxmox01:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: active (running) since Die 2015-10-20 15:38:17 CEST; 1h 8min ago
 Main PID: 1114 (corosync)
   CGroup: /system.slice/corosync.service
           └─1114 corosync

Okt 20 15:38:16 proxmox01 corosync[1114]: [QB    ] server name: votequorum
Okt 20 15:38:16 proxmox01 corosync[1114]: [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Okt 20 15:38:16 proxmox01 corosync[1114]: [QB    ] server name: quorum
Okt 20 15:38:16 proxmox01 corosync[1114]: [TOTEM ] A new membership (10.xxx.xxx.148:4) was formed. Members joined: 1
Okt 20 15:38:16 proxmox01 corosync[1114]: [QUORUM] Members[1]: 1
Okt 20 15:38:16 proxmox01 corosync[1114]: [MAIN  ] Completed service synchronization, ready to provide service.
Okt 20 15:38:17 proxmox01 corosync[1107]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Okt 20 15:38:59 proxmox01 corosync[1114]: [CFG   ] Config reload requested by node 1
Okt 20 15:38:59 proxmox01 corosync[1114]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Okt 20 15:38:59 proxmox01 corosync[1114]: [QUORUM] Members[1]: 1

And here is the node I want to join:

Code:
root@proxmox02:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: active (running) since Die 2015-10-20 15:39:01 CEST; 1h 9min ago
 Main PID: 1118 (corosync)
   CGroup: /system.slice/corosync.service
           └─1118 corosync

Okt 20 15:39:01 proxmox02 corosync[1118]: [SERV  ] Service engine loaded: corosync profile loading service [4]
Okt 20 15:39:01 proxmox02 corosync[1118]: [QUORUM] Using quorum provider corosync_votequorum
Okt 20 15:39:01 proxmox02 corosync[1118]: [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Okt 20 15:39:01 proxmox02 corosync[1118]: [QB    ] server name: votequorum
Okt 20 15:39:01 proxmox02 corosync[1118]: [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Okt 20 15:39:01 proxmox02 corosync[1118]: [QB    ] server name: quorum
Okt 20 15:39:01 proxmox02 corosync[1118]: [TOTEM ] A new membership (10.xxx.xxx.149:4) was formed. Members joined: 2
Okt 20 15:39:01 proxmox02 corosync[1118]: [QUORUM] Members[1]: 2
Okt 20 15:39:01 proxmox02 corosync[1118]: [MAIN  ] Completed service synchronization, ready to provide service.
Okt 20 15:39:01 proxmox02 corosync[1109]: Starting Corosync Cluster Engine (corosync): [  OK  ]
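
(Each node only ever sees itself in the membership (Members[1]), the classic sign that multicast is not actually passing between the hosts. Where the switch fabric can't be fixed, corosync's documented fallback is unicast UDP; a sketch of the totem section in /etc/pve/corosync.conf, offered as the standard workaround rather than something verified in this thread -- remember to also increment config_version:)

Code:
totem {
  ...
  # fall back to unicast UDP where multicast cannot be made to work
  transport: udpu
}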
 
