pve-4.2 add node fails

mir
I have upgraded an empty node using apt-get dist-upgrade according to https://forum.proxmox.com/threads/howto-proxmox-3-4-4-2-upgrade-with-qemu-live-migration.27348/
Following the advice for adding further nodes failed utterly, so I have moved all VM configurations to the first node, leaving no VMs on the other node. Rebooting the node after removing the cluster and corosync configuration brings it up again in a working state, but of course without connection to the other node. To ensure multicast operation I have run all multicast tests from the troubleshooting wiki successfully.
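For reference, a sketch of the kind of reset sequence meant above (assumed steps, based on the generally documented way to clear cluster state on a PVE 4.x node; adapt before running):

Code:
# assumed reset sequence to bring a node back to a standalone, working state
systemctl stop pve-cluster corosync    # stop the cluster filesystem and corosync
pmxcfs -l                              # start pmxcfs in local mode so /etc/pve is writable
rm /etc/pve/corosync.conf              # remove the cluster-wide corosync config
rm -rf /etc/corosync/*                 # remove the local corosync config and authkey
killall pmxcfs                         # stop the local-mode pmxcfs again
systemctl start pve-cluster            # the node comes up standalone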

When I try to add a node to the newly created cluster on the first node, it fails every time, leaving the node in a broken state; as a result the first node is also broken, since it loses quorum. Below is the result of adding a node. This result is replicated each time I try it from a healthy state.

root@esx2:~# pvecm add 10.0.0.1 -force
copy corosync auth key
stopping pve-cluster service
backup old database
waiting for quorum...OK
generating node certificates
unable to create directory '/etc/pve/priv' - Device or resource busy
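A hedged way to narrow down the 'Device or resource busy' error is to check whether the pmxcfs fuse mount on /etc/pve is present and writable (without quorum it goes read-only, and a stale or half-initialized mount can show up as errors like this):

Code:
# diagnostic sketch for the /etc/pve/priv error (assumed checks)
findmnt /etc/pve                 # is the pmxcfs fuse mount present?
systemctl status pve-cluster     # is pmxcfs (pve-cluster) actually running?
pvecm status                     # without quorum /etc/pve is read-only
touch /etc/pve/.writetest && rm /etc/pve/.writetest   # hypothetical file name, quick write check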
 
from: https://pve.proxmox.com/wiki/Multicast_notes#Using_omping
  • note: to find the multicast address on Proxmox 4.x, run this:
corosync-cmapctl -g totem.interface.0.mcastaddr
Is this still valid?
I don't see this key, but quorum is achieved.
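Since the exact cmap key name can vary between corosync versions, a way to check (sketch, assuming corosync-cmapctl is installed) is to dump the whole cmap and grep:

Code:
# dump all cmap keys and look for the totem / multicast settings
corosync-cmapctl | grep -i totem
corosync-cmapctl | grep -i mcast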
pvecm s
Quorum information
------------------
Date: Mon May 16 14:43:59 2016
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 928
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.0.1 (local)
0x00000002 1 10.0.0.2
From syslog, proving multicast is working:
May 16 14:18:42 esx1 corosync[13943]: [TOTEM ] A new membership (10.0.0.1:928) was formed. Members joined: 2
May 16 14:18:42 esx1 corosync[13943]: [QUORUM] This node is within the primary component and will provide service.
May 16 14:18:42 esx1 corosync[13943]: [QUORUM] Members[2]: 1 2
May 16 14:18:42 esx1 corosync[13943]: [MAIN ] Completed service synchronization, ready to provide service.
May 16 14:18:42 esx1 pmxcfs[13918]: [status] notice: node has quorum
May 16 14:22:05 esx1 sshd[6260]: Accepted publickey for root from 10.0.0.2 port 60252 ssh2: RSA 95:6f:51:4b:83:1d:a4:20:ee:41:e1:51:33:c9:b6:2c

Starting pve-cluster fails, producing these logs in syslog:
pve-ha-lrm[2611]: ipcc_send_rec failed: Connection refused

How to fix?
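'ipcc_send_rec failed: Connection refused' means clients such as pve-ha-lrm cannot reach pmxcfs at all. A hedged recovery sketch while debugging (not a permanent fix):

Code:
# assumed sequence: get /etc/pve back locally and look at why the normal start fails
systemctl stop pve-cluster
pmxcfs -l                        # local mode: mounts /etc/pve without cluster sync
journalctl -u pve-cluster -b     # inspect the failed start of the pve-cluster unit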
 
It seems the problem can be narrowed down to initializing pmxcfs after joining the cluster. Copying the config.db file manually does not help, so I think the problem is that the encrypted communication in corosync fails. The failing join seems to be related to the key exchange between the cluster nodes, so synchronization fails, which causes the newly added node to refuse to initialize the pmxcfs filesystem in /etc/pve.
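If the key-exchange theory is right, the corosync authkey should differ (or be missing) on the joining node; with secauth on it must be byte-identical everywhere. A hedged check:

Code:
# run on both esx1 and esx2 and compare the output
sha256sum /etc/corosync/authkey      # checksums must match on all nodes
ls -l /etc/corosync/authkey          # typically root-owned, mode 0400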
 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: esx2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: esx2-corosync
  }

  node {
    name: esx1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: esx1-corosync
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: midgaard
  config_version: 10
  ip_version: ipv4
  secauth: on
  version: 2

  interface {
    bindnetaddr: 172.16.3.8
    ringnumber: 0
  }
}
# cat /etc/hosts
127.0.0.1 localhost
172.16.3.8 esx1-corosync.datanom.net esx1-corosync
172.16.3.9 esx2-corosync.datanom.net esx2-corosync

10.0.0.1 esx1.datanom.net esx1 pvelocalhost
10.0.0.2 esx2.datanom.net esx2
10.0.1.10 omnios.datanom.net omnios

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Above is the configuration where corosync is on a dedicated VLAN (Gbit Ethernet connected via an HP ProCurve switch). Before, I was using the default configuration where corosync ran over the Proxmox network (InfiniBand DDR). The errors were the same, and all multicast tests succeed on both Ethernet and InfiniBand. InfiniBand worked with Proxmox 3.4, but I have decided on a dedicated network for the future.
 
Hi Mir,

The problem is that your host IP resolution is on two networks.

Do this on both hosts:
First change the corosync.conf in /etc/corosync as follows:
the bindnetaddr must be in the same network as the ring address;
change the ring0_addr entries from names to IPs.
Then restart corosync.
Now check whether corosync is up with corosync-quorumtool.
Then restart pve-manager.

Only on one host:
now copy /etc/corosync/corosync.conf to /etc/pve/corosync.conf.
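As a sketch of what that change could look like, using the addresses from the /etc/hosts posted above (an assumption; adjust to your setup and bump config_version before applying):

Code:
# nodelist with IPs instead of names (addresses taken from the /etc/hosts above)
nodelist {
  node {
    name: esx1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.3.8
  }

  node {
    name: esx2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.3.9
  }
}

# then on both hosts:
#   systemctl restart corosync
#   corosync-quorumtool              # verify both nodes are members
#   systemctl restart pve-manager
# and on one host only:
#   cp /etc/corosync/corosync.conf /etc/pve/corosync.conf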
 
The problem is that your host IP resolution is on two networks.
Where do you see that?
esx1 resolves to 10.0.0.1.
esx1-corosync resolves to 172.16.3.8.

And as explained before, when I initially installed there was only the 10.0.0.x/24 network, showing exactly the same error.

the bindnetaddr must be in the same network as the ring address.
But the bindnetaddr is on the same network as the ring address!

bindnetaddr: 172.16.3.8
esx1-corosync: 172.16.3.8
esx2-corosync: 172.16.3.9
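To settle the resolution question, a quick check on both hosts (standard tools, sketch only):

Code:
# which addresses do the names actually resolve to on this host?
getent hosts esx1 esx2 esx1-corosync esx2-corosync
ip -4 addr show                  # confirm which interface carries 172.16.3.x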

But I did as you suggested and still get the same error. Log from the evicted node:
May 18 12:56:27 esx2 pmxcfs[2160]: [status] notice: update cluster info (cluster name midgaard, version = 11)
May 18 12:56:27 esx2 pmxcfs[2160]: [dcdb] crit: cpg_initialize failed: 6
May 18 12:56:27 esx2 pmxcfs[2160]: [status] crit: cpg_initialize failed: 6
May 18 12:56:27 esx2 pmxcfs[2160]: [status] notice: node has quorum
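cpg_initialize failed: 6 maps to CS_ERR_TRY_AGAIN, i.e. corosync's CPG service was not ready to accept the connection from pmxcfs. A hedged set of checks on the evicted node:

Code:
# diagnostic sketch (assumes the standard corosync tools are installed)
corosync-cfgtool -s              # local ring status
corosync-cpgtool                 # list the CPG groups corosync currently sees
systemctl restart pve-cluster    # retry pmxcfs once corosync looks healthy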
 
corosync quorum status from both nodes:

Code:
root@esx1:~# corosync-quorumtool
Quorum information
------------------
Date:             Wed May 18 13:01:00 2016
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          4008
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         1          1 172.16.3.8 (local)
         2          1 172.16.3.9

Code:
root@esx2:~# corosync-quorumtool
Quorum information
------------------
Date:             Wed May 18 13:01:22 2016
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          2
Ring ID:          4008
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         1          1 172.16.3.8
         2          1 172.16.3.9 (local)
 
Where do you see that?

In the first corosync.conf you sent, the ring address resolved to
esx1-corosync = 10.0.0.1, while the bindnetaddr was 172.16.3.8.

Can you send me the output of

pvecm status
/etc/pve/corosync.conf
/etc/corosync/corosync.conf
systemctl status pve-manager
systemctl status corosync
 
I have the same problem sometimes.
My cluster has 7 nodes. Every time I had to add a new node, one or more servers restarted.

Also, some virtual machines restarted after losing connection with the storage, which is a FreeNAS providing LVM over Fibre Channel.
It causes great disorder, because shutting the VMs down takes too long.
 
Can you provide more information?
Are we talking about cluster communication or storage problems?
What storage do you have?
 
Cluster communication.
What do you mean by storage? If you are referring to where the Proxmox nodes are installed, then it is bare metal.
 
