pve-4.2 add node fails

mir
I have upgraded an empty node using apt-get dist-upgrade according to https://forum.proxmox.com/threads/howto-proxmox-3-4-4-2-upgrade-with-qemu-live-migration.27348/
Following the advice for adding further nodes failed utterly, so I have moved all VM configurations to the first node, leaving no VMs on the other node. Rebooting the node after removing the cluster and corosync configuration brings it up again in a working state, but of course without connection to the other node. To ensure multicast operation I have run all multicast tests from the troubleshooting wiki successfully.
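For reference, a sketch of the kind of reset sequence meant above (assumed steps, based on the generally documented way to clear cluster state on a PVE 4.x node; adapt before running):

Code:
# assumed reset sequence to bring a node back to a standalone, working state
systemctl stop pve-cluster corosync    # stop the cluster filesystem and corosync
pmxcfs -l                              # start pmxcfs in local mode so /etc/pve is writable
rm /etc/pve/corosync.conf              # remove the cluster-wide corosync config
rm -rf /etc/corosync/*                 # remove the local corosync config and authkey
killall pmxcfs                         # stop the local-mode pmxcfs again
systemctl start pve-cluster            # the node comes up standalone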

When I try to add a node to the newly created cluster on the first node, it fails every time, leaving the node in a broken state; as a result the first node is also broken, since it loses quorum. Below is the result of adding a node. This result is replicated each time I try it from a healthy state.

root@esx2:~# pvecm add 10.0.0.1 -force
copy corosync auth key
stopping pve-cluster service
backup old database
waiting for quorum...OK
generating node certificates
unable to create directory '/etc/pve/priv' - Device or resource busy
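A hedged way to narrow down the 'Device or resource busy' error is to check whether the pmxcfs fuse mount on /etc/pve is present and writable (without quorum it goes read-only, and a stale or half-initialized mount can show up as errors like this):

Code:
# diagnostic sketch for the /etc/pve/priv error (assumed checks)
findmnt /etc/pve                 # is the pmxcfs fuse mount present?
systemctl status pve-cluster     # is pmxcfs (pve-cluster) actually running?
pvecm status                     # without quorum /etc/pve is read-only
touch /etc/pve/.writetest && rm /etc/pve/.writetest   # hypothetical file name, quick write check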
 
from: https://pve.proxmox.com/wiki/Multicast_notes#Using_omping
  • note: to find the multicast address on Proxmox 4.x, run this:
corosync-cmapctl -g totem.interface.0.mcastaddr
Is this still valid?
I don't see this key, but quorum is achieved.
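Since the exact cmap key name can vary between corosync versions, a way to check (sketch, assuming corosync-cmapctl is installed) is to dump the whole cmap and grep:

Code:
# dump all cmap keys and look for the totem / multicast settings
corosync-cmapctl | grep -i totem
corosync-cmapctl | grep -i mcast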
pvecm s
Quorum information
------------------
Date: Mon May 16 14:43:59 2016
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 928
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.0.1 (local)
0x00000002 1 10.0.0.2
From syslog, proving multicast is working:
May 16 14:18:42 esx1 corosync[13943]: [TOTEM ] A new membership (10.0.0.1:928) was formed. Members joined: 2
May 16 14:18:42 esx1 corosync[13943]: [QUORUM] This node is within the primary component and will provide service.
May 16 14:18:42 esx1 corosync[13943]: [QUORUM] Members[2]: 1 2
May 16 14:18:42 esx1 corosync[13943]: [MAIN ] Completed service synchronization, ready to provide service.
May 16 14:18:42 esx1 pmxcfs[13918]: [status] notice: node has quorum
May 16 14:22:05 esx1 sshd[6260]: Accepted publickey for root from 10.0.0.2 port 60252 ssh2: RSA 95:6f:51:4b:83:1d:a4:20:ee:41:e1:51:33:c9:b6:2c

Starting pve-cluster fails, producing these logs in syslog:
pve-ha-lrm[2611]: ipcc_send_rec failed: Connection refused

How to fix?
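'ipcc_send_rec failed: Connection refused' means clients such as pve-ha-lrm cannot reach pmxcfs at all. A hedged recovery sketch while debugging (not a permanent fix):

Code:
# assumed sequence: get /etc/pve back locally and look at why the normal start fails
systemctl stop pve-cluster
pmxcfs -l                        # local mode: mounts /etc/pve without cluster sync
journalctl -u pve-cluster -b     # inspect the failed start of the pve-cluster unit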
 
It seems the problem can be narrowed down to initializing pmxcfs after joining the cluster. Copying the config.db file manually does not help, so I think the problem is that the encrypted communication in corosync fails. The failing join seems to be related to the key exchange between the cluster nodes, so synchronization fails, which causes the newly added node to refuse to initialize the pmxcfs filesystem in /etc/pve.
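If the key-exchange theory is right, the corosync authkey should differ (or be missing) on the joining node; with secauth on it must be byte-identical everywhere. A hedged check:

Code:
# run on both esx1 and esx2 and compare the output
sha256sum /etc/corosync/authkey      # checksums must match on all nodes
ls -l /etc/corosync/authkey          # typically root-owned, mode 0400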
 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: esx2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: esx2-corosync
  }

  node {
    name: esx1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: esx1-corosync
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: midgaard
  config_version: 10
  ip_version: ipv4
  secauth: on
  version: 2

  interface {
    bindnetaddr: 172.16.3.8
    ringnumber: 0
  }
}
# cat /etc/hosts
127.0.0.1 localhost
172.16.3.8 esx1-corosync.datanom.net esx1-corosync
172.16.3.9 esx2-corosync.datanom.net esx2-corosync

10.0.0.1 esx1.datanom.net esx1 pvelocalhost
10.0.0.2 esx2.datanom.net esx2
10.0.1.10 omnios.datanom.net omnios

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Above is the configuration where corosync is on a dedicated VLAN (Gbit Ethernet connected via an HP ProCurve switch). Before, I was using the default configuration where corosync ran over the Proxmox network (InfiniBand DDR). The errors were the same, and all multicast tests succeed on both Ethernet and InfiniBand. InfiniBand worked with Proxmox 3.4, but I have decided on a dedicated network for the future.
 
Hi Mir,

The problem is that your host IP resolution is on two networks.

Do this on both hosts:
First change the corosync.conf in /etc/corosync as follows:
the bindnetaddr must be in the same network as the ring address;
change the ring0_addr entries from names to IPs.
Then restart corosync.
Now check whether corosync is up with corosync-quorumtool.
Then restart pve-manager.

Only on one host:
now copy /etc/corosync/corosync.conf to /etc/pve/corosync.conf.
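As a sketch of what that change could look like, using the addresses from the /etc/hosts posted above (an assumption; adjust to your setup and bump config_version before applying):

Code:
# nodelist with IPs instead of names (addresses taken from the /etc/hosts above)
nodelist {
  node {
    name: esx1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.3.8
  }

  node {
    name: esx2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.3.9
  }
}

# then on both hosts:
#   systemctl restart corosync
#   corosync-quorumtool              # verify both nodes are members
#   systemctl restart pve-manager
# and on one host only:
#   cp /etc/corosync/corosync.conf /etc/pve/corosync.conf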
 
The problem is that your host IP resolution is on two networks.
Where do you see that?
esx1 resolves to 10.0.0.1.
esx1-corosync resolves to 172.16.3.8.

And as explained before, when I initially installed there was only the 10.0.0.x/24 network, showing exactly the same error.

the bindnetaddr must be in the same network as the ring address.
But the bindnetaddr is on the same network as the ring address!

bindnetaddr: 172.16.3.8
esx1-corosync: 172.16.3.8
esx2-corosync: 172.16.3.9
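To settle the resolution question, a quick check on both hosts (standard tools, sketch only):

Code:
# which addresses do the names actually resolve to on this host?
getent hosts esx1 esx2 esx1-corosync esx2-corosync
ip -4 addr show                  # confirm which interface carries 172.16.3.x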

But I did as you suggested and still get the same error. Log from the evicted node:
May 18 12:56:27 esx2 pmxcfs[2160]: [status] notice: update cluster info (cluster name midgaard, version = 11)
May 18 12:56:27 esx2 pmxcfs[2160]: [dcdb] crit: cpg_initialize failed: 6
May 18 12:56:27 esx2 pmxcfs[2160]: [status] crit: cpg_initialize failed: 6
May 18 12:56:27 esx2 pmxcfs[2160]: [status] notice: node has quorum
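cpg_initialize failed: 6 maps to CS_ERR_TRY_AGAIN, i.e. corosync's CPG service was not ready to accept the connection from pmxcfs. A hedged set of checks on the evicted node:

Code:
# diagnostic sketch (assumes the standard corosync tools are installed)
corosync-cfgtool -s              # local ring status
corosync-cpgtool                 # list the CPG groups corosync currently sees
systemctl restart pve-cluster    # retry pmxcfs once corosync looks healthy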
 
corosync quorum status from both nodes:

Code:
root@esx1:~# corosync-quorumtool
Quorum information
------------------
Date:             Wed May 18 13:01:00 2016
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          4008
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         1          1 172.16.3.8 (local)
         2          1 172.16.3.9

Code:
root@esx2:~# corosync-quorumtool
Quorum information
------------------
Date:             Wed May 18 13:01:22 2016
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          2
Ring ID:          4008
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         1          1 172.16.3.8
         2          1 172.16.3.9 (local)
 
Where do you see that?

In the first corosync.conf you sent, the ring address resolved to
esx1-corosync = 10.0.0.1, while the bindnetaddr was 172.16.3.8.

Can you send me the output of

pvecm status
/etc/pve/corosync.conf
/etc/corosync/corosync.conf
systemctl status pve-manager
systemctl status corosync
 
I have the same problem sometimes.
My cluster has 7 nodes. Every time I had to add a new node, one or more servers restarted.

Also, some virtual machines restarted after losing connection with the storage, which is a FreeNAS providing LVM over Fibre Channel.
It causes great disorder, because shutting the VMs down takes too long.
 
Can you provide more information?
Are we talking about cluster communication or storage problems?
What storage do you have?
 
Cluster communication.
What do you mean by storage? If you are referring to where the Proxmox nodes are installed, then it is bare metal.
 
