Cannot add node(s) to cluster

alexskysilk

I am having a hard time adding some nodes to an existing cluster. All nodes are freshly updated and running the same version of proxmox-ve, as follows:

Code:
# pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.15.15-1-pve)
pve-manager: 5.1-51 (running version: 5.1-51/96be5354)
pve-kernel-4.13: 5.1-44
pve-kernel-4.15: 5.1-3
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.15.3-1-pve: 4.15.3-1
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.2-pve4
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
openvswitch-switch: 2.7.0-2
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-25
pve-container: 2.0-21
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

Four nodes are Intel-based and are working normally; two nodes are AMD-based and will not join the cluster, failing with the dreaded 'configuration error: nodelist or quorum.expected_votes must be configured!' error. The hosts file is identical on all nodes, and the cluster traffic NICs are all identical as well. Every node can ping every other node, by name or by IP.

When executing "pvecm add node" from one of the AMD machines, an entry is inserted into corosync.conf on the cluster, but with the new node's IP address instead of its name, e.g.:

Code:
# cat /etc/pve/corosync.conf
logging {
  debug: on
  to_syslog: yes
}

nodelist {
  node {
    name: nvme1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: nvme1
  }
  node {
    name: nvme2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: nvme2
  }
  node {
    name: nvme3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: nvme3
  }
  node {
    name: nvme4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: nvme4
  }
  node {
    name: nvme5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.19.0.5
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: test
  config_version: 17
  interface {
    bindnetaddr: 172.19.0.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

I tried changing it manually and bumping the config_version, but there was no change in behavior. Help!
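For anyone attempting the same manual edit, here is a minimal sketch of the version bump, working on the file contents as a string. The function name bump_config_version is made up for this example, and the sample text is a trimmed copy of the totem section above. As far as I know, the usual advice is to edit a copy of /etc/pve/corosync.conf and move it into place afterwards, so treat this as an illustration of the text change only, not as a complete procedure.

```python
import re


def bump_config_version(conf_text):
    """Return conf_text with the first config_version value incremented by one."""
    pattern = re.compile(
        r"^([ \t]*config_version:[ \t]*)(\d+)[ \t]*$", flags=re.MULTILINE
    )

    def _increment(match):
        # Keep the original indentation and key, replace only the number.
        return f"{match.group(1)}{int(match.group(2)) + 1}"

    new_text, count = pattern.subn(_increment, conf_text, count=1)
    if count == 0:
        raise ValueError("no config_version line found")
    return new_text


# Trimmed sample based on the totem section posted above.
SAMPLE_TOTEM = """totem {
  cluster_name: test
  config_version: 17
  version: 2
}
"""

if __name__ == "__main__":
    print(bump_config_version(SAMPLE_TOTEM))
```

The plain "version: 2" line is left alone because the pattern only matches a line whose key is exactly config_version.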
 
Hi,
Regarding "hosts file is identical on all nodes":
Note that you have two corosync.conf files:
/etc/corosync/corosync.conf is where corosync looks for its config
/etc/pve/corosync.conf is where PVE distributes the config file
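A quick way to see whether the two copies have drifted apart is to compare their contents and their config_version values. The sketch below is only an illustration: the helper names (config_version, compare_copies, compare_files) are invented here, and the default paths are the two mentioned above.

```python
import re


def config_version(conf_text):
    """Return the config_version number found in conf_text, or None."""
    match = re.search(
        r"^[ \t]*config_version:[ \t]*(\d+)[ \t]*$", conf_text, flags=re.MULTILINE
    )
    return int(match.group(1)) if match else None


def compare_copies(local_text, cluster_text):
    """Summarise how the local and cluster-wide copies relate to each other."""
    return {
        "local_version": config_version(local_text),
        "cluster_version": config_version(cluster_text),
        "identical": local_text.strip() == cluster_text.strip(),
    }


def compare_files(
    local_path="/etc/corosync/corosync.conf",
    cluster_path="/etc/pve/corosync.conf",
):
    """Read both files from disk and compare them (needs read access on a node)."""
    with open(local_path) as local_file, open(cluster_path) as cluster_file:
        return compare_copies(local_file.read(), cluster_file.read())
```

Running compare_files() on the node that refuses to start corosync, and again on a working node, should show whether the failing node ever received the current config.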

Do not mix IPs and names in the config.
I personally prefer IPs because they make it more obvious which IP and network are used.
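To spot a mixed nodelist without reading it by eye, something like the following could work. It is a rough sketch that only looks at ring0_addr lines; the function names are invented for this example and the sample is a shortened copy of the nodelist posted above.

```python
import ipaddress
import re


def ring0_addresses(conf_text):
    """Return all ring0_addr values in conf_text, in file order."""
    return re.findall(
        r"^[ \t]*ring0_addr:[ \t]*(\S+)[ \t]*$", conf_text, flags=re.MULTILINE
    )


def is_ip(value):
    """True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def mixes_ips_and_names(conf_text):
    """True if some ring0_addr entries are IPs and others are host names."""
    kinds = {is_ip(addr) for addr in ring0_addresses(conf_text)}
    return len(kinds) > 1


# Shortened sample based on the nodelist posted above.
SAMPLE_NODELIST = """
nodelist {
  node {
    name: nvme4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: nvme4
  }
  node {
    name: nvme5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.19.0.5
  }
}
"""

if __name__ == "__main__":
    print(ring0_addresses(SAMPLE_NODELIST))
    print(mixes_ips_and_names(SAMPLE_NODELIST))
```

For the config in the first post this would flag the nvme5 entry, since it is the only one using an address rather than a name.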
 
I get that. I attempted to add the node using the host's short name, not its IP address. The addition process completes part way and then fails, but not before the node gets added to the cluster's corosync.conf (/etc/pve/corosync.conf), and with the IP ADDRESS instead of the host's short name. I don't understand why, and corosync is unable to start on the new node.
 
Can the hostname in the corosync config be resolved?
Check /etc/hosts; the IP addresses listed there must also match the ones configured in /etc/network/interfaces.
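One way to run that check is to parse the hosts file and list the node names it does not cover. This is a rough sketch: the helper names are made up, and the sample hosts content is hypothetical (only the 172.19.0.5 address for nvme5 comes from the config posted above; the domain and the nvme4 address are assumptions).

```python
def parse_hosts(hosts_text):
    """Map each host name or alias in an /etc/hosts style text to its IP."""
    mapping = {}
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            # The first entry for a name wins, as with normal hosts lookups.
            mapping.setdefault(name, ip)
    return mapping


def missing_names(node_names, hosts_text):
    """Return the node names that have no entry in the hosts text."""
    known = parse_hosts(hosts_text)
    return [name for name in node_names if name not in known]


# Hypothetical hosts file; domain and nvme4 address are assumptions.
SAMPLE_HOSTS = """
127.0.0.1   localhost
172.19.0.4  nvme4.example.test nvme4   # assumed address
172.19.0.5  nvme5.example.test nvme5
"""

if __name__ == "__main__":
    print(parse_hosts(SAMPLE_HOSTS))
    print(missing_names(["nvme4", "nvme5", "nvme6"], SAMPLE_HOSTS))
```

Comparing the parse_hosts() output against the addresses in /etc/network/interfaces on each node would then be a manual step.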
 
Yes, but there is no real corrective action I can prescribe. It is most likely network related, although I can't say what. The problem was cured by blowing everything away and reinstalling. There is no rational reason for it having worked afterwards, but there was no rational reason for it not working in the first place...

If it's any consolation, I haven't seen anything like that since, on at least 4 subsequent clusters.