[SOLVED] Cluster problem - can't add new node

roosei

Renowned Member
Nov 3, 2016
Hello,

I have a 4-node cluster and I need to add a new node.

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.17-2-pve)
pve-manager: 5.2-1 (running version: 5.2-1/0fcd7879)
pve-kernel-4.15: 5.2-2
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-31
libpve-guest-common-perl: 2.0-16
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-18
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-9
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-5
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-26
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

I added the node with:
Code:
pvecm add 172.16.0.2 -ring0_addr node5-corosync -use_ssh
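
For reference, the join is supposed to add an entry like this to the nodelist section of /etc/pve/corosync.conf (a rough sketch; the nodeid and name are my guess from the numbering, only ring0_addr comes from the command above):
Code:
node {
    name: node5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: node5-corosync
}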

corosync logged the following:
Code:
Jun  5 14:08:11 node5 corosync[28501]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Jun  5 14:08:11 node5 corosync[28501]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Jun  5 14:08:11 node5 corosync[28501]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Jun  5 14:08:11 node5 corosync[28501]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Jun  5 14:08:11 node5 corosync[28501]: notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun  5 14:08:11 node5 corosync[28501]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun  5 14:08:11 node5 corosync[28501]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun  5 14:08:11 node5 corosync[28501]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun  5 14:08:11 node5 corosync[28501]: notice  [TOTEM ] The network interface [172.16.0.6] is now up.
Jun  5 14:08:11 node5 corosync[28501]:  [TOTEM ] The network interface [172.16.0.6] is now up.
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: cmap
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: cfg
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: cpg
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun  5 14:08:11 node5 corosync[28501]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Jun  5 14:08:11 node5 corosync[28501]: warning [WD    ] resource load_15min missing a recovery key.
Jun  5 14:08:11 node5 corosync[28501]: warning [WD    ] resource memory_used missing a recovery key.
Jun  5 14:08:11 node5 corosync[28501]: info    [WD    ] no resources configured.
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun  5 14:08:11 node5 corosync[28501]: notice  [QUORUM] Using quorum provider corosync_votequorum
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: votequorum
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: quorum
Jun  5 14:08:11 node5 corosync[28501]: notice  [TOTEM ] A new membership (172.16.0.6:12756) was formed. Members joined: 5
Jun  5 14:08:11 node5 corosync[28501]: warning [CPG   ] downlist left_list: 0 received
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: cmap
Jun  5 14:08:11 node5 systemd[1]: Started Corosync Cluster Engine.
Jun  5 14:08:11 node5 corosync[28501]: notice  [QUORUM] Members[1]: 5
Jun  5 14:08:11 node5 corosync[28501]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: cfg
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: cpg
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun  5 14:08:11 node5 corosync[28501]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Jun  5 14:08:11 node5 corosync[28501]:  [WD    ] resource load_15min missing a recovery key.
Jun  5 14:08:11 node5 corosync[28501]:  [WD    ] resource memory_used missing a recovery key.
Jun  5 14:08:11 node5 corosync[28501]:  [WD    ] no resources configured.
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun  5 14:08:11 node5 corosync[28501]:  [QUORUM] Using quorum provider corosync_votequorum
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: votequorum
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: quorum
Jun  5 14:08:11 node5 corosync[28501]:  [TOTEM ] A new membership (172.16.0.6:12756) was formed. Members joined: 5
Jun  5 14:08:11 node5 corosync[28501]:  [CPG   ] downlist left_list: 0 received
Jun  5 14:08:11 node5 corosync[28501]:  [QUORUM] Members[1]: 5
Jun  5 14:08:11 node5 corosync[28501]:  [MAIN  ] Completed service synchronization, ready to provide service.

And "Expected votes" in the cluster was raised to 5.

However, the new node still shows "waiting for quorum..." and the configs in /etc/pve are not synced from the cluster.

The new node5 shows just the local node with 5 expected votes, and all the other nodes can't see it via pvecm status.

But it doesn't look like a network issue: the omping test was successful, I checked the hostnames, etc. The only error I found in syslog is:

Code:
Jun  5 14:08:07 node5 pvesr[28252]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun  5 14:08:08 node5 pvesr[28252]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun  5 14:08:09 node5 pvesr[28252]: error with cfs lock 'file-replication_cfg': no quorum!
Jun  5 14:08:09 node5 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a

Any ideas how to solve it?
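
For completeness, these are roughly the checks I ran on node5 (just a sketch of the commands, output omitted; corosync-quorumtool ships with the corosync package):
Code:
# membership and quorum state as the new node sees it
pvecm status
corosync-quorumtool -s

# make sure corosync and the cluster filesystem are actually running
systemctl status corosync pve-cluster

# node list as seen by pmxcfs (stays local-only while there is no quorum)
cat /etc/pve/.members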
 
It looks OK; 172.16.0.6 is the "new" node... Any ideas on how to debug multicast? I also tried disabling all the firewalls on the networks.

Code:
 omping -c 600 -i 1 -q 172.16.0.2 172.16.0.3 172.16.0.4 172.16.0.5 172.16.0.6
172.16.0.2 : waiting for response msg
172.16.0.3 : waiting for response msg
172.16.0.4 : waiting for response msg
172.16.0.5 : waiting for response msg
172.16.0.3 : server told us to stop
172.16.0.2 : waiting for response msg
172.16.0.4 : waiting for response msg
172.16.0.5 : waiting for response msg
172.16.0.4 : server told us to stop
172.16.0.5 : server told us to stop
172.16.0.2 : server told us to stop

172.16.0.2 : response message never received
172.16.0.3 : response message never received
172.16.0.4 : response message never received
172.16.0.5 : response message never received

Also interesting: the other nodes in this cluster have been working perfectly for many months... So I don't think it's a network problem; perhaps a mistake in the config or something similar :(
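
To debug the multicast side further, this is roughly what I plan to try next (the longer omping stress test from the Proxmox docs plus a check for IGMP snooping; vmbr0 is just an example for the bridge carrying the corosync network):
Code:
# longer multicast stress test, started on all five nodes at the same time
omping -c 10000 -i 0.001 -F -q 172.16.0.2 172.16.0.3 172.16.0.4 172.16.0.5 172.16.0.6

# check whether IGMP snooping is enabled on the bridge (0 = off)
cat /sys/class/net/vmbr0/bridge/multicast_snooping

# watch IGMP membership reports and queries on that interface
tcpdump -i vmbr0 -n igmp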
 
I am also running into this issue on a new cluster.

Code:
asmping 224.0.2.1 192.168.0.10
asmping joined (S,G) = (*,224.0.2.234)
pinging 192.168.0.10 from 207.66.141.156
  unicast from 192.168.0.10, seq=1 dist=0 time=0.219 ms
  unicast from 192.168.0.10, seq=2 dist=0 time=0.141 ms
multicast from 192.168.0.10, seq=2 dist=0 time=0.166 ms
multicast from 192.168.0.10, seq=3 dist=0 time=0.189 ms
  unicast from 192.168.0.10, seq=3 dist=0 time=0.205 ms
  unicast from 192.168.0.10, seq=4 dist=0 time=0.106 ms
multicast from 192.168.0.10, seq=4 dist=0 time=0.163 ms
  unicast from 192.168.0.10, seq=5 dist=0 time=0.159 ms
multicast from 192.168.0.10, seq=5 dist=0 time=0.181 ms
  unicast from 192.168.0.10, seq=6 dist=0 time=0.127 ms
multicast from 192.168.0.10, seq=6 dist=0 time=0.191 ms
multicast from 192.168.0.10, seq=7 dist=0 time=0.185 ms
  unicast from 192.168.0.10, seq=7 dist=0 time=0.200 ms
multicast from 192.168.0.10, seq=8 dist=0 time=0.199 ms
  unicast from 192.168.0.10, seq=8 dist=0 time=0.217 ms
  unicast from 192.168.0.10, seq=9 dist=0 time=0.189 ms
multicast from 192.168.0.10, seq=9 dist=0 time=0.213 ms