[SOLVED] Cluster problem - can't add new node

roosei
Hello,

I have 4 nodes cluster and I need to add new node.

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.17-2-pve)
pve-manager: 5.2-1 (running version: 5.2-1/0fcd7879)
pve-kernel-4.15: 5.2-2
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-31
libpve-guest-common-perl: 2.0-16
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-18
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-9
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-5
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-26
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

After adding the node:
Code:
pvecm add 172.16.0.2 -ring0_addr node5-corosync -use_ssh
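
For reference, the nodelist entry can be double-checked after the join like this (just a sketch; /etc/corosync/corosync.conf is the local copy corosync reads, while /etc/pve/corosync.conf is the cluster-wide one and is not yet synced on the unquorate node):

Code:
# on an existing, quorate member: the cluster-wide config
grep -A 4 'node5' /etc/pve/corosync.conf
# on the new node: the local copy that corosync actually uses
grep -A 4 'node5' /etc/corosync/corosync.conf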

corosync wrote:
Code:
Jun  5 14:08:11 node5 corosync[28501]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Jun  5 14:08:11 node5 corosync[28501]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Jun  5 14:08:11 node5 corosync[28501]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Jun  5 14:08:11 node5 corosync[28501]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Jun  5 14:08:11 node5 corosync[28501]: notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun  5 14:08:11 node5 corosync[28501]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun  5 14:08:11 node5 corosync[28501]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun  5 14:08:11 node5 corosync[28501]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun  5 14:08:11 node5 corosync[28501]: notice  [TOTEM ] The network interface [172.16.0.6] is now up.
Jun  5 14:08:11 node5 corosync[28501]:  [TOTEM ] The network interface [172.16.0.6] is now up.
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: cmap
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: cfg
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: cpg
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun  5 14:08:11 node5 corosync[28501]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Jun  5 14:08:11 node5 corosync[28501]: warning [WD    ] resource load_15min missing a recovery key.
Jun  5 14:08:11 node5 corosync[28501]: warning [WD    ] resource memory_used missing a recovery key.
Jun  5 14:08:11 node5 corosync[28501]: info    [WD    ] no resources configured.
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun  5 14:08:11 node5 corosync[28501]: notice  [QUORUM] Using quorum provider corosync_votequorum
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: votequorum
Jun  5 14:08:11 node5 corosync[28501]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun  5 14:08:11 node5 corosync[28501]: info    [QB    ] server name: quorum
Jun  5 14:08:11 node5 corosync[28501]: notice  [TOTEM ] A new membership (172.16.0.6:12756) was formed. Members joined: 5
Jun  5 14:08:11 node5 corosync[28501]: warning [CPG   ] downlist left_list: 0 received
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: cmap
Jun  5 14:08:11 node5 systemd[1]: Started Corosync Cluster Engine.
Jun  5 14:08:11 node5 corosync[28501]: notice  [QUORUM] Members[1]: 5
Jun  5 14:08:11 node5 corosync[28501]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: cfg
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: cpg
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun  5 14:08:11 node5 corosync[28501]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Jun  5 14:08:11 node5 corosync[28501]:  [WD    ] resource load_15min missing a recovery key.
Jun  5 14:08:11 node5 corosync[28501]:  [WD    ] resource memory_used missing a recovery key.
Jun  5 14:08:11 node5 corosync[28501]:  [WD    ] no resources configured.
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun  5 14:08:11 node5 corosync[28501]:  [QUORUM] Using quorum provider corosync_votequorum
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: votequorum
Jun  5 14:08:11 node5 corosync[28501]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun  5 14:08:11 node5 corosync[28501]:  [QB    ] server name: quorum
Jun  5 14:08:11 node5 corosync[28501]:  [TOTEM ] A new membership (172.16.0.6:12756) was formed. Members joined: 5
Jun  5 14:08:11 node5 corosync[28501]:  [CPG   ] downlist left_list: 0 received
Jun  5 14:08:11 node5 corosync[28501]:  [QUORUM] Members[1]: 5
Jun  5 14:08:11 node5 corosync[28501]:  [MAIN  ] Completed service synchronization, ready to provide service.

And "Expected votes" in cluster was raised to 5.

However, new node still shows "waiting for quorum..." and configs in /etc/pve is not synced from the cluster.

Than new node5 is showing just local node with expected 5 votes and all others nodes can't see it through pvecm status.

But it doesn't look like network issue, omping test was successful, I checked hostnames etc. Only error in syslog I found:

Code:
Jun  5 14:08:07 node5 pvesr[28252]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun  5 14:08:08 node5 pvesr[28252]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun  5 14:08:09 node5 pvesr[28252]: error with cfs lock 'file-replication_cfg': no quorum!
Jun  5 14:08:09 node5 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
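
For completeness, this is what I used to inspect the quorum and membership state on the new node (just the commands, nothing specific to this setup):

Code:
# cluster/quorum view as Proxmox sees it
pvecm status
# corosync's own quorum view
corosync-quorumtool -s
# check that both relevant services are actually running
systemctl status corosync pve-cluster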

Any ideas how to solve it?
 
It looks OK; 172.16.0.6 is the "new" node... Any ideas on how to debug multicast? I already tried disabling all firewalls on the networks.

Code:
 omping -c 600 -i 1 -q 172.16.0.2 172.16.0.3 172.16.0.4 172.16.0.5 172.16.0.6
172.16.0.2 : waiting for response msg
172.16.0.3 : waiting for response msg
172.16.0.4 : waiting for response msg
172.16.0.5 : waiting for response msg
172.16.0.3 : server told us to stop
172.16.0.2 : waiting for response msg
172.16.0.4 : waiting for response msg
172.16.0.5 : waiting for response msg
172.16.0.4 : server told us to stop
172.16.0.5 : server told us to stop
172.16.0.2 : server told us to stop

172.16.0.2 : response message never received
172.16.0.3 : response message never received
172.16.0.4 : response message never received
172.16.0.5 : response message never received

Also interesting: the other nodes in this cluster have been working perfectly for many months... So I don't think it's a network problem; perhaps a mistake in the config or something similar :(
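
Next I plan to run a longer, faster multicast test and check IGMP snooping on the bridge (vmbr0 is just a placeholder for the actual corosync bridge name here):

Code:
# longer multicast test; run it on all five nodes at the same time
omping -c 10000 -i 0.001 -F -q 172.16.0.2 172.16.0.3 172.16.0.4 172.16.0.5 172.16.0.6
# check whether IGMP snooping is enabled on the bridge (1 = enabled);
# a snooping switch without an IGMP querier is a common cause of this symptom
cat /sys/class/net/vmbr0/bridge/multicast_snooping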
 

I am also running into this issue on a new cluster.

Code:
asmping 224.0.2.1 192.168.0.10
asmping joined (S,G) = (*,224.0.2.234)
pinging 192.168.0.10 from 207.66.141.156
  unicast from 192.168.0.10, seq=1 dist=0 time=0.219 ms
  unicast from 192.168.0.10, seq=2 dist=0 time=0.141 ms
multicast from 192.168.0.10, seq=2 dist=0 time=0.166 ms
multicast from 192.168.0.10, seq=3 dist=0 time=0.189 ms
  unicast from 192.168.0.10, seq=3 dist=0 time=0.205 ms
  unicast from 192.168.0.10, seq=4 dist=0 time=0.106 ms
multicast from 192.168.0.10, seq=4 dist=0 time=0.163 ms
  unicast from 192.168.0.10, seq=5 dist=0 time=0.159 ms
multicast from 192.168.0.10, seq=5 dist=0 time=0.181 ms
  unicast from 192.168.0.10, seq=6 dist=0 time=0.127 ms
multicast from 192.168.0.10, seq=6 dist=0 time=0.191 ms
multicast from 192.168.0.10, seq=7 dist=0 time=0.185 ms
  unicast from 192.168.0.10, seq=7 dist=0 time=0.200 ms
multicast from 192.168.0.10, seq=8 dist=0 time=0.199 ms
  unicast from 192.168.0.10, seq=8 dist=0 time=0.217 ms
  unicast from 192.168.0.10, seq=9 dist=0 time=0.189 ms
multicast from 192.168.0.10, seq=9 dist=0 time=0.213 ms
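
Since asmping looks clean but corosync still won't form a membership, I guess a packet capture would show whether the corosync traffic itself arrives (the interface name is an assumption; 5404/5405 are the default corosync UDP ports):

Code:
# watch corosync totem traffic on the cluster interface
tcpdump -i vmbr0 -n 'udp port 5404 or udp port 5405'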
 
