Adding a node to an established cluster fails with incomplete corosync sync

verbunk

Hey Everyone,

Hoping to get some tips diagnosing an issue adding a node to an established cluster.

= Starting config
- 3 nodes
- The management UI is on VLAN 40 and corosync is on VLAN 45, on separate Ethernet ports. Originally I had both on VLAN 40, then migrated corosync to VLAN 45 by adding a new ring and removing the old one (the output below still shows the old VLAN for some reason; see the sketch after this list).
- Existing 3 nodes are happy
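
For reference, after the ring migration I'd expect each node entry in /etc/pve/corosync.conf to carry the VLAN 45 address, roughly like the sketch below (names and addresses are examples patterned on my setup, not the literal file contents):

Code:
nodelist {
  node {
    name: pve-i1n1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.45.76    # corosync link now on VLAN 45 (example address)
  }
  # ... one node {} block per cluster member ...
}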

Bash:
# pvecm status
Cluster information
-------------------
Name:             pve-i1
Config Version:   12
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Apr  5 00:07:32 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.5123
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.40.78 (local)
0x00000002          1 10.0.40.79
0x00000003          1 10.0.40.80
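
To double-check which addresses corosync is actually using for its links (as opposed to what pvecm prints above), I believe this can be run on any of the established nodes:

Bash:
# show the knet link status and addresses as corosync itself sees them
corosync-cfgtool -s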

The new node is a fairly vanilla install. I installed it and then ran `apt upgrade` to bring it to a similar patch level. On the new node I verified that it could reach every VLAN the established cluster uses, pinging each node's IP. When I grab the cluster join token and drop it into the new node, a modal status dialog pops up. It shows:

Code:
Establishing API connection with host '10.0.40.77'
Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service

(10.0.40.77 is the management IP of node 3.)
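
The VLAN reachability check I mentioned above was roughly this loop, run from the new node (the 10.0.45.x addresses are placeholders, since I haven't pasted the real corosync IPs of the established nodes here):

Bash:
# ping every existing node on both the management (VLAN 40) and corosync (VLAN 45) networks
for ip in 10.0.40.78 10.0.40.79 10.0.40.80 \
          10.0.45.1 10.0.45.2 10.0.45.3; do
    ping -c1 -W1 "$ip" >/dev/null && echo "$ip OK" || echo "$ip unreachable"
done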

After a while I sometimes get an invalid token error, or some other non-descriptive join error.

Looking in the logs (`journalctl -f`) I see lines like:

Code:
pveproxy[3906]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2009.
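
That pve-ssl.key lives on the cluster filesystem, so presumably it was never generated or synced for this node. A quick existence check (paths per the default layout, with pve-i1n4 being this node):

Bash:
# the "local" symlink should point at this node's directory on pmxcfs
ls -l /etc/pve/local/ /etc/pve/nodes/pve-i1n4/

My understanding is that `pvecm updatecerts` would normally regenerate these, but presumably it can't do much while /etc/pve isn't writable.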

Looking at the cluster filesystem folder (/etc/pve) on the new node, it's mostly blank in comparison:

Bash:
# ls -alh
total 4.5K
drwxr-xr-x  2 root www-data    0 Dec 31  1969 .
drwxr-xr-x 92 root root     4.0K Apr  4 23:44 ..
-r--r-----  1 root www-data  155 Dec 31  1969 .clusterlog
-r--r-----  1 root www-data  633 Apr  4 23:39 corosync.conf
-rw-r-----  1 root www-data    2 Dec 31  1969 .debug
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 local -> nodes/pve-i1n4
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 lxc -> nodes/pve-i1n4/lxc
-r--r-----  1 root www-data  309 Dec 31  1969 .members
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 openvz -> nodes/pve-i1n4/openvz
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 qemu-server -> nodes/pve-i1n4/qemu-server
-r--r-----  1 root www-data  222 Dec 31  1969 .rrd
-r--r-----  1 root www-data  924 Dec 31  1969 .version
-r--r-----  1 root www-data   18 Dec 31  1969 .vmlist

- corosync.conf looks OK, with all nodes accounted for.
- The other nodes do see a nodes/pve-i1n4 folder for this new node, but the nodes folder is completely missing from /etc/pve/ on the new node (checked from both sides as shown below).
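
(Checked roughly like this, pve-i1n1 being one of the established nodes:)

Bash:
# on pve-i1n1 (established): the new node's config directory exists
ls -d /etc/pve/nodes/pve-i1n4
# on pve-i1n4 (new node): the whole nodes/ tree is missing
ls -d /etc/pve/nodes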

pve-cluster.service appears OK.

Code:
Apr 05 00:26:38 pve-i1n4 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Apr 05 00:26:38 pve-i1n4 pmxcfs[8008]: [main] notice: resolved node name 'pve-i1n4' to '10.0.40.78' for default node IP address
Apr 05 00:26:38 pve-i1n4 pmxcfs[8008]: [main] notice: resolved node name 'pve-i1n4' to '10.0.40.78' for default node IP address
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [status] notice: update cluster info (cluster name  pve-i1, version = 12)
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [dcdb] notice: members: 4/8010
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [dcdb] notice: all data is up to date
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [status] notice: members: 4/8010
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [status] notice: all data is up to date
Apr 05 00:26:39 pve-i1n4 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.

But of course there are issues: namely, the web UI won't allow logins anymore (because SSL is broken?). `pvecm status` on the new node shows "Activity blocked":

Code:
# pvecm status
Cluster information
-------------------
Name:             pve-i1
Config Version:   12
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Apr  5 00:30:38 2024
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4.f
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:          

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.0.45.78 (local)
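
If I'm reading the votequorum numbers right, this is just the standard majority rule, so a single isolated vote can never unblock itself:

Code:
expected votes = 4
quorum         = floor(4/2) + 1 = 3
total votes    = 1   (only this node's own vote)
1 < 3  ->  activity blocked, /etc/pve stays read-only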

I'm unsure how to check why, though. On the established cluster, pve-cluster.service seems happy:

Code:
Apr 04 23:57:20 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 04 23:59:17 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:08:00 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:14:17 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:16:29 pve-i1n1 pmxcfs[2061]: [dcdb] notice: data verification successful
Apr 05 00:23:20 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:29:18 pve-i1n1 pmxcfs[2061]: [status] notice: received log
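
One thing I can still check from the established side is whether node 4 ever shows up in the corosync membership there at all (if it never does, that would point at the VLAN 45 link rather than pmxcfs):

Bash:
# run on an established node; node 4 should be listed once the knet link comes up
pvecm nodes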

I've already tried to help it along by attempting to create the missing folders, thinking that at some point it would have what it needed and continue:

Code:
/etc/pve# mkdir nodes
mkdir: cannot create directory ‘nodes’: Permission denied

but no such luck.
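
Which would make sense if /etc/pve is the pmxcfs FUSE mount and goes read-only whenever the node isn't quorate; it's easy to confirm it isn't a plain on-disk directory:

Bash:
# /etc/pve is provided by pmxcfs (a FUSE filesystem), not a normal directory
findmnt /etc/pve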



= Follow-up to "namely the web UI won't allow logins anymore (because SSL is broken?)": in the logs I'm now noticing it killing the web workers after ~1s, so it must be aborting the login. SSH still works, as you'd expect, since it's a separate daemon.
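
For the record, I'm watching that with something like:

Bash:
# follow just the web proxy unit to catch the worker exits
journalctl -u pveproxy -f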




= Adding: I've also run an `apt full-upgrade` as suggested in another thread. No improvement.

On each node I've checked /etc/hosts; each has the new node's IP / FQDN / hostname, and each node in the established cluster can ping it by hostname.

From an established node:

Code:
# ping -c1 pve-i1n4
PING pve-i1n4.domain.tld (10.0.40.78) 56(84) bytes of data.
64 bytes from pve-i1n4.domain.tld (10.0.40.78): icmp_seq=1 ttl=64 time=0.315 ms

--- pve-i1n4.domain.tld ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.715/0.715/0.715/0.000 ms
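
For completeness, the /etc/hosts entries follow the usual pattern on each node; this is the new node's line (address and domain as in the ping above):

Code:
10.0.40.78   pve-i1n4.domain.tld pve-i1n4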
 
