Adding node to established cluster fails with incomplete corosync sync

verbunk

Member
Jun 14, 2023
Hey Everyone,

Hoping to get some tips diagnosing an issue adding a node to an established cluster.

= Starting config
- 3 nodes
- Management UI is on VLAN 40 and corosync is on VLAN 45, on separate ethernet ports. Originally I had both on VLAN 40, then migrated corosync to VLAN 45 by adding a new ring and removing the old one (the status output below still shows the old VLAN for some reason - see the check after this list).
- The existing 3 nodes are happy.
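
In case it's relevant, here's how I've been double-checking which address corosync actually uses per link. I believe `corosync-cfgtool -s` reports the live knet link state, and the grep just confirms what the on-disk config says:

Bash:
# Live corosync link status for the local node (per-link addresses)
corosync-cfgtool -s

# What the on-disk config actually contains per node
grep -E 'name:|nodeid:|ring0_addr:' /etc/pve/corosync.conf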

Bash:
# pvecm status
Cluster information
-------------------
Name:             pve-i1
Config Version:   12
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Apr  5 00:07:32 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.5123
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.40.78 (local)
0x00000002          1 10.0.40.79
0x00000003          1 10.0.40.80

The new node is a fairly fresh install; I installed it and then ran `apt upgrade` to bring it to a similar patch level. On the new node I verified that it could reach every VLAN the established cluster can see, for every node (pinged each IP - see the loop further down). When I grab the cluster join token and drop it into the new node, it pops up the modal dialog with status. It shows:

Code:
Establishing API connection with host '10.0.40.77'
Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service

(10.0.40.77 is the management IP of node 3.)
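
For completeness, the reachability check I mentioned was roughly this loop, run from the new node (the exact IPs are placeholders for each node's management and corosync addresses in my numbering scheme):

Bash:
# One ping per cluster address; -W1 caps the wait at 1 second
for ip in 10.0.40.78 10.0.40.79 10.0.40.80 \
          10.0.45.78 10.0.45.79 10.0.45.80; do
    ping -c1 -W1 "$ip" >/dev/null && echo "$ip ok" || echo "$ip FAILED"
done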

After a while I sometimes get an invalid-token error, or some other non-descriptive join error.

Looking at the logs (`journalctl -f`) I see lines like:

Code:
pveproxy[3906]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2009.
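
That key should get provisioned during the join. This is what I've been checking; my understanding is that `pvecm updatecerts` can regenerate the node certificates, but only once /etc/pve is writable (i.e. the node is quorate):

Bash:
# Does the key the error complains about even exist?
ls -l /etc/pve/local/pve-ssl.key /etc/pve/local/pve-ssl.pem

# Should regenerate node certs - but presumably needs quorum first
pvecm updatecerts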

Looking at the cluster filesystem folder (/etc/pve) on the new node, it's mostly blank compared to the established nodes:

Bash:
# ls -alh
total 4.5K
drwxr-xr-x  2 root www-data    0 Dec 31  1969 .
drwxr-xr-x 92 root root     4.0K Apr  4 23:44 ..
-r--r-----  1 root www-data  155 Dec 31  1969 .clusterlog
-r--r-----  1 root www-data  633 Apr  4 23:39 corosync.conf
-rw-r-----  1 root www-data    2 Dec 31  1969 .debug
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 local -> nodes/pve-i1n4
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 lxc -> nodes/pve-i1n4/lxc
-r--r-----  1 root www-data  309 Dec 31  1969 .members
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 openvz -> nodes/pve-i1n4/openvz
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 qemu-server -> nodes/pve-i1n4/qemu-server
-r--r-----  1 root www-data  222 Dec 31  1969 .rrd
-r--r-----  1 root www-data  924 Dec 31  1969 .version
-r--r-----  1 root www-data   18 Dec 31  1969 .vmlist
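
By contrast, a healthy node carries a lot more in there - from memory, roughly:

Bash:
# Same listing on an established node also shows, among others:
ls /etc/pve
# authkey.pub  corosync.conf  datacenter.cfg  local  lxc  nodes
# priv  pve-root-ca.pem  pve-www.key  qemu-server  storage.cfg  user.cfg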

- corosync.conf looks OK, with all nodes accounted for (the per-node entries look roughly like the sketch below).
- The other nodes do see a nodes/pve-i1n4 folder for this new node, but the nodes folder is completely missing in /etc/pve/ on the new node.
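
For reference, the new node's entry follows the standard corosync.conf nodelist format; the values below are taken from the pvecm output above, and the rest of the file is elided:

Code:
nodelist {
  node {
    name: pve-i1n4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.45.78
  }
  # ... entries for nodes 1-3 ...
}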

pve-cluster.service appears OK.

Code:
Apr 05 00:26:38 pve-i1n4 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Apr 05 00:26:38 pve-i1n4 pmxcfs[8008]: [main] notice: resolved node name 'pve-i1n4' to '10.0.40.78' for default node IP address
Apr 05 00:26:38 pve-i1n4 pmxcfs[8008]: [main] notice: resolved node name 'pve-i1n4' to '10.0.40.78' for default node IP address
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [status] notice: update cluster info (cluster name  pve-i1, version = 12)
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [dcdb] notice: members: 4/8010
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [dcdb] notice: all data is up to date
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [status] notice: members: 4/8010
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [status] notice: all data is up to date
Apr 05 00:26:39 pve-i1n4 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
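
Since pmxcfs claims "all data is up to date", I also compared its own membership view; .members is one of the few files actually present on the new node (see the listing above):

Bash:
# pmxcfs's view of the cluster membership - compare new node vs. an established one
cat /etc/pve/.members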

But of course there are issues - namely, the web UI won't allow logins anymore (because SSL is broken?). `pvecm status` on the new node shows activity blocked:

Code:
# pvecm status
Cluster information
-------------------
Name:             pve-i1
Config Version:   12
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Apr  5 00:30:38 2024
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4.f
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:          

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.0.45.78 (local)

The blocked state itself makes sense: expected votes is 4, so the quorum threshold is 4/2 + 1 = 3, and this node only sees its own single vote. What I'm unsure of is how to check why it can't see the other three. On the established cluster, pve-cluster.service seems happy:

Code:
Apr 04 23:57:20 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 04 23:59:17 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:08:00 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:14:17 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:16:29 pve-i1n1 pmxcfs[2061]: [dcdb] notice: data verification successful
Apr 05 00:23:20 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:29:18 pve-i1n1 pmxcfs[2061]: [status] notice: received log
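
On the new node's side, this is what I've been tailing to figure out why it can't see the others (corosync-quorumtool should agree with pvecm):

Bash:
# Corosync's own log since boot - knet link up/down events land here
journalctl -u corosync -b --no-pager | tail -n 50

# Quorum state straight from corosync, independent of the PVE tooling
corosync-quorumtool -s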

I've already tried to help it along by attempting to create the missing folders, thinking that at some point it would have what it needed and continue:

Code:
/etc/pve# mkdir nodes
mkdir: cannot create directory ‘nodes’: Permission denied

but no such luck.
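
Which, as far as I can tell, is expected: /etc/pve isn't a regular directory but the pmxcfs FUSE mount, and my understanding is that it stays read-only while the node has no quorum:

Bash:
# /etc/pve is a FUSE mount backed by pmxcfs, not an on-disk directory
findmnt /etc/pve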

= Follow-up to "namely the web UI won't allow logins anymore (because SSL is broken?)": in the logs I'm now noticing it killing the web workers after ~1s, so it must be aborting the login. SSH still works as you'd expect, since it's a separate daemon.
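
This is how I'm watching the workers die, for what it's worth:

Bash:
# pveproxy serves the web UI; pvedaemon backs the API
journalctl -u pveproxy -b --no-pager | tail -n 30
systemctl status pveproxy pvedaemon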

= Adding: I've also run an `apt full-upgrade` as suggested in another thread - no improvement.

On each node I've checked /etc/hosts: each has this node's IP / FQDN / hostname (sample entry below), and each node in the established cluster can ping the new node by hostname.
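
The entry in question looks like this on every host (domain anonymized, matching the ping output below):

Code:
10.0.40.78    pve-i1n4.domain.tld    pve-i1n4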

From an established node:

Code:
# ping -c1 pve-i1n4
PING pve-i1n4.domain.tld (10.0.40.78) 56(84) bytes of data.
64 bytes from pve-i1n4.domain.tld (10.0.40.78): icmp_seq=1 ttl=64 time=0.315 ms

--- pve-i1n4.domain.tld ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.315/0.315/0.315/0.000 ms
 