Hey Everyone,
Hoping to get some tips diagnosing an issue adding a node to an established cluster.
= Starting config
- 3 nodes
- Management UI is on VLAN 40 and Corosync is on VLAN 45, on separate Ethernet ports. Originally I had both on VLAN 40, then migrated Corosync to VLAN 45 by adding a new ring and removing the old one (the output below still shows the old VLAN for some reason).
- Existing 3 nodes are happy
Bash:
# pvecm status
Cluster information
-------------------
Name: pve-i1
Config Version: 12
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Apr 5 00:07:32 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.5123
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.40.78 (local)
0x00000002 1 10.0.40.79
0x00000003 1 10.0.40.80
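Since the output still shows the old VLAN after the ring migration, it may be worth confirming which link corosync is actually using. A sketch, run on an established node (`corosync-cfgtool -s` prints per-link status with knet):

```shell
# Show the local node's ring/link status; with knet this lists each
# link ID and whether each peer is connected over it.
corosync-cfgtool -s

# Check which ring addresses are actually configured per node.
grep -E 'ring[0-9]_addr' /etc/corosync/corosync.conf
```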
The new node is a fresh install; I ran `apt upgrade` to bring it to a similar patch level. From the new node I verified that it could reach every VLAN the established cluster can see, pinging each node's IP. When I grab the cluster join token and drop it into the new node, a modal dialog with status pops up. It shows,
Code:
Establishing API connection with host '10.0.40.77'
Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
(10.0.40.77 is management IP of node 3)
After a while I sometimes get an invalid token error, or a generic join error with no detail.
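For the reachability check mentioned above, something like this loop works (the IP list is illustrative; substitute the actual management and Corosync addresses of each node):

```shell
# Ping every cluster address on both VLANs from the new node; any
# UNREACHABLE line points at a missing VLAN or a routing problem.
for ip in 10.0.40.78 10.0.40.79 10.0.40.80 \
          10.0.45.78 10.0.45.79 10.0.45.80; do
    if ping -c1 -W1 "$ip" >/dev/null 2>&1; then
        echo "$ip reachable"
    else
        echo "$ip UNREACHABLE"
    fi
done
```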
Looking in the logs (journalctl -f) I see some lines,
Code:
pveproxy[3906]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2009.
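That error suggests /etc/pve/local/pve-ssl.key is a dangling symlink (local points into nodes/pve-i1n4, which never got populated here). A quick way to confirm, plus the usual fix once the node actually has quorum (a sketch; `pvecm updatecerts` is the stock Proxmox helper for regenerating node certificates):

```shell
# local is a symlink into nodes/<hostname>; if that directory never
# got created, the key path resolves to nothing.
ls -l /etc/pve/local/pve-ssl.key /etc/pve/local/pve-ssl.pem

# Once the node has quorum, regenerate the node certificates.
pvecm updatecerts --force
```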
Looking at the clusterfs folder (/etc/pve) on the new node ... it's mostly blank compared to the other nodes.
Bash:
# ls -alh
total 4.5K
drwxr-xr-x 2 root www-data 0 Dec 31 1969 .
drwxr-xr-x 92 root root 4.0K Apr 4 23:44 ..
-r--r----- 1 root www-data 155 Dec 31 1969 .clusterlog
-r--r----- 1 root www-data 633 Apr 4 23:39 corosync.conf
-rw-r----- 1 root www-data 2 Dec 31 1969 .debug
lr-xr-xr-x 1 root www-data 0 Dec 31 1969 local -> nodes/pve-i1n4
lr-xr-xr-x 1 root www-data 0 Dec 31 1969 lxc -> nodes/pve-i1n4/lxc
-r--r----- 1 root www-data 309 Dec 31 1969 .members
lr-xr-xr-x 1 root www-data 0 Dec 31 1969 openvz -> nodes/pve-i1n4/openvz
lr-xr-xr-x 1 root www-data 0 Dec 31 1969 qemu-server -> nodes/pve-i1n4/qemu-server
-r--r----- 1 root www-data 222 Dec 31 1969 .rrd
-r--r----- 1 root www-data 924 Dec 31 1969 .version
-r--r----- 1 root www-data 18 Dec 31 1969 .vmlist
- corosync.conf looks OK with all nodes accounted for.
- the other nodes do see a nodes/pve-i1n4 folder for this new node, but the nodes folder is completely missing from /etc/pve/ on the new node.
pve-cluster.service appears OK.
Code:
Apr 05 00:26:38 pve-i1n4 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Apr 05 00:26:38 pve-i1n4 pmxcfs[8008]: [main] notice: resolved node name 'pve-i1n4' to '10.0.40.78' for default node IP address
Apr 05 00:26:38 pve-i1n4 pmxcfs[8008]: [main] notice: resolved node name 'pve-i1n4' to '10.0.40.78' for default node IP address
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [status] notice: update cluster info (cluster name pve-i1, version = 12)
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [dcdb] notice: members: 4/8010
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [dcdb] notice: all data is up to date
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [status] notice: members: 4/8010
Apr 05 00:26:38 pve-i1n4 pmxcfs[8010]: [status] notice: all data is up to date
Apr 05 00:26:39 pve-i1n4 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
But of course there are issues - namely the web UI won't allow logins anymore (because SSL is broken?), and status on the new node shows activity blocked,
Code:
# pvecm status
Cluster information
-------------------
Name: pve-i1
Config Version: 12
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Apr 5 00:30:38 2024
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000004
Ring ID: 4.f
Quorate: No
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 1
Quorum: 3 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000004 1 10.0.45.78 (local)
but I'm unsure how to check why. On the established cluster, pve-cluster.service seems happy:
Code:
Apr 04 23:57:20 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 04 23:59:17 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:08:00 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:14:17 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:16:29 pve-i1n1 pmxcfs[2061]: [dcdb] notice: data verification successful
Apr 05 00:23:20 pve-i1n1 pmxcfs[2061]: [status] notice: received log
Apr 05 00:29:18 pve-i1n1 pmxcfs[2061]: [status] notice: received log
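Since pvecm on the new node only ever sees itself (Nodes: 1) while expecting 4 votes, one thing worth checking on the new node is whether the knet links to the peers ever come up (a sketch):

```shell
# Per-peer link status from the local node's point of view; peers
# stuck at "disconnected" mean corosync traffic isn't getting through.
corosync-cfgtool -s | grep -Ei 'link|nodeid|connected'

# Follow corosync's own log to watch membership (not) forming.
journalctl -u corosync -f
```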
I've already tried to help it along by creating the missing folders manually, thinking that at some point it would have what it needed and continue,
Code:
/etc/pve# mkdir nodes
mkdir: cannot create directory ‘nodes’: Permission denied
but no such luck.
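The Permission denied is expected even as root: /etc/pve is the pmxcfs FUSE mount, not a regular directory, and pmxcfs rejects writes while the node has no quorum. A way to see both facts (sketch):

```shell
# /etc/pve is a FUSE filesystem provided by pmxcfs, not on-disk storage.
grep /etc/pve /proc/mounts

# With "Quorate: No" above, pmxcfs refuses writes, which is why
# mkdir fails even for root.
pvecm status | grep -i quorate
```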
= Follow-up to "namely the web UI won't allow logins anymore (b/c SSL is broken?)": in the logs I'm now noticing it killing the web workers after ~1 s, so it must be aborting the login. SSH still works, as you'd expect, since it's a separate daemon.
= Adding: I've also run an `apt full-upgrade` as suggested in another thread - no improvement.
I've checked /etc/hosts on each node: each one has the new node's IP / FQDN / hostname, and every node in the established cluster can ping the new node by hostname.
From established node,
Code:
# ping -c1 pve-i1n4
PING pve-i1n4.domain.tld (10.0.40.78) 56(84) bytes of data.
64 bytes from pve-i1n4.domain.tld (10.0.40.78): icmp_seq=1 ttl=64 time=0.315 ms
--- pve-i1n4.domain.tld ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.715/0.715/0.715/0.000 ms
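The per-node hostname check can be scripted, roughly like this (the n1-n4 hostnames are assumed from the node names above; it also assumes root SSH between the nodes):

```shell
# Ask every node how it resolves the new node's name; the answers
# should all agree (and match the address corosync expects).
for host in pve-i1n1 pve-i1n2 pve-i1n3 pve-i1n4; do
    echo "== $host =="
    ssh "root@$host" "getent hosts pve-i1n4"
done
```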