Problems joining new node with 6.4 to cluster with 6.0

skraw

Hello all,

I tried to join a new node to an existing cluster running version 6.0-4 today. The new node has version 6.4 (I thought it would be a good idea to use the latest minor version). At first everything went well, but after completion it turned out the node is not active in the cluster. It was entered in corosync.conf with the correct link values. But the node itself shows a problem:
Dec 15 13:32:29 pm-249 pveproxy[2037]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1907.

In fact, /etc/pve looks like this:

root@pm-249:~# ls -l /etc/pve/
total 1
-r--r----- 1 root www-data 1035 Dec 15 13:05 corosync.conf
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 local -> nodes/pm-249
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 lxc -> nodes/pm-249/lxc
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 openvz -> nodes/pm-249/openvz
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 qemu-server -> nodes/pm-249/qemu-server

Which basically means that the folder "nodes" is missing. Probably that is why it cannot join correctly. And I cannot log in to the web GUI.
Is there some way to solve this, besides re-installing with the old version 6.0-4?
Thank you for reading!
--
Regards
 
You do know that even PVE 6.4 has been end of life since this summer, so your nodes haven't been receiving security patches for a long time?
 
The thing is, we would like to consolidate the cluster with new nodes and kill off the old ones. So the idea was to put in one new node (which is large enough to take all of the cluster's vhosts) with 6.4, which is needed for the update to 7.X anyway. So: put in the node, migrate everything to it, kill all the old nodes, add another new node, and then upgrade both to 7.X. That was the plan.

The issue looks like a problem with a fuse mount, correct?
 
Let me mention that we did read the doc you pointed to beforehand, and there it says:

Preconditions

  • Upgraded to the latest version of Proxmox VE 6.4 (check correct package repository configuration)

So we are in fact at the beginning of the upgrade you requested, and we are finding out that even the one-by-one update to 6.4 seems to create serious problems.
We would be in exactly the same position if we took one of the old nodes to 6.4. Very likely it would not work with the old nodes' cluster either.
 
Hello,

The issue looks like a problem with a fuse mount, correct?

/etc/pve is a FUSE mount, correct. It is provided via pmxcfs [1], which is controlled by the pve-cluster systemd service.
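
As a quick first check on the affected node, you can verify that the mount is actually provided by pmxcfs and look at the state of the service behind it. A minimal sketch with standard tools (the output will of course differ on your system):

Code:
# is /etc/pve really the pmxcfs FUSE mount? (FSTYPE should be fuse)
findmnt /etc/pve

# state and recent logs of the service that provides the mount
systemctl status pve-cluster
journalctl -u pve-cluster --since "-1 hour"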

root@pm-249:~# ls -l /etc/pve/
total 1
-r--r----- 1 root www-data 1035 Dec 15 13:05 corosync.conf
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 local -> nodes/pm-249
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 lxc -> nodes/pm-249/lxc
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 openvz -> nodes/pm-249/openvz
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 qemu-server -> nodes/pm-249/qemu-server

Is this the full output of this command? Because other files seem to be missing as well. Does journalctl -u pve-cluster show any errors for the pve-cluster service?

Do the other nodes recognize the new node as part of your cluster? For that, please show us the output of pvecm status.

In general, adding a node with 6.4 to a cluster running 6.0 should work; I verified this myself yesterday in some quick experiments.


[1] https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)

Hope this helps,
 
Hello,

thank you for coming back to the issue. Here is the requested output (hopefully correctly inlined):

# journalctl -u pve-cluster

Code:
Dec 15 13:31:25 pm-249 systemd[1]: Starting The Proxmox VE cluster filesystem...
Dec 15 13:31:25 pm-249 pmxcfs[1631]: [quorum] crit: quorum_initialize failed: 2
Dec 15 13:31:25 pm-249 pmxcfs[1631]: [quorum] crit: can't initialize service
Dec 15 13:31:25 pm-249 pmxcfs[1631]: [confdb] crit: cmap_initialize failed: 2
Dec 15 13:31:25 pm-249 pmxcfs[1631]: [confdb] crit: can't initialize service
Dec 15 13:31:25 pm-249 pmxcfs[1631]: [dcdb] crit: cpg_initialize failed: 2
Dec 15 13:31:25 pm-249 pmxcfs[1631]: [dcdb] crit: can't initialize service
Dec 15 13:31:25 pm-249 pmxcfs[1631]: [status] crit: cpg_initialize failed: 2
Dec 15 13:31:25 pm-249 pmxcfs[1631]: [status] crit: can't initialize service
Dec 15 13:31:26 pm-249 systemd[1]: Started The Proxmox VE cluster filesystem.
Dec 15 13:31:31 pm-249 pmxcfs[1631]: [status] notice: update cluster info (cluster name  XXX, version = 20)
Dec 15 13:31:41 pm-249 pmxcfs[1631]: [dcdb] notice: members: 5/1631
Dec 15 13:31:41 pm-249 pmxcfs[1631]: [dcdb] notice: all data is up to date
Dec 15 13:31:41 pm-249 pmxcfs[1631]: [status] notice: members: 5/1631
Dec 15 13:31:41 pm-249 pmxcfs[1631]: [status] notice: all data is up to date
Dec 15 13:32:18 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 10
Dec 15 13:32:19 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 20
Dec 15 13:32:20 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 30
Dec 15 13:32:21 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 40
Dec 15 13:32:22 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 50
Dec 15 13:32:23 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 60
Dec 15 13:32:24 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 70
Dec 15 13:32:25 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 80
Dec 15 13:32:26 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 90
Dec 15 13:32:26 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retried 91 times
Dec 15 13:33:08 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 10
Dec 15 13:33:09 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 20
Dec 15 13:33:10 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 30
Dec 15 13:33:10 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retried 33 times
Dec 15 13:35:38 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 10
Dec 15 13:35:39 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 20
Dec 15 13:35:40 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 30
Dec 15 13:35:41 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 40
Dec 15 13:35:42 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 50
Dec 15 13:35:43 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 60
Dec 15 13:35:44 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 70
Dec 15 13:35:45 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 80
Dec 15 13:35:46 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 90
Dec 15 13:35:47 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retry 100
Dec 15 13:35:47 pm-249 pmxcfs[1631]: [status] notice: cpg_send_message retried 100 times
Dec 15 13:35:47 pm-249 pmxcfs[1631]: [status] crit: cpg_send_message failed: 6


Yes, the ls -l shown is really the complete list. Indeed some other files are missing, too.

And here the output of pvecm status on the cluster:

Code:
Quorum information
------------------
Date:             Fri Dec 16 09:19:47 2022
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/7464
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      4
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.254.16 (local)
0x00000002          1 192.168.254.252
0x00000003          1 192.168.254.251
0x00000004          1 192.168.254.17

In fact the cluster nodes' configs look OK. The 5th node is in corosync.conf.
I really don't know how this can happen, besides incompatible software versions, or how to rescue this.
Any hints welcome!
--
Regards
 
The thing is, we would like to consolidate the cluster with new nodes and kill off the old ones. So the idea was to put in one new node (which is large enough to take all of the cluster's vhosts) with 6.4, which is needed for the update to 7.X anyway. So: put in the node, migrate everything to it, kill all the old nodes, add another new node, and then upgrade both to 7.X. That was the plan.
Just a random thought: Since you are planning to kill all the old nodes anyway, have you thought about building an entirely new cluster on 7.3 and then using backup/restore to bring over all guests? Or would this be too much downtime?
 
You're right. The major reason against this idea is the downtime. But there are others as well. We need to use the old servers for some time as a failover resource until all the new hardware is delivered (which is beyond our influence).

PS: What's the right tag to include lists like the above? Obviously CODE is not... sorry for that.
 
Is there a possibility that the problem has to do with passwords? When I joined the node to the cluster, this node had a different password set during the setup phase.
Is there a way to change this password for pve? Or is "passwd root" sufficient?
 
I've discussed this with a more senior colleague, who agreed with me that backup/restore would probably be the best approach. That being said, if this approach does not work for you, aim to only add nodes with the same version to the cluster. Smaller version jumps should normally not be an issue (since this is a normal situation in cluster upgrades, where one node after the other is updated to the next minor version); however, larger jumps between cluster nodes might lead to errors like the one you've experienced. For the future, make sure that you keep your cluster on up-to-date versions. 6.0 was released in summer 2019, which means that you went almost 3.5 years without any (security) updates.

To fix the current situation, I'd first remove the new node from the cluster (pvecm delnode <new_node_hostname>; see the sketch below the list) and then either:
  • try again with 6.0 on the new node or
  • bring the old nodes to at least 6.4 and then add the new node with the same version (freshly installed, to be sure)
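A rough sketch of the commands involved, using the node name and member address from your outputs above (adjust to your environment; the rejoin assumes a freshly installed node on a matching version):

Code:
# on one of the healthy, quorate cluster nodes: remove the broken node
pvecm delnode pm-249

# on the freshly (re)installed new node: join via an existing member,
# optionally pinning the corosync link address of the new node
pvecm add 192.168.254.16 --link0 <new_node_link0_address>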
You mentioned previously that you had serious trouble updating the nodes one-by-one to 6.4? What issues did you run into?

PS: What's the right tag to include lists like the above? Obviously CODE is not... sorry for that.
CODE tags are the correct ones, but you used ICODE (inline code).

Is there a possibility that the problem has to do with passwords? When I joined the node to the cluster, this node had a different password set during the setup phase.
Don't think so. When joining nodes, the password is only used for SSHing into the cluster once to set up things - after that, everything is key/certificate based.
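
If you want to rule that out anyway, you can check that key-based SSH between the nodes works and whether the per-node certificate files that pveproxy complains about exist at all. A small sketch (the placeholder is your new node's address; paths are the standard PVE locations):

Code:
# from an existing cluster node: must work without a password prompt
ssh -o BatchMode=yes root@<new_node_ip> hostname

# on the new node: the certificate files pveproxy is looking for
ls -l /etc/pve/local/pve-ssl.key /etc/pve/local/pve-ssl.pem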
 
Thanks for your help.
I changed the ICODE, thanks for the hint.

Last question: how can I update in small steps on old versions? Let's say I start from 6.0. How do I update to 6.1? My understanding so far was that this is only possible as one step to the current latest minor version of the respective major release the node is on.
 
Last question: how can I update in small steps on old versions? Let's say I start from 6.0. How do I update to 6.1? My understanding so far was that this is only possible as one step to the current latest minor version of the respective major release the node is on.
You understood that correctly: you can only update to the most recent minor version, since there is only one common repo for every major release. This is why it is generally recommended to update regularly, in order to avoid large version jumps. We of course try to ensure that there is no breakage between versions, but since most testing takes place for x.y -> x.y+1, it can happen that some issues for larger jumps are never discovered.
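
For illustration, this is roughly what the package sources look like on a 6.x node using the no-subscription repository (PVE 6.x is based on Debian Buster; the file name is just an example, the line can also live in /etc/apt/sources.list). Since there is only this one repository per major release, an upgrade within 6.x always takes you to the latest minor version:

Code:
# /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian/pve buster pve-no-subscription

# upgrade within the major release
apt update
apt dist-upgrade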

So for the future, make sure that you keep your system up to date to avoid situations like these. If stability is your main concern and the reason for delaying upgrades, I'd suggest buying subscriptions for your servers, giving you access to the enterprise repository.

Hope this helps,
 
