Cannot Add Proxmox 5.x node to Prox 4.x cluster

Mar 24, 2017
I recently built 8 additional nodes (licensed) to add to our existing (licensed) cluster. When trying to add the new nodes, which were built using Proxmox 5, I received the following message:


Code:
Login succeeded.
Request addition of this node
Remote side is not able to use API for Cluster join! Pass the 'use_ssh' switch or update the remote side.

I subsequently used the --use_ssh switch, which added the node to the cluster, but corosync does not function.

Code:
root@prox21:~# pvecm status
Cannot initialize CMAP service
root@prox21:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Tue 2018-07-24 10:23:22 PDT; 2min 40s ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 2871 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=TERM)
Main PID: 2871 (code=killed, signal=TERM)
CPU: 37ms

Jul 24 10:21:52 prox21 systemd[1]: Starting Corosync Cluster Engine...
Jul 24 10:21:52 prox21 corosync[2871]: [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide
Jul 24 10:21:52 prox21 corosync[2871]: notice [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to p
Jul 24 10:21:52 prox21 corosync[2871]: info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augea
Jul 24 10:21:52 prox21 corosync[2871]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas syste
Jul 24 10:23:22 prox21 systemd[1]: corosync.service: Start operation timed out. Terminating.
Jul 24 10:23:22 prox21 systemd[1]: Failed to start Corosync Cluster Engine.
Jul 24 10:23:22 prox21 systemd[1]: corosync.service: Unit entered failed state.
Jul 24 10:23:22 prox21 systemd[1]: corosync.service: Failed with result 'timeout'.
root@prox21:~# journalctl -xe
Jul 24 10:25:40 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:25:40 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:25:40 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:25:40 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:25:46 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:25:46 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:25:46 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:25:46 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:25:52 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:25:52 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:25:52 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:25:52 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:25:58 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:25:58 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:25:58 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:25:58 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:26:00 prox21 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit pvesr.service has begun starting up.
Jul 24 10:26:01 prox21 pvesr[3167]: error with cfs lock 'file-replication_cfg': no quorum!
Jul 24 10:26:01 prox21 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 24 10:26:01 prox21 systemd[1]: Failed to start Proxmox VE replication runner.
-- Subject: Unit pvesr.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit pvesr.service has failed.
--
-- The result is failed.
Jul 24 10:26:01 prox21 systemd[1]: pvesr.service: Unit entered failed state.
Jul 24 10:26:01 prox21 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 24 10:26:01 prox21 cron[2492]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Jul 24 10:26:04 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:26:04 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:26:04 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:26:04 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:26:10 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:26:10 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:26:10 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:26:10 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2

PVE Version of existing cluster:
pve-manager/4.4-13/7ea56165 (running kernel: 4.4.44-1-pve)

PVE Version of Node to add:
pve-manager/5.2-5/eb24855a (running kernel: 4.15.17-1-pve)

I have tried restarting corosync and pve-cluster. The two servers are on the same network. The new node actually appears in the cluster, and SSH logins work correctly; it simply will not sync or communicate further.
 
As far as I know, only nodes of the same version (major.minor) can operate the right way in a cluster.
 
I've clustered different versions before, both 2 with 3 and 3 with 4. The only issues encountered revolved around live migrations between servers of different versions. I've never encountered corosync issues like this before.

This is a major issue, as it raises the basic question of "how do you update a large cluster?". Do you simply build a new cluster and move guests over with manual archive transfers? Do you shut your entire operation down to update all the cluster nodes at one time?

At this point, I am concerned about performing any type of upgrade within the cluster, as I do not know if it will function at all afterwards, which is of course a very serious issue.
 
Hi,
Corosync should be compatible as long as the protocol version is the same.
PVE 4 and PVE 5 both use version 2.
For upgrades this is OK, but we have never tested joining new nodes to an old cluster.

Please send the corosync.conf from /etc/pve/corosync.conf
 
The two files are the same size and differ only in the order of entries. I've attached both: the new node prox21 and the node that was used for the cluster add.
 

Attachments

  • new_node_prox21.txt
  • old_node_prox1.txt
Also the syslog in the master node shows this:

Code:
Jul 25 10:43:43 prox1 pveproxy[183638]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
Jul 25 10:43:58 prox1 pveproxy[183637]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
Jul 25 10:44:41 prox1 pveproxy[183639]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
Jul 25 10:44:43 prox1 pveproxy[183638]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
Jul 25 10:44:58 prox1 pveproxy[183637]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
 
You mixed hostnames and IP addresses in the ring0_addr field of the config.
This does not work.
I would recommend you set them all to IP addresses.
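For example, a nodelist with all ring0_addr entries set to IP addresses would look something like this (the node names and addresses below are placeholders for illustration, not taken from the attached configs):

```
nodelist {
  node {
    name: prox1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.0.2.1
  }
  node {
    name: prox21
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.0.2.21
  }
}
```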
Also the syslog in the master node shows this:
Proxmox has no master node, and this error is not related to corosync or the cluster engine.
It is a certificate error from the web server.
 
Hi
I realize that there is no "master". I simply used that term to refer to the node that we always start first and always use to add other nodes to.

The only difference this time is that the normal pvecm add command did not work; instead, I had to use the --use_ssh flag, which has never been the case before. For all past additions, I also only used an IP address as part of this process, never a hostname. The new node is listed in our internal DNS, as are all of the other servers. So other than the --use_ssh flag and the new Proxmox version, I know of no other reason why it would be listed in corosync.conf as a hostname instead of as an IP. The cluster has been in operation this way for nearly 2 years without prior issues.

Can I edit the corosync.conf to correct this for the new node?
 
Hi
Looking again for advice on how to address the corosync comments above regarding IP addresses vs. hostnames.

When we add a node we do not specify a hostname; we simply use the pvecm command to add the node as listed. If it is adding hostnames to corosync.conf, it is doing so by default. Why would the default contradict the recommendation in the documentation?

Is there a way to have it use only IPs in corosync.conf when adding a node?

Do I need to destroy my cluster and rebuild it to correct this, or is there a method to edit corosync.conf to correct it?
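As a quick sanity check before editing anything, a small script can flag which ring0_addr entries in a corosync.conf are hostnames rather than literal IP addresses. This is only a sketch: the parsing is naive (a simple regex, not a full corosync.conf parser), and it assumes the standard config path /etc/pve/corosync.conf.

```python
import ipaddress
import re

def find_non_ip_ring0(conf_text):
    """Return ring0_addr values from corosync.conf text that are not literal IPs."""
    offenders = []
    for match in re.finditer(r"ring0_addr\s*:\s*(\S+)", conf_text):
        value = match.group(1)
        try:
            ipaddress.ip_address(value)   # parses IPv4 and IPv6 literals
        except ValueError:
            offenders.append(value)       # not a literal IP, so a hostname
    return offenders

# Usage on a node:
#   find_non_ip_ring0(open("/etc/pve/corosync.conf").read())
# An empty list means every ring0_addr is already an IP address.
```

Any hostnames it reports are the entries the earlier advice says to replace with IP addresses.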
 