Cannot Add Proxmox 5.x node to Prox 4.x cluster

Mar 24, 2017
I recently built 8 additional nodes (licensed) to add to our existing (licensed) cluster. When trying to add the new nodes, which were built using Proxmox 5, I received the following message:


Code:
Login succeeded.
Request addition of this node
Remote side is not able to use API for Cluster join! Pass the 'use_ssh' switch or update the remote side.

I subsequently used the --use_ssh switch, which added the node to the cluster, but corosync does not function.

Code:
root@prox21:~# pvecm status
Cannot initialize CMAP service
root@prox21:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Tue 2018-07-24 10:23:22 PDT; 2min 40s ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 2871 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=TERM)
Main PID: 2871 (code=killed, signal=TERM)
CPU: 37ms

Jul 24 10:21:52 prox21 systemd[1]: Starting Corosync Cluster Engine...
Jul 24 10:21:52 prox21 corosync[2871]: [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide
Jul 24 10:21:52 prox21 corosync[2871]: notice [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to p
Jul 24 10:21:52 prox21 corosync[2871]: info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augea
Jul 24 10:21:52 prox21 corosync[2871]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas syste
Jul 24 10:23:22 prox21 systemd[1]: corosync.service: Start operation timed out. Terminating.
Jul 24 10:23:22 prox21 systemd[1]: Failed to start Corosync Cluster Engine.
Jul 24 10:23:22 prox21 systemd[1]: corosync.service: Unit entered failed state.
Jul 24 10:23:22 prox21 systemd[1]: corosync.service: Failed with result 'timeout'.
root@prox21:~# journalctl -xe
Jul 24 10:25:40 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:25:40 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:25:40 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:25:40 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:25:46 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:25:46 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:25:46 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:25:46 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:25:52 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:25:52 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:25:52 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:25:52 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:25:58 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:25:58 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:25:58 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:25:58 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:26:00 prox21 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit pvesr.service has begun starting up.
Jul 24 10:26:01 prox21 pvesr[3167]: error with cfs lock 'file-replication_cfg': no quorum!
Jul 24 10:26:01 prox21 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 24 10:26:01 prox21 systemd[1]: Failed to start Proxmox VE replication runner.
-- Subject: Unit pvesr.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit pvesr.service has failed.
--
-- The result is failed.
Jul 24 10:26:01 prox21 systemd[1]: pvesr.service: Unit entered failed state.
Jul 24 10:26:01 prox21 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 24 10:26:01 prox21 cron[2492]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Jul 24 10:26:04 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:26:04 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:26:04 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:26:04 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2
Jul 24 10:26:10 prox21 pmxcfs[2889]: [quorum] crit: quorum_initialize failed: 2
Jul 24 10:26:10 prox21 pmxcfs[2889]: [confdb] crit: cmap_initialize failed: 2
Jul 24 10:26:10 prox21 pmxcfs[2889]: [dcdb] crit: cpg_initialize failed: 2
Jul 24 10:26:10 prox21 pmxcfs[2889]: [status] crit: cpg_initialize failed: 2

PVE Version of existing cluster:
pve-manager/4.4-13/7ea56165 (running kernel: 4.4.44-1-pve)

PVE Version of Node to add:
pve-manager/5.2-5/eb24855a (running kernel: 4.15.17-1-pve)

I have tried restarting corosync and pve-cluster. The two servers are on the same network. The new node actually appears in the cluster, and SSH logins work correctly; it simply will not sync or communicate further.
 
As far as I know, only nodes of the same version (major.minor) can operate the right way in a cluster.
 
I've clustered different versions before, both 2 with 3 and 3 with 4. The only issues encountered revolved around live migrations between servers of different versions. I've never encountered corosync issues like this before.

This is a major issue, as it raises the basic question of "how do you update a large cluster?". Do you simply build a new cluster and move guests over with manual archive transfers? Do you shut your entire operation down to update all the cluster nodes at one time?

At this point, I am concerned about performing any type of upgrade within the cluster, as I do not know if it will function at all afterwards, which is of course a very serious issue.
 
Hi,
Corosync should be compatible as long as the protocol version is the same.
PVE 4 and PVE 5 both use version 2.
For upgrades this is OK, but we have never tested joining new nodes to an old cluster.

Please send the corosync.conf from /etc/pve/corosync.conf
 
The two files are the same size and differ only in the order of entries. I've attached both: the new node prox21 and the node that was used for the cluster add.
 

Attachments

  • new_node_prox21.txt
  • old_node_prox1.txt
Also the syslog in the master node shows this:

Code:
Jul 25 10:43:43 prox1 pveproxy[183638]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
Jul 25 10:43:58 prox1 pveproxy[183637]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
Jul 25 10:44:41 prox1 pveproxy[183639]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
Jul 25 10:44:43 prox1 pveproxy[183638]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
Jul 25 10:44:58 prox1 pveproxy[183637]: Could not verify remote node certificate 'DE:A5:17:A9:69:E6:CA:CD:39:0D:27:1A:EB:60:64:23:AA:0D:31:D6:17:24:A1:FA:4F:E9:34:3C:B2:01:BC:1D' with list of pinned certificates, refreshing cache
 
You mixed hostnames and IP addresses in the ring0_addr field of the config.
This does not work.
I would recommend you set them all to IP addresses.
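For example, a nodelist with all ring0_addr entries set to IP addresses would look something like this (the node names and addresses below are placeholders for illustration, not taken from the attached configs):

```
nodelist {
  node {
    name: prox1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.0.2.1
  }
  node {
    name: prox21
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.0.2.21
  }
}
```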
Also the syslog in the master node shows this:
Proxmox has no master node, and this error is not related to corosync or the cluster engine.
It is a certificate error from the web server.
 
Hi
I realize that there is no "master". I simply used that term to refer to the node that we always start first and always use to add other nodes to.

The only difference this time is that the normal pvecm add command did not work; instead, I had to use the --use_ssh flag, which has never been the case before. For all past additions, I also only used an IP address as part of this process, never a hostname. The new node is listed in our internal DNS, as are all of the other servers. So other than the --use_ssh flag and the new Proxmox version, I know of no other reason why it would be listed in corosync.conf as a hostname instead of as an IP. The cluster has been in operation this way for nearly 2 years without prior issues.

Can I edit the corosync.conf to correct this for the new node?
 
Hi
Looking again for advice on how to address the corosync comments above regarding IP addresses vs. hostnames.

When we add a node we do not specify a hostname; we simply use the pvecm command to add the node as listed. If it is adding hostnames to corosync.conf, it is doing so by default. Why would the default contradict the recommendation in the documentation?

Is there a way to have it use only IPs in corosync.conf when adding a node?

Do I need to destroy my cluster and rebuild it to correct this, or is there a method to edit corosync.conf to correct it?
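As a quick sanity check before editing anything, a small script can flag which ring0_addr entries in a corosync.conf are hostnames rather than literal IP addresses. This is only a sketch: the parsing is naive (a simple regex, not a full corosync.conf parser), and it assumes the standard config path /etc/pve/corosync.conf.

```python
import ipaddress
import re

def find_non_ip_ring0(conf_text):
    """Return ring0_addr values from corosync.conf text that are not literal IPs."""
    offenders = []
    for match in re.finditer(r"ring0_addr\s*:\s*(\S+)", conf_text):
        value = match.group(1)
        try:
            ipaddress.ip_address(value)   # parses IPv4 and IPv6 literals
        except ValueError:
            offenders.append(value)       # not a literal IP, so a hostname
    return offenders

# Usage on a node:
#   find_non_ip_ring0(open("/etc/pve/corosync.conf").read())
# An empty list means every ring0_addr is already an IP address.
```

Any hostnames it reports are the entries the earlier advice says to replace with IP addresses.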
 