Cluster unable to add new node from Canada

kenneth_vkd · Aug 28, 2020

Hi
Today we provisioned a OVH dedicated server as a new node for our PVE cluster.
Upon adding the node from the UI and setting the ring0 address/link0 address, we could not get the node to join.
We tried to figure out why and finally got it to join.
However, there is a strange difference between /etc/corosync/corosync.conf and /etc/pve/corosync.conf

/etc/pve/corosync.conf is the same on all nodes, but /etc/corosync/corosync.conf is different. The newly added node has the same contents in both files
Here is diff from one of the existing nodes (diff /etc/pve/corosync.conf /etc/corosync/corosync.conf):

Diff:

20,25d19

<     name: ns570850

<     nodeid: 2

<     quorum_votes: 1

<     ring0_addr: 172.16.0.7

<   }

<   node {

51c45

<   config_version: 18

---

>   config_version: 20

53c47

<     bindnetaddr: 172.16.0.2

---

>     bindnetaddr: 172.16.0.5

58c52

<   transport: udpu

---

>   transport: knet

pvecm status shows config version 18 and transport udpu on all nodes

All previous nodes were started on PVE 5 (using the OVH proxmox5-zfs template) and then upgraded to PVE 6 by following the upgrade guide and upgrading corosync 2.x to corosync 3.x before the upgrade to pve 6
The new node was deployed also on PVE 5 (using the OVH proxmox5-zfs template) and then upgraded to PVE 6 before joining it to the cluster

Output of systemctl status pve-cluster.service on the new node:

Code:

● pve-cluster.service - The Proxmox VE cluster filesystem

   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)

   Active: active (running) since Fri 2020-08-28 10:43:17 CEST; 6min ago

  Process: 1507 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)

Main PID: 1531 (pmxcfs)

    Tasks: 5 (limit: 4915)

   Memory: 37.1M

   CGroup: /system.slice/pve-cluster.service

           └─1531 /usr/bin/pmxcfs



Aug 28 10:49:10 ns570850 pmxcfs[1531]: [dcdb] crit: cpg_initialize failed: 2

Aug 28 10:49:10 ns570850 pmxcfs[1531]: [status] crit: cpg_initialize failed: 2

Aug 28 10:49:16 ns570850 pmxcfs[1531]: [quorum] crit: quorum_initialize failed: 2

Aug 28 10:49:16 ns570850 pmxcfs[1531]: [confdb] crit: cmap_initialize failed: 2

Aug 28 10:49:16 ns570850 pmxcfs[1531]: [dcdb] crit: cpg_initialize failed: 2

Aug 28 10:49:16 ns570850 pmxcfs[1531]: [status] crit: cpg_initialize failed: 2

Aug 28 10:49:22 ns570850 pmxcfs[1531]: [quorum] crit: quorum_initialize failed: 2

Aug 28 10:49:22 ns570850 pmxcfs[1531]: [confdb] crit: cmap_initialize failed: 2

Aug 28 10:49:22 ns570850 pmxcfs[1531]: [dcdb] crit: cpg_initialize failed: 2

Aug 28 10:49:22 ns570850 pmxcfs[1531]: [status] crit: cpg_initialize failed: 2

Output of systemctl status corosync.service on the new node:

Code:

● corosync.service - Corosync Cluster Engine

   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)

   Active: failed (Result: exit-code) since Fri 2020-08-28 10:43:17 CEST; 6min ago

     Docs: man:corosync

           man:corosync.conf

           man:corosync_overview

  Process: 1646 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)

Main PID: 1646 (code=exited, status=8)



Aug 28 10:43:17 ns570850 systemd[1]: Starting Corosync Cluster Engine...

Aug 28 10:43:17 ns570850 corosync[1646]:   [MAIN  ] Corosync Cluster Engine 3.0.4 starting up

Aug 28 10:43:17 ns570850 corosync[1646]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow

Aug 28 10:43:17 ns570850 corosync[1646]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.

Aug 28 10:43:17 ns570850 corosync[1646]:   [MAIN  ] Please migrate config file to nodelist.

Aug 28 10:43:17 ns570850 corosync[1646]:   [MAIN  ] parse error in config: crypto_cipher & crypto_hash are only valid for the Knet transport.

Aug 28 10:43:17 ns570850 corosync[1646]:   [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1392.

Aug 28 10:43:17 ns570850 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a

Aug 28 10:43:17 ns570850 systemd[1]: corosync.service: Failed with result 'exit-code'.

Aug 28 10:43:17 ns570850 systemd[1]: Failed to start Corosync Cluster Engine.

Contents of /etc/pve/corosync.conf on the new node:

Code:

logging {

  debug: off

  to_syslog: yes

}



nodelist {

  node {

    name: ns3088794

    nodeid: 5

    quorum_votes: 1

    ring0_addr: 1.0.0.6

  }

  node {

    name: ns3128036

    nodeid: 1

    quorum_votes: 1

    ring0_addr: 1.0.0.5

  }

  node {

    name: ns570850

    nodeid: 2

    quorum_votes: 1

    ring0_addr: 1.0.0.7

  }

  node {

    name: ns61100575

    nodeid: 4

    quorum_votes: 1

    ring0_addr: 1.0.0.3

  }

  node {

    name: ns6136203

    nodeid: 3

    quorum_votes: 1

    ring0_addr: 1.0.0.2

  }

  node {

    name: ns631099096

    nodeid: 6

    quorum_votes: 1

    ring0_addr: 1.0.0.4

  }

}



quorum {

  provider: corosync_votequorum

}



totem {

  cluster_name: VKD-EU

  config_version: 18

  interface {

    bindnetaddr: 1.0.0.7

    ringnumber: 0

  }

  ip_version: ipv4

  secauth: on

  transport: udpu

  version: 2

}

All IP-addresses were replaced for security reasons
If I manually edit the /etc/corosync/corosync.conf on the new node to have knet as the transport, then corosync starts.

EDIT: Just redeployed the node and this time tried to add from the commandline using pvecm add 1.0.0.2 --force -link0 1.0.0.7 and this gave the same result

So there seems to be some kind of confusion on whether the cluster should run udpu or knet

kenneth_vkd · Aug 31, 2020

Hi again
Some more, maybe relevant information.
All pre-existing nodes are located in the OVH datacenters across Europe, while the new node is located in Canada. Latency of a ping is between 150-300ms. But would this make it impossible to have it join the cluster?

wolfgang · Aug 31, 2020

Hello,

corosync is very sensitive to latency, so problems with it are displayed in the syslog.
The larger the cluster, the more likely it is that it will fail and the new node will not match the others.
You could copy the file from pve to corosync and restart the corosync.service, but I don't think this will work for long.
The OVH virtual rack is a shared medium for which no latency can be guaranteed.

kenneth_vkd · Aug 31, 2020

wolfgang said:
Hello,

corosync is very sensitive to latency, so problems with it are displayed in the syslog.
The larger the cluster, the more likely it is that it will fail and the new node will not match the others.
You could copy the file from pve to corosync and restart the corosync.service, but I don't think this will work for long.
The OVH virtual rack is a shared medium for which no latency can be guaranteed.

So the high latency between Canada and Europe could be the limiting factor?

We have tried to manually copy corosyn.conf, which got the cluster to start and show up on the other nodes, but it shows up as having no connection (the red circle with a cross over it and the grayed out node name)
If so, then it would seem as if we have to run a separate cluster for Canada, which is not connected to the Europe cluster

wolfgang · Aug 31, 2020

kenneth_vkd said:
So the high latency between Canada and Europe could be the limiting factor?

Yes, it is. I saw links with 5ms that makes problems.

Search

Search

Cluster unable to add new node from Canada

kenneth_vkd

Well-Known Member

kenneth_vkd

Well-Known Member

wolfgang

Proxmox Retired Staff

kenneth_vkd

Well-Known Member

wolfgang

Proxmox Retired Staff