Adding new node breaks cluster

Ashley

Member
Jun 28, 2016
Hello,

I currently have a 4-node cluster and am trying to add a 5th node. The add command goes through fine and the node connects to the cluster, but it then sits at "waiting for quorum".

Checking the pve-cluster service status on another node, the log is full of:

Oct 31 17:04:14 prox pmxcfs[3807]: [dcdb] crit: cpg_send_message failed: 6
Oct 31 17:04:14 prox pmxcfs[3807]: [status] notice: cpg_send_message retry 10
Oct 31 17:04:15 prox pmxcfs[3807]: [dcdb] notice: cpg_send_message retry 10
Oct 31 17:04:15 prox pmxcfs[3807]: [status] notice: cpg_send_message retry 20
Oct 31 17:04:16 prox pmxcfs[3807]: [dcdb] notice: cpg_send_message retry 20
Oct 31 17:04:16 prox pmxcfs[3807]: [status] notice: cpg_send_message retry 30
Oct 31 17:04:17 prox pmxcfs[3807]: [dcdb] notice: cpg_send_message retry 30
Oct 31 17:04:17 prox pmxcfs[3807]: [status] notice: cpg_send_message retry 40
Oct 31 17:04:18 prox pmxcfs[3807]: [dcdb] notice: cpg_send_message retry 40

The whole cluster & GUI lose connectivity; the only way I can fix this is by running "service corosync stop" on the new node.
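
In case it helps, this is roughly how I have been checking the cluster state from an existing node while it hangs (nothing unusual, just the standard tools):

pvecm status                              # quorum information and member list
corosync-quorumtool -s                    # expected votes vs. total votes
journalctl -u corosync -u pve-cluster     # recent corosync / pmxcfs messages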

Attached is the corosync.conf file:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: n1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: n1-corosync
  }

  node {
    name: prox
    nodeid: 3
    quorum_votes: 1
    ring0_addr: prox-corosync
  }

  node {
    name: n2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: n2-corosync
  }

  node {
    name: sn1
    nodeid: 4
    quorum_votes: 1
    ring0_addr: sn1-corosync
  }

  node {
    name: sn2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: sn2-corosync
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: clustername
  config_version: 7
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2

  interface {
    bindnetaddr: 172.16.1.1
    ringnumber: 0
  }
}


The new node and all the other nodes have updated hosts files, and they can all ping each other with no issue. We are currently running on udpu, so I know this is not a multicast issue. It seems that as soon as the new node joins, the cluster goes into a deadlock until corosync is shut down on the new node.
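
For reference, the workaround boils down to the following. The pve-cluster restart on an existing node is only a guess on my part for when the GUI stays unresponsive; stopping corosync on the new node has been enough so far:

service corosync stop          # on the new node, stops it disturbing the cluster
service pve-cluster restart    # on an existing node, only if the GUI stays dead (untested)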

The log from the new node also shows:
Oct 31 17:04:19 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 620
Oct 31 17:04:20 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 630
Oct 31 17:04:21 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 640
Oct 31 17:04:22 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 650
Oct 31 17:04:23 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 660
Oct 31 17:04:24 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 670
Oct 31 17:04:25 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 680
Oct 31 17:04:26 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 690
Oct 31 17:04:27 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 700
Oct 31 17:04:28 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 710

Thanks,
Ashley
 
Not sure why this fixed it, but I reinstalled the node with a different name, updated all the hosts files with the new name, and it then added fine.
 
Maybe it is. I have since resolved the issue that meant I needed udpu, so do you know if switching back is as straightforward as removing the line from the config file, letting the file sync to all nodes, and then restarting corosync on each node?

Or am I asking for trouble?
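
To be clear, what I have in mind is roughly this, untested (edit the cluster-wide copy once, bump config_version, then restart corosync one node at a time):

cp /etc/pve/corosync.conf /root/corosync.conf.new
# in the copy: delete the "transport: udpu" line from the totem section
#              and increment config_version (7 -> 8)
cp /root/corosync.conf.new /etc/pve/corosync.conf   # syncs to all nodes via pmxcfs
service corosync restart                            # then run this on each node, one at a time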
 

>>transport: udpu

Note that unicast only works with a small number of nodes, so maybe the 5th node is near the limit.
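
If multicast works on your network, it is worth verifying before you drop udpu, for example by running omping on all nodes at the same time, something along these lines (adjust the list to your own ring0 hostnames):

omping -c 600 -i 1 -q n1-corosync n2-corosync prox-corosync sn1-corosync sn2-corosync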
PS: The "official" corosync example uses 16 nodes with UDPU, see https://github.com/fghaas/corosync/blob/master/conf/corosync.conf.example.udpu - do you think they're all wrong there? :)
 

Well, as I said, reinstalling the node with a new name resolved the issue (I had already tried a reinstall with the same name), so either something went wrong during the original node add, or removing the node left some information behind somewhere that now causes a problem when a node with the same name is added.
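
For anyone who hits the same thing, these are the places I would look for leftovers of the old node (sn2 was the name I originally used, adjust accordingly):

pvecm nodes                            # is the old node still listed in the membership?
grep -n "sn2" /etc/pve/corosync.conf   # stale nodelist entry?
ls /etc/pve/nodes/                     # leftover node directory in the cluster filesystem?
# if anything is still there, "pvecm delnode sn2" should clean up the cluster side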

Since adding it I have had no issues running at the current level of 5 nodes; maybe the limit of "4" is just noted from Proxmox dev experience, or to stay on the safe side.
 
Yes, we're also running several clusters with 5 or more nodes and UDPU.
Just wondering, because the PVE Wiki mentions this limit explicitly: "do not use it with more that 4 cluster nodes."
 
