Adding new node breaks cluster

Ashley

Member
Jun 28, 2016
Hello,

I currently have a 4-node cluster and am trying to add a 5th node. The add command goes through fine and the node connects to the cluster, but it then sits at "waiting for quorum".

Checking the pve-cluster service status on another node, the log is full of:

Oct 31 17:04:14 prox pmxcfs[3807]: [dcdb] crit: cpg_send_message failed: 6
Oct 31 17:04:14 prox pmxcfs[3807]: [status] notice: cpg_send_message retry 10
Oct 31 17:04:15 prox pmxcfs[3807]: [dcdb] notice: cpg_send_message retry 10
Oct 31 17:04:15 prox pmxcfs[3807]: [status] notice: cpg_send_message retry 20
Oct 31 17:04:16 prox pmxcfs[3807]: [dcdb] notice: cpg_send_message retry 20
Oct 31 17:04:16 prox pmxcfs[3807]: [status] notice: cpg_send_message retry 30
Oct 31 17:04:17 prox pmxcfs[3807]: [dcdb] notice: cpg_send_message retry 30
Oct 31 17:04:17 prox pmxcfs[3807]: [status] notice: cpg_send_message retry 40
Oct 31 17:04:18 prox pmxcfs[3807]: [dcdb] notice: cpg_send_message retry 40

The whole cluster & GUI lose connectivity; the only way I can fix this is by running "service corosync stop" on the new node.
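
In case it helps, this is roughly how I have been checking the cluster state from an existing node while it hangs (nothing unusual, just the standard tools):

pvecm status                              # quorum information and member list
corosync-quorumtool -s                    # expected votes vs. total votes
journalctl -u corosync -u pve-cluster     # recent corosync / pmxcfs messages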

Attached is the corosync.conf file:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: n1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: n1-corosync
  }

  node {
    name: prox
    nodeid: 3
    quorum_votes: 1
    ring0_addr: prox-corosync
  }

  node {
    name: n2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: n2-corosync
  }

  node {
    name: sn1
    nodeid: 4
    quorum_votes: 1
    ring0_addr: sn1-corosync
  }

  node {
    name: sn2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: sn2-corosync
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: clustername
  config_version: 7
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2

  interface {
    bindnetaddr: 172.16.1.1
    ringnumber: 0
  }
}


The new node and all the other nodes have updated hosts files, and they can all ping each other with no issue. We are currently running on udpu, so I know this is not a multicast issue. It seems that as soon as the new node joins, the cluster goes into a deadlock until corosync is shut down on the new node.
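
For reference, the workaround boils down to the following. The pve-cluster restart on an existing node is only a guess on my part for when the GUI stays unresponsive; stopping corosync on the new node has been enough so far:

service corosync stop          # on the new node, stops it disturbing the cluster
service pve-cluster restart    # on an existing node, only if the GUI stays dead (untested)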

The log from the new node also shows:
Oct 31 17:04:19 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 620
Oct 31 17:04:20 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 630
Oct 31 17:04:21 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 640
Oct 31 17:04:22 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 650
Oct 31 17:04:23 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 660
Oct 31 17:04:24 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 670
Oct 31 17:04:25 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 680
Oct 31 17:04:26 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 690
Oct 31 17:04:27 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 700
Oct 31 17:04:28 sn2 pmxcfs[2933]: [dcdb] notice: cpg_join retry 710

Thanks,
Ashley
 
Not sure why this fixed it, but I reinstalled the node with a different name, updated all the hosts files with the new name, and it then added fine.
 
Maybe it is. I have since resolved the issue that meant I needed udpu, so do you know if switching back is as straightforward as removing the line from the config file, letting the file sync to all nodes, and then restarting corosync on each node?

Or am I asking for trouble?
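
To be clear, what I have in mind is roughly this, untested (edit the cluster-wide copy once, bump config_version, then restart corosync one node at a time):

cp /etc/pve/corosync.conf /root/corosync.conf.new
# in the copy: delete the "transport: udpu" line from the totem section
#              and increment config_version (7 -> 8)
cp /root/corosync.conf.new /etc/pve/corosync.conf   # syncs to all nodes via pmxcfs
service corosync restart                            # then run this on each node, one at a time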
 

>>transport: udpu

Note that unicast only works with a small number of nodes, so maybe the 5th node is near the limit.
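
If multicast works on your network, it is worth verifying before you drop udpu, for example by running omping on all nodes at the same time, something along these lines (adjust the list to your own ring0 hostnames):

omping -c 600 -i 1 -q n1-corosync n2-corosync prox-corosync sn1-corosync sn2-corosync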
PS: The "official" corosync example uses 16 nodes with UDPU, see https://github.com/fghaas/corosync/blob/master/conf/corosync.conf.example.udpu - do you think they're all wrong there? :)
 

Well, as I said, reinstalling the node with a new name resolved the issue (I had already tried a reinstall with the same name), so either something went wrong during the original node add, or removing the node left some information behind somewhere that now causes a problem when a node with the same name is added.
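
For anyone who hits the same thing, these are the places I would look for leftovers of the old node (sn2 was the name I originally used, adjust accordingly):

pvecm nodes                            # is the old node still listed in the membership?
grep -n "sn2" /etc/pve/corosync.conf   # stale nodelist entry?
ls /etc/pve/nodes/                     # leftover node directory in the cluster filesystem?
# if anything is still there, "pvecm delnode sn2" should clean up the cluster side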

Since adding it I have had no issues running at the current level of 5 nodes; maybe the limit of "4" is just noted from Proxmox dev experience, or to stay on the safe side.
 
Yes, we're also running several clusters with 5 or more nodes and UDPU.
Just wondering, because the PVE Wiki mentions this limit explicitly: "do not use it with more that 4 cluster nodes."
 
