Adding a node to the cluster fails

Leonardo Ramirez

Active Member
Jun 11, 2018
Hello, I have been trying to add a new node to my cluster, but after adding it the node doesn't work via the web interface; I can only access it using SSH.

[screenshot: 1589577324020.png]


So I reinstalled Proxmox and tried again, but I get the same problem. Of course, I had deleted the node from my cluster beforehand with "pvecm delnode pve7".

When the new installation finished, I connected to my new node pve7 over SSH and this time used the command line:

pvecm add 172.16.100.100

And I got the same result.
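For reference, a minimal sketch of the full remove-and-re-add sequence, using the node name and IP from this thread (the cleanup of /etc/pve/nodes is an extra precaution, not something shown in the thread):

```shell
# On an existing cluster node: remove the stale membership entry first.
pvecm delnode pve7

# Optional cleanup on an existing node: drop any leftover state for pve7.
# (This directory only exists if the node had joined before.)
rm -rf /etc/pve/nodes/pve7

# On the freshly installed pve7: join via the IP of an existing member.
pvecm add 172.16.100.100

# On any node afterwards: verify membership and quorum.
pvecm status
```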

[screenshot: 1589577511203.png]


In my cluster I can see the node, but it shows as offline.

[screenshot: 1589577568364.png]
 
Attached is the error log from my cluster:

root@pve:~# journalctl -b -u corosync -u pve-cluster
-- Logs begin at Thu 2020-05-14 13:14:10 CDT, end at Fri 2020-05-15 16:20:18 CDT. --
May 14 13:14:19 pve systemd[1]: Starting The Proxmox VE cluster filesystem...
May 14 13:14:19 pve pmxcfs[1703]: [quorum] crit: quorum_initialize failed: 2
May 14 13:14:19 pve pmxcfs[1703]: [quorum] crit: can't initialize service
May 14 13:14:19 pve pmxcfs[1703]: [confdb] crit: cmap_initialize failed: 2
May 14 13:14:19 pve pmxcfs[1703]: [confdb] crit: can't initialize service
May 14 13:14:19 pve pmxcfs[1703]: [dcdb] crit: cpg_initialize failed: 2
May 14 13:14:19 pve pmxcfs[1703]: [dcdb] crit: can't initialize service
May 14 13:14:19 pve pmxcfs[1703]: [status] crit: cpg_initialize failed: 2
May 14 13:14:19 pve pmxcfs[1703]: [status] crit: can't initialize service
May 14 13:14:20 pve systemd[1]: Started The Proxmox VE cluster filesystem.
May 14 13:14:20 pve systemd[1]: Starting Corosync Cluster Engine...
May 14 13:14:20 pve corosync[1780]: [MAIN ] Corosync Cluster Engine 3.0.3 starting up
May 14 13:14:20 pve corosync[1780]: [MAIN ] Corosync built-in features: dbus monitoring wa
May 14 13:14:21 pve corosync[1780]: [TOTEM ] Initializing transport (Kronosnet).
May 14 13:14:21 pve corosync[1780]: [TOTEM ] kronosnet crypto initialized: aes256/sha256
May 14 13:14:21 pve corosync[1780]: [TOTEM ] totemknet initialized
May 14 13:14:21 pve corosync[1780]: [KNET ] common: crypto_nss.so has been loaded from /us
May 14 13:14:21 pve corosync[1780]: [SERV ] Service engine loaded: corosync configuration
May 14 13:14:21 pve corosync[1780]: [QB ] server name: cmap
May 14 13:14:21 pve corosync[1780]: [SERV ] Service engine loaded: corosync configuration
May 14 13:14:21 pve corosync[1780]: [QB ] server name: cfg
May 14 13:14:21 pve corosync[1780]: [SERV ] Service engine loaded: corosync cluster closed
May 14 13:14:21 pve corosync[1780]: [QB ] server name: cpg
May 14 13:14:21 pve corosync[1780]: [SERV ] Service engine loaded: corosync profile loadin
May 14 13:14:21 pve corosync[1780]: [SERV ] Service engine loaded: corosync resource monit
May 14 13:14:21 pve corosync[1780]: [WD ] Watchdog not enabled by configuration
May 14 13:14:21 pve corosync[1780]: [WD ] resource load_15min missing a recovery key.
May 14 13:14:21 pve corosync[1780]: [WD ] resource memory_used missing a recovery key.
May 14 13:14:21 pve corosync[1780]: [WD ] no resources configured.
May 14 13:14:21 pve corosync[1780]: [SERV ] Service engine loaded: corosync watchdog servi
May 14 13:14:21 pve corosync[1780]: [QUORUM] Using quorum provider corosync_votequorum
May 14 13:14:21 pve corosync[1780]: [SERV ] Service engine loaded: corosync vote quorum se
May 14 13:14:21 pve corosync[1780]: [QB ] server name: votequorum
May 14 13:14:21 pve corosync[1780]: [SERV ] Service engine loaded: corosync cluster quorum
May 14 13:14:21 pve corosync[1780]: [QB ] server name: quorum
May 14 13:14:21 pve corosync[1780]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
May 14 13:14:21 pve corosync[1780]: [KNET ] host: host: 4 has no active links
May 14 13:14:21 pve corosync[1780]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
May 14 13:14:21 pve corosync[1780]: [KNET ] host: host: 4 has no active links
May 14 13:14:21 pve corosync[1780]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
May 14 13:14:21 pve corosync[1780]: [KNET ] host: host: 4 has no active links
May 14 13:14:21 pve corosync[1780]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 14 13:14:21 pve corosync[1780]: [KNET ] host: host: 1 has no active links
May 14 13:14:21 pve corosync[1780]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)


And this is the log from my new node pve7.

root@pve7:~# journalctl -b -u corosync -u pve-cluster
-- Logs begin at Fri 2020-05-15 15:59:24 CDT, end at Fri 2020-05-15 16:21:42 CDT. --
May 15 15:59:29 pve7 systemd[1]: Starting The Proxmox VE cluster filesystem...
May 15 15:59:30 pve7 systemd[1]: Started The Proxmox VE cluster filesystem.
May 15 15:59:30 pve7 systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
May 15 16:11:03 pve7 systemd[1]: Stopping The Proxmox VE cluster filesystem...
May 15 16:11:03 pve7 pmxcfs[1176]: [main] notice: teardown filesystem
May 15 16:11:04 pve7 pmxcfs[1176]: [main] notice: exit proxmox configuration filesystem (0)
May 15 16:11:04 pve7 systemd[1]: pve-cluster.service: Succeeded.
May 15 16:11:04 pve7 systemd[1]: Stopped The Proxmox VE cluster filesystem.
May 15 16:11:04 pve7 systemd[1]: Starting Corosync Cluster Engine...
May 15 16:11:04 pve7 systemd[1]: Starting The Proxmox VE cluster filesystem...
May 15 16:11:04 pve7 pmxcfs[8053]: [quorum] crit: quorum_initialize failed: 2
May 15 16:11:04 pve7 pmxcfs[8053]: [quorum] crit: can't initialize service
May 15 16:11:04 pve7 pmxcfs[8053]: [confdb] crit: cmap_initialize failed: 2
May 15 16:11:04 pve7 pmxcfs[8053]: [confdb] crit: can't initialize service
May 15 16:11:04 pve7 pmxcfs[8053]: [dcdb] crit: cpg_initialize failed: 2
May 15 16:11:04 pve7 pmxcfs[8053]: [dcdb] crit: can't initialize service
May 15 16:11:04 pve7 pmxcfs[8053]: [status] crit: cpg_initialize failed: 2
May 15 16:11:04 pve7 pmxcfs[8053]: [status] crit: can't initialize service
May 15 16:11:04 pve7 corosync[8050]: [MAIN ] Corosync Cluster Engine 3.0.3 starting up
May 15 16:11:04 pve7 corosync[8050]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
May 15 16:11:04 pve7 corosync[8050]: [TOTEM ] Initializing transport (Kronosnet).
May 15 16:11:05 pve7 corosync[8050]: [TOTEM ] kronosnet crypto initialized: aes256/sha256
May 15 16:11:05 pve7 corosync[8050]: [TOTEM ] totemknet initialized
May 15 16:11:05 pve7 corosync[8050]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
May 15 16:11:05 pve7 corosync[8050]: [SERV ] Service engine loaded: corosync configuration map access [0]
May 15 16:11:05 pve7 corosync[8050]: [QB ] server name: cmap
May 15 16:11:05 pve7 corosync[8050]: [SERV ] Service engine loaded: corosync configuration service [1]
May 15 16:11:05 pve7 corosync[8050]: [QB ] server name: cfg
May 15 16:11:05 pve7 corosync[8050]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 15 16:11:05 pve7 corosync[8050]: [QB ] server name: cpg
May 15 16:11:05 pve7 corosync[8050]: [SERV ] Service engine loaded: corosync profile loading service [4]
May 15 16:11:05 pve7 corosync[8050]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
May 15 16:11:05 pve7 corosync[8050]: [WD ] Watchdog not enabled by configuration
May 15 16:11:05 pve7 corosync[8050]: [WD ] resource load_15min missing a recovery key.
May 15 16:11:05 pve7 corosync[8050]: [WD ] resource memory_used missing a recovery key.
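The repeated "quorum_initialize failed: 2" messages mean pmxcfs could not open an IPC connection to corosync (in corosync's cs_error_t, 2 is CS_ERR_LIBRARY), and the earlier "Condition check resulted in Corosync Cluster Engine being skipped" line on pve7 suggests /etc/corosync/corosync.conf did not exist yet at that boot. A first-round diagnostic sketch on pve7, using standard Proxmox/corosync tools, might look like:

```shell
# Is corosync actually running, and does its local config exist?
systemctl status corosync
ls -l /etc/corosync/corosync.conf

# knet link state as corosync sees it (one block per configured link).
corosync-cfgtool -s

# Cluster membership and quorum from the Proxmox side.
pvecm status

# Follow both services live while reproducing the problem.
journalctl -f -u corosync -u pve-cluster
```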
 


Hi,

After an hour I saw my new node pve7 online, but something is wrong, because now node pve3 shows as offline, and I didn't have problems with the nodes before. Also, when I try to do something in the web interface it is very slow, and it constantly shows me an error like:

[screenshot: 1589580211547.png]
For a few minutes I saw my cluster with node pve3 offline.

[screenshot: 1589580227544.png]
After 10 or more minutes I saw it like this:

[screenshot: 1589580313817.png]
 
Hi, can you please post/attach /etc/corosync/corosync.conf from pve7 and one of the working nodes?

It looks like the join worked initially, as the other nodes are at least visible and thus an initial sync should have happened, but now they cannot talk with each other.

How are those nodes connected together physically? What does the network topology look like?
Were the first 6 nodes added to the cluster in the same way as the new one?
 
Node PVE1 (It's working)

root@pve:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.100.100
  }
  node {
    name: pve2
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.16.100.101
  }
  node {
    name: pve3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.100.102
  }
  node {
    name: pve4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.100.103
  }
  node {
    name: pve5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.16.100.104
  }
  node {
    name: pve6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 172.16.100.105
  }
  node {
    name: pve7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 172.16.100.106
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: HB
  config_version: 31
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}



Node PVE7 (Not Working)

root@pve7:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.100.100
  }
  node {
    name: pve2
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.16.100.101
  }
  node {
    name: pve3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.100.102
  }
  node {
    name: pve4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.100.103
  }
  node {
    name: pve5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.16.100.104
  }
  node {
    name: pve6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 172.16.100.105
  }
  node {
    name: pve7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 172.16.100.106
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: HB
  config_version: 31
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}
 
Answers:

How are those nodes connected together physically? They use a dedicated network for the cluster, connected through a Cisco switch with 1 Gb ports.

What does the network topology look like? The topology is a star.

Were the first 6 nodes added to the cluster in the same way as the new one? Yes; one of them was added with the command line, the others with the GUI.
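Corosync 3 over kronosnet uses unicast UDP (port 5405 by default) and is sensitive to latency and packet loss, so with a star topology through one switch it is worth checking each leg directly. A sketch, run from pve7, where eth1 is a placeholder for whichever NIC carries the 172.16.100.0/24 cluster network:

```shell
# Latency and packet loss toward an existing member.
ping -c 100 -i 0.2 172.16.100.100

# Confirm corosync traffic is actually flowing on the cluster NIC;
# replace eth1 with the real interface name.
tcpdump -ni eth1 udp port 5405

# Per-link knet status (connected / down) for every configured host.
corosync-cfgtool -s
```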
 
Hello.

I had the same problem yesterday!

I tried to add a brand new PVE 6.3 node to an existing 4-node PVE 6.1 cluster (all nodes on version 6.1).

Right after registering the new node I had the SAME problems:
- the node refused to join
- the cluster UI was completely broken, although the VMs seemed to keep working without problems

IMPORTANT detail:
I found the "corosync" process eating 100% CPU on the existing cluster node I had used for the join.
The cluster only managed to work again once I powered off the new node.

It is clear to me that there is some network problem when you try to expand an existing cluster.
This is somehow linked to PVE 6.3.

This is something like the fourth message on this matter I have read in 2021.
I have a dedicated network, and my PVE cluster has worked for years without problems.

Proxmox staff, please look into this. It seems there is a major network problem somewhere in PVE 6.3, or in the combination of PVE 6.1 and PVE 6.3.

PS: the new node was fully updated before the cluster join using the "pve-no-subscription" repository.
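One hedged suggestion for a mixed 6.1/6.3 situation like this: before joining, compare the cluster-stack package versions on the new node and the existing members, since corosync and kronosnet are the components doing the talking. The hostnames in the loop are placeholders:

```shell
# On each node, list the versions of the packages involved in clustering.
pveversion -v | grep -E 'corosync|libknet|pve-cluster'

# Or collect them in one pass from a node with SSH access to the others
# (substitute your own hostnames).
for h in node1 node2 node3 node4; do
  echo "== $h =="
  ssh "$h" "pveversion -v | grep -E 'corosync|libknet|pve-cluster'"
done
```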
 
