Unable to add a new node to an existing cluster

tzajac

Member
Aug 9, 2019
Hi,

I am battling with adding a new node to an existing cluster.
The cluster was initially set up with 3 nodes.
Some VMs were added as well.
The cluster was working fine. I added node 4, which was still OK.
Now I am trying to add node 5, and this is where the trouble begins.
The new node has exactly the same hardware as the other nodes. I install Proxmox, log in via the web GUI (all seems to be OK), SSH to the new node, run updates, etc.
Then, via the web GUI, I configure the network cards the same way as on the other nodes (just giving the new node its own IP)...
Once this is done and the node has rebooted, I try to add the new node to the cluster via the web GUI. When I click add, the process starts and ends with a connection error.
Usually refreshing the page gets me back to the login page, where I can log in and see the whole cluster. This time, after adding the node to the cluster, the GUI can no longer be accessed via the web.
On another node, via the web GUI, I can see 5 nodes; however, the "old" nodes have a green tick while the new node 5 has a red cross.
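
For reference, the join can also be attempted from the CLI instead of the GUI; a minimal sketch, assuming 192.0.0.71 (one of the existing members shown below) as the join target:

Code:
# run on the NEW node (pve-05); join the existing cluster by pointing
# at any current member -- the CLI equivalent of the GUI join dialog
pvecm add 192.0.0.71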

When running pvecm status on an "old" node, I get this:
Quorum information
------------------
Date:             Wed Oct 30 12:47:18 2019
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/219176
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.0.0.71 (local)
0x00000002          1 192.0.0.72
0x00000003          1 192.0.0.73
0x00000004          1 192.0.0.74

So it looks like node 5 is missing from the membership list.
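
To dig further, it may help to compare the cluster configuration against this live membership; a sketch, run on one of the "old" nodes:

Code:
# the cluster-wide corosync config stored in the cluster filesystem;
# if the join got far enough, pve-05 should appear in the nodelist
cat /etc/pve/corosync.conf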

Running ls -l /etc/pve/nodes/ on an "old" node, I get this:
total 0
drwxr-xr-x 2 root www-data 0 Sep 20 15:32 pve-01
drwxr-xr-x 2 root www-data 0 Sep 23 13:31 pve-02
drwxr-xr-x 2 root www-data 0 Sep 23 13:33 pve-03
drwxr-xr-x 2 root www-data 0 Oct 23 12:30 pve-04
drwxr-xr-x 2 root www-data 0 Oct 30 11:41 pve-05

So the new node is on the list...

Another command, pvesh get cluster/config/nodes (executed on the "old" node), shows this:

┌────────┐
│ node │
├────────┤
│ pve-01 │
├────────┤
│ pve-02 │
├────────┤
│ pve-03 │
├────────┤
│ pve-04 │
├────────┤
│ pve-05 │
└────────┘


And again, it looks like node 5 is there; note that this command reads the cluster configuration, whereas pvecm nodes below reflects the live corosync membership...

However, running pvecm nodes yields this:

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve-01 (local)
         2          1 pve-02
         3          1 pve-03
         4          1 pve-04

Running the same command on the new node, I get this:

Membership information
----------------------
    Nodeid      Votes Name
         5          1 pve-05 (local)

Running pvecm status on the new node, I get this:

Quorum information
------------------
Date:             Wed Oct 30 12:58:01 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000005
Ring ID:          5/16
Quorate:          No

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      1
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000005          1 192.0.0.75 (local)
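
It may also be worth checking whether the new node ever received the cluster-wide corosync configuration; a sketch, run on the new node (this is the local copy that pmxcfs syncs from /etc/pve):

Code:
# the local corosync config on the new node -- the nodelist should
# contain all five nodes if the cluster config was propagated
cat /etc/corosync/corosync.conf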


Running ls -l /etc/pve/nodes/ on the new node shows:

ls: cannot access '/etc/pve/nodes/': No such file or directory
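
That error suggests the cluster filesystem is not mounted on the new node; a quick check (a sketch):

Code:
# pmxcfs normally mounts a FUSE filesystem on /etc/pve;
# no output here means the cluster filesystem is not mounted
mount | grep /etc/pve
systemctl status pve-cluster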

I have already removed node 5 from the cluster, reinstalled Proxmox from scratch on the new node, and tried to add it to the cluster again, always with the same result.

I am not sure whether I am doing something wrong, doing it in the wrong order, or whether something else is preventing me from adding a new node to the cluster.
I will appreciate any input and help on this problem, as I will need to add 2 more nodes...
 
Hi,

Can you please post the output of
Code:
systemctl status pve-cluster corosync
on the new node and on one "old" (already in cluster) node?

ls: cannot access '/etc/pve/nodes/': No such file or directory
Seems like pmxcfs (pve-cluster.service) is not started.
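
If it is not running, restarting it and checking the journal would be the next step; a sketch:

Code:
# on the new node: restart the cluster filesystem and corosync,
# then look at the most recent log lines for errors
systemctl restart pve-cluster corosync
journalctl -b -u pve-cluster -u corosync --no-pager | tail -n 50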
 
Hi,

Thank you for your kind reply.

This is the result of running systemctl status pve-cluster corosync on an old node:

root@pve-01:~# systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-10-29 13:48:49 NZDT; 1 day 16h ago
Process: 2394 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Process: 2578 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Main PID: 2433 (pmxcfs)
Tasks: 13 (limit: 7372)
Memory: 64.1M
CGroup: /system.slice/pve-cluster.service
└─2433 /usr/bin/pmxcfs

Oct 31 00:48:48 pve-01 pmxcfs[2433]: [dcdb] notice: data verification successful
Oct 31 01:48:48 pve-01 pmxcfs[2433]: [dcdb] notice: data verification successful
Oct 31 02:48:48 pve-01 pmxcfs[2433]: [dcdb] notice: data verification successful
Oct 31 02:53:00 pve-01 pmxcfs[2433]: [status] notice: received log
Oct 31 02:53:03 pve-01 pmxcfs[2433]: [status] notice: received log
Oct 31 03:46:59 pve-01 pmxcfs[2433]: [status] notice: received log
Oct 31 03:47:02 pve-01 pmxcfs[2433]: [status] notice: received log
Oct 31 03:48:48 pve-01 pmxcfs[2433]: [dcdb] notice: data verification successful
Oct 31 04:48:48 pve-01 pmxcfs[2433]: [dcdb] notice: data verification successful
Oct 31 05:48:48 pve-01 pmxcfs[2433]: [dcdb] notice: data verification successful

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-10-29 13:48:50 NZDT; 1 day 16h ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2584 (corosync)
Tasks: 9 (limit: 7372)
Memory: 259.2M
CGroup: /system.slice/corosync.service
└─2584 /usr/sbin/corosync -f

Oct 31 00:35:16 pve-01 corosync[2584]: [CPG   ] downlist left_list: 0 received
Oct 31 00:35:16 pve-01 corosync[2584]: [QUORUM] Members[2]: 1 4
Oct 31 00:35:16 pve-01 corosync[2584]: [MAIN  ] Completed service synchronization, ready to provide service.
Oct 31 00:35:19 pve-01 corosync[2584]: [TOTEM ] A new membership (1:295592) was formed. Members joined: 2
Oct 31 00:35:19 pve-01 corosync[2584]: [CPG   ] downlist left_list: 0 received
Oct 31 00:35:19 pve-01 corosync[2584]: [CPG   ] downlist left_list: 0 received
Oct 31 00:35:19 pve-01 corosync[2584]: [CPG   ] downlist left_list: 0 received
Oct 31 00:35:19 pve-01 corosync[2584]: [QUORUM] This node is within the primary component and will provide service.
Oct 31 00:35:19 pve-01 corosync[2584]: [QUORUM] Members[3]: 1 2 4
Oct 31 00:35:19 pve-01 corosync[2584]: [MAIN  ] Completed service synchronization, ready to provide service.

And this is the output from the new node:

root@pve-05:~# systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-10-30 12:26:16 NZDT; 17h ago
Process: 1753 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Process: 1903 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Main PID: 1786 (pmxcfs)
Tasks: 6 (limit: 7372)
Memory: 39.1M
CGroup: /system.slice/pve-cluster.service
└─1786 /usr/bin/pmxcfs

Oct 30 20:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful
Oct 30 21:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful
Oct 30 22:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful
Oct 30 23:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful
Oct 31 00:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful
Oct 31 01:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful
Oct 31 02:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful
Oct 31 03:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful
Oct 31 04:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful
Oct 31 05:26:15 pve-05 pmxcfs[1786]: [dcdb] notice: data verification successful

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-10-30 12:26:16 NZDT; 17h ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 1908 (corosync)
Tasks: 9 (limit: 7372)
Memory: 181.4M
CGroup: /system.slice/corosync.service
└─1908 /usr/sbin/corosync -f

Oct 30 12:26:16 pve-05 corosync[1908]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 30 12:26:16 pve-05 corosync[1908]: [KNET ] host: host: 4 has no active links
Oct 30 12:26:16 pve-05 corosync[1908]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 30 12:26:16 pve-05 corosync[1908]: [KNET ] host: host: 4 has no active links
Oct 30 12:26:16 pve-05 corosync[1908]: [CPG ] downlist left_list: 0 received
Oct 30 12:26:16 pve-05 corosync[1908]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 0)
Oct 30 12:26:16 pve-05 corosync[1908]: [KNET ] host: host: 5 has no active links
Oct 30 12:26:16 pve-05 corosync[1908]: [QUORUM] Members[1]: 5
Oct 30 12:26:16 pve-05 corosync[1908]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 30 12:26:16 pve-05 systemd[1]: Started Corosync Cluster Engine.
 
Now my cluster is gone; at least the web GUI is indicating that the node is standalone.
 

Attachments

  • proxmox-cluster-error..PNG
OK,
shutting down node 5 and using the command:

pvecm delnode pve-05

removed the node, and the cluster came back...
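
For completeness: after pvecm delnode, leftover configuration for the removed node can remain under /etc/pve/nodes/, and cleaning it up before a re-join avoids stale state; a sketch, following the node-removal notes in the Proxmox docs:

Code:
# on a remaining cluster node, after 'pvecm delnode pve-05':
# remove the stale node directory so a fresh join starts clean
rm -rf /etc/pve/nodes/pve-05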
 
Did you try re-joining it? You need to use the CLI for that.

Anyway, the services were running just fine, which means that either the network connection was not available, or the corosync configuration had some issues. As the latter is generated automatically by the API, and we currently have no known issues with the config generation (it's quite widely used), I'd guess it's the network...

Are the nodes all connected through the same switch?
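
One way to check the corosync links from each node (a sketch; corosync-cfgtool ships with corosync):

Code:
# show the state of the knet links to every other cluster node;
# each remote host should be listed as connected
corosync-cfgtool -s
# basic reachability check from the new node to an existing member
ping -c 3 192.0.0.71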
 
