Adding 2nd node fails

lifeboy

Active Member
I have a single-node cluster (PVE 5.2) set up and want to add nodes to it one at a time. To do this, I move the virtual machines off the old OS of the machine I want to add as a node and onto Proxmox (with RBD as storage). Then I add that machine as a new node.

So I tried to add the 2nd node, and it hung on "waiting for quorum..." after I ran "pvecm add 192.168.0.14". Now 192.168.0.14 (yster4) has the node config stored, but 192.168.0.13 (yster3) complains "Check if node may join a cluster failed!" when I try to add the node again, which of course I can't do.

So how do I fix this? I can't remove the node and reinstall it, since corosync doesn't "see" the node yet.

Code:
# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 192.168.0.14 (local)
Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: yster3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.13
  }
  node {
    name: yster4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.14
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: IMB-production
  config_version: 2
  interface {
    bindnetaddr: 192.168.0.14
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
On yster3 I have:

Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: yster3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.13
  }
  node {
    name: yster4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.14
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: IMB-production
  config_version: 2
  interface {
    bindnetaddr: 192.168.0.14
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
Code:
# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         2          1 192.168.0.13 (local)
On yster3:

Code:
# cat /etc/pve/.members
{
"nodename": "yster3",
"version": 3,
"cluster": { "name": "IMB-production", "version": 2, "nodes": 2, "quorate": 0 },
"nodelist": {
  "yster3": { "id": 2, "online": 1, "ip": "192.168.0.13"},
  "yster4": { "id": 1, "online": 0}
  }
}
On yster4:

Code:
# cat /etc/pve/.members
{
"nodename": "yster4",
"version": 4,
"cluster": { "name": "IMB-production", "version": 2, "nodes": 2, "quorate": 0 },
"nodelist": {
  "yster3": { "id": 2, "online": 0},
  "yster4": { "id": 1, "online": 1, "ip": "192.168.0.14"}
  }
}
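For readers who want to check the quorate flag on each node without reading the whole file, here is a minimal sketch. `members_quorate` is a hypothetical helper name (not a Proxmox command), and the sample input is copied from the yster3 output above.

```shell
#!/bin/sh
# Print the "quorate" flag (0 or 1) found in pmxcfs .members JSON on stdin.
# members_quorate is a hypothetical helper, not part of Proxmox.
members_quorate() {
    sed -n 's/.*"quorate":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Sample taken from the yster3 output above. On a real node one would run:
#   members_quorate < /etc/pve/.members
members_quorate <<'EOF'
{
"nodename": "yster3",
"version": 3,
"cluster": { "name": "IMB-production", "version": 2, "nodes": 2, "quorate": 0 },
"nodelist": {
  "yster3": { "id": 2, "online": 1, "ip": "192.168.0.13"},
  "yster4": { "id": 1, "online": 0}
  }
}
EOF
```

With the sample above this prints 0, matching the "no quorum" state both nodes report.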
Yster3 doesn't have an authkey file.

Code:
# cat /etc/pve/authkey.pub
cat: /etc/pve/authkey.pub: No such file or directory
Should I create a file and copy the key from yster4 into it?

I also noticed that I can't ssh-copy-id to yster3.

Code:
root@yster4:~# ssh-copy-id yster3
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@yster3's password:
cat: write error: Permission denied
This might be the cause of the problem with authkey.pub?

How can I fix this?

This is a clean installation, why is this happening? Is there something that I didn't install?
I did an apt update and apt upgrade after installation.
 

devinacosta

Member
So you mentioned that you had a single node and you want to make it a cluster. What node was your initial host? Yster3 or Yster4?
When you installed the cluster what steps did you follow to initialize the cluster?
Do your hosts only have 1 NIC or multiple NICs?
Does your switch support Multicast, some switches by default don't have it enabled, and you may need to use Unicast instead.

Answer these questions and I will be glad to help you out.
 

lifeboy

Active Member
Original host / 1st node: Yster4
I installed the cluster by first copying my ssh keys to Yster4, then running "pvecm init IMB-production". That's it.
I have 2 NICs in each server, but will be adding more in future (3 x 1Gb/s bonded) for ceph's use.
My hosting provider (Hetzner) has confirmed that their Juniper switches are set up to support Multicast.

I have done this so many times on other systems, so did I do something wrong this time?
 

lifeboy

Active Member
Additionally I see this in my syslog:

Code:
Jun 19 15:41:09 yster4 pvesr[3133162]: error with cfs lock 'file-replication_cfg': no quorum!
Jun 19 15:41:09 yster4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jun 19 15:41:09 yster4 systemd[1]: Failed to start Proxmox VE replication runner.
Jun 19 15:41:09 yster4 systemd[1]: pvesr.service: Unit entered failed state.
Jun 19 15:41:09 yster4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jun 19 15:42:00 yster4 systemd[1]: Starting Proxmox VE replication runner...
Jun 19 15:42:00 yster4 pvesr[3133607]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:42:01 yster4 pvesr[3133607]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:42:02 yster4 pvesr[3133607]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:42:03 yster4 pvesr[3133607]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:42:04 yster4 pvesr[3133607]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:42:05 yster4 pvesr[3133607]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:42:06 yster4 pvesr[3133607]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:42:07 yster4 pvesr[3133607]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:42:08 yster4 pvesr[3133607]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:42:09 yster4 pvesr[3133607]: error with cfs lock 'file-replication_cfg': no quorum!
Jun 19 15:42:09 yster4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jun 19 15:42:09 yster4 systemd[1]: Failed to start Proxmox VE replication runner.
Jun 19 15:42:09 yster4 systemd[1]: pvesr.service: Unit entered failed state.
Jun 19 15:42:09 yster4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jun 19 15:43:00 yster4 systemd[1]: Starting Proxmox VE replication runner...
Jun 19 15:43:00 yster4 pvesr[3134025]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:43:01 yster4 pvesr[3134025]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:43:02 yster4 pvesr[3134025]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:43:03 yster4 pvesr[3134025]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:43:04 yster4 pvesr[3134025]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 19 15:43:05 yster4 pvesr[3134025]: trying to aquire cfs lock 'file-replication_cfg' ...
I only ever had one node, to which I added some VMs and LXCs. What does the above indicate?
 

lifeboy

Active Member
I have now removed the Yster3 node from the cluster by running "pvecm delnode yster3" on Yster4 (the yster3 machine was already shut down).

I then reinstalled proxmox from the iso and made the following changes:
  1. I added /etc/apt/sources.list.d/pve-community.list and commented out the contents of /etc/apt/sources.list.d/pve-enterprise.list
  2. Ran
    Code:
    apt update
    and then
    Code:
    apt upgrade
    and then
    Code:
    apt install vim
    to get my favourite editor installed.
  3. Installed keys:
    Code:
    ssh-copy-id yster4
    and from yster4,
    Code:
    ssh-copy-id yster3
    Tested ssh login; the keys are used.
  4. Checked yster4 again for status
    Code:
    root@yster4:~# pvecm status
    Quorum information
    ------------------
    Date:             Tue Jun 19 22:26:04 2018
    Quorum provider:  corosync_votequorum
    Nodes:            1
    Node ID:          0x00000001
    Ring ID:          1/748
    Quorate:          Yes
    
    Votequorum information
    ----------------------
    Expected votes:   1
    Highest expected: 1
    Total votes:      1
    Quorum:           1
    Flags:            Quorate
    
    Membership information
    ----------------------
        Nodeid      Votes Name
    0x00000001          1 192.168.0.14 (local)
    Then I issued:
    Code:
    pvecm add yster4
    from yster3.

    This is the result I got:

    Code:
    root@yster3:~# pvecm add yster4
    Please enter superuser (root) password for 'yster4':
                                                        Password for root@yster4: *********
    Etablishing API connection with host 'yster4'
    The authenticity of host 'yster4' can't be established.
    X509 SHA256 key fingerprint is CD:82:4B:4D:B3:BE:13:CB:05:5D:BE:AE:EF:C3:E4:24:EA:54:3B:88:C8:A8:36:6F:A1:7C:31:34:90:AB:9D:14.
    Are you sure you want to continue connecting (yes/no)? yes
    Login succeeded.
    Request addition of this node
    Join request OK, finishing setup locally
    stopping pve-cluster service
    backup old database to '/var/lib/pve-cluster/backup/config-1529442754.sql.gz'
    waiting for quorum...
    It's been like that for 5 minutes already. On yster4, syslog reports:

    Code:
    Jun 19 23:27:00 yster4 systemd[1]: Starting Proxmox VE replication runner...
    Jun 19 23:27:00 yster4 pvesr[3447443]: trying to aquire cfs lock 'file-replication_cfg' ...
    Jun 19 23:27:01 yster4 pvesr[3447443]: trying to aquire cfs lock 'file-replication_cfg' ...
    Jun 19 23:27:02 yster4 pvesr[3447443]: trying to aquire cfs lock 'file-replication_cfg' ...
    Jun 19 23:27:03 yster4 pvesr[3447443]: trying to aquire cfs lock 'file-replication_cfg' ...
    Jun 19 23:27:04 yster4 pvesr[3447443]: trying to aquire cfs lock 'file-replication_cfg' ...
    Jun 19 23:27:05 yster4 pvesr[3447443]: trying to aquire cfs lock 'file-replication_cfg' ...
    Jun 19 23:27:06 yster4 pvesr[3447443]: trying to aquire cfs lock 'file-replication_cfg' ...
    Jun 19 23:27:07 yster4 pvesr[3447443]: trying to aquire cfs lock 'file-replication_cfg' ...
    Jun 19 23:27:08 yster4 pvesr[3447443]: trying to aquire cfs lock 'file-replication_cfg' ...
    Jun 19 23:27:09 yster4 pvesr[3447443]: error with cfs lock 'file-replication_cfg': no quorum!
    Jun 19 23:27:09 yster4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
    Jun 19 23:27:09 yster4 systemd[1]: Failed to start Proxmox VE replication runner.
    Jun 19 23:27:09 yster4 systemd[1]: pvesr.service: Unit entered failed state.
    Jun 19 23:27:09 yster4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
    Does this mean the problem is on Yster4?

    I really need help with this as I have exhausted all the articles I could find.

    Thanks in advance.
 

lifeboy

Active Member
Working through https://forum.proxmox.com/threads/cluster-problem-cant-add-new-node.44319/ I tried:

Code:
root@yster3:~# omping -c 600 -i 1 -q 192.168.0.13 192.168.0.14
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
^C
192.168.0.14 : response message never received
So there is indeed a multicast problem!
 

lifeboy

Active Member
Furthermore, I enabled the IGMP querier on the bridge of each node, so now I have:
Code:
# cat /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
1
# cat /sys/class/net/vmbr0/bridge/multicast_snooping
0
Also, I added the post-up line below to /etc/network/interfaces:
Code:
iface vmbr0 inet static
    address  192.168.0.14
    netmask  255.255.255.0
    gateway  192.168.0.1
    bridge_ports eno2
    bridge_stp off
    bridge_fd 0
    post-up ( echo 1 > /sys/devices/virtual/net/$IFACE/bridge/multicast_querier )
Did this on both yster3 and yster4 and restarted corosync and pve-cluster, but still no multicast.
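To compare the two bridge flags on both nodes in one go, something like the sketch below might help. `bridge_mcast_flags` and the `SYSFS_NET` override are hypothetical names introduced here for illustration; the sysfs files are the same ones shown above (/sys/class/net/vmbr0 is a symlink into /sys/devices/virtual/net).

```shell
#!/bin/sh
# Print the IGMP querier and snooping flags of a Linux bridge.
# SYSFS_NET is a hypothetical override so the function can be tried against a
# fake directory tree; it defaults to the real sysfs location.
bridge_mcast_flags() {
    bridge="$1"
    base="${SYSFS_NET:-/sys/class/net}/$bridge/bridge"
    for flag in multicast_querier multicast_snooping; do
        if [ -r "$base/$flag" ]; then
            printf '%s %s=%s\n' "$bridge" "$flag" "$(cat "$base/$flag")"
        else
            printf '%s %s=unreadable\n' "$bridge" "$flag"
        fi
    done
}

# Defaults to vmbr0, the bridge used in this thread.
bridge_mcast_flags "${1:-vmbr0}"
```

Run on each node, this should show multicast_querier=1 if the post-up line took effect after the interface was brought up again.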

Hetzner, our hosting provider, has confirmed this:
Code:
Hetzner has IGMP snooping enabled on the backlink network. This means you must enable an IGMP Querier on your host, as per the Proxmox documentation:

https://pve.proxmox.com/wiki/Multicast_notes
 

lifeboy

Active Member
So! I'm not quite sure how, since nothing has really changed, but omping now gets responses:

Code:
root@yster3:~# omping -v -c 600 -i 1 -q 192.168.0.13 192.168.0.14
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : waiting for response msg
192.168.0.14 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.0.14 : waiting for response msg
192.168.0.14 : server told us to stop

192.168.0.14 :   unicast, xmt/rcv/%loss = 29/29/0%, min/avg/max/std-dev = 0.078/0.131/0.161/0.014
192.168.0.14 : multicast, xmt/rcv/%loss = 29/29/0%, min/avg/max/std-dev = 0.131/0.179/0.208/0.014
Waiting for 3000 ms to inform other nodes about instance exit
while on the other node:

Code:
root@yster4:~# omping -v -c 600 -i 1 -q 192.168.0.14 192.168.0.13
192.168.0.13 : waiting for response msg
192.168.0.13 : joined (S,G) = (*, 232.43.211.234), pinging
^C
192.168.0.13 :   unicast, xmt/rcv/%loss = 30/30/0%, min/avg/max/std-dev = 0.087/0.181/0.229/0.035
192.168.0.13 : multicast, xmt/rcv/%loss = 30/29/3% (seq>=2 0%), min/avg/max/std-dev = 0.111/0.192/0.242/0.031
Waiting for 3000 ms to inform other nodes about instance exit
Now, let me see if I can get the node to join again.
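For anyone repeating this test, the multicast loss figure can be pulled out of saved omping output with a short sketch like the one below. `omping_mcast_loss` is a hypothetical helper name, and the sample lines are the two multicast summary lines from the runs above.

```shell
#!/bin/sh
# Print "<peer> <multicast loss %>" for each omping multicast summary line
# read on stdin. omping_mcast_loss is a hypothetical helper, not part of omping.
omping_mcast_loss() {
    sed -n 's|^\([^ ]*\) : multicast, xmt/rcv/%loss = [0-9]*/[0-9]*/\([0-9]*\)%.*|\1 \2|p'
}

# Summary lines copied from the two omping runs above.
omping_mcast_loss <<'EOF'
192.168.0.14 :   unicast, xmt/rcv/%loss = 29/29/0%, min/avg/max/std-dev = 0.078/0.131/0.161/0.014
192.168.0.14 : multicast, xmt/rcv/%loss = 29/29/0%, min/avg/max/std-dev = 0.131/0.179/0.208/0.014
192.168.0.13 : multicast, xmt/rcv/%loss = 30/29/3% (seq>=2 0%), min/avg/max/std-dev = 0.111/0.192/0.242/0.031
EOF
```

If nothing is printed, as would be the case for the earlier run that ended in "response message never received", no multicast summary was produced at all.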
 

lifeboy

Active Member
Now if I force the node add with
Code:
# pvecm add yster4 --force 1
detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* this host already contains virtual guests
* corosync is already running, is this node already in a cluster?!
Please enter superuser (root) password for 'yster4':
                                                    Password for root@yster4: *********
detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* this host already contains virtual guests
* corosync is already running, is this node already in a cluster?!
Etablishing API connection with host 'yster4'
The authenticity of host 'yster4' can't be established.
X509 SHA256 key fingerprint is CD:82:4B:4D:B3:BE:13:CB:05:5D:BE:AE:EF:C3:E4:24:EA:54:3B:88:C8:A8:36:6F:A1:7C:31:34:90:AB:9D:14.
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1529522117.sql.gz'
waiting for quorum...
and on yster4 I see:
Code:
root@yster4:~# pvecm status
Quorum information
------------------
Date:             Wed Jun 20 21:15:25 2018
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/1908
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:          

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.14 (local)
And this is with multicast working fine.
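The "Quorum: 2 Activity blocked" line follows from votequorum's default majority rule, where the votes needed are floor(expected votes / 2) + 1. A small sketch of that arithmetic, with `quorum_needed` as a hypothetical helper name:

```shell
#!/bin/sh
# Votes needed for quorum under the default majority rule:
# floor(expected_votes / 2) + 1. quorum_needed is a hypothetical helper.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

for expected in 1 2 3; do
    printf 'expected=%s quorum=%s\n' "$expected" "$(quorum_needed "$expected")"
done
```

With 1 expected vote the single node is quorate on its own, as in the earlier status output. Once the join request raises expected votes to 2, both votes are needed, so yster4 alone (1 vote) is blocked until yster3 actually joins the membership.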

Also:
Code:
root@yster4:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-06-20 21:13:15 SAST; 6min ago
  Process: 181000 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
  Process: 180938 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
 Main PID: 180972 (pmxcfs)
    Tasks: 7 (limit: 4915)
   Memory: 41.4M
      CPU: 771ms
   CGroup: /system.slice/pve-cluster.service
           └─180972 /usr/bin/pmxcfs

Jun 20 21:13:14 yster4 pmxcfs[180972]: [status] notice: update cluster info (cluster name  IMB-production, version = 6)
Jun 20 21:13:14 yster4 pmxcfs[180972]: [dcdb] notice: members: 1/180972
Jun 20 21:13:14 yster4 pmxcfs[180972]: [dcdb] notice: all data is up to date
Jun 20 21:13:14 yster4 pmxcfs[180972]: [status] notice: members: 1/180972
Jun 20 21:13:14 yster4 pmxcfs[180972]: [status] notice: all data is up to date
Jun 20 21:13:15 yster4 systemd[1]: Started The Proxmox VE cluster filesystem.
Jun 20 21:13:50 yster4 pmxcfs[180972]: [status] notice: node has quorum
Jun 20 21:15:16 yster4 pmxcfs[180972]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 7)
Jun 20 21:15:16 yster4 pmxcfs[180972]: [status] notice: node lost quorum
Jun 20 21:15:16 yster4 pmxcfs[180972]: [status] notice: update cluster info (cluster name  IMB-production, version = 7)
 

lifeboy

Active Member
In syslog I now see:
Code:
/etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1643
so I forced quorum with
Code:
pvecm expected 1
and ran
Code:
pvecm updatecerts
and then
Code:
pvecm add yster4 --force 1
and the node was added!
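The three commands above can be collected into a dry-run sketch like this. The `run` and `recover_join` helpers and the `DRY_RUN` variable are hypothetical names introduced here. The post does not say which node each command was run on, so the sketch simply assumes a single shell and only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run wrapper: print the command by default, execute it only when
# DRY_RUN=0. run, recover_join and DRY_RUN are hypothetical names.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Same order as in the post: lower expected votes, refresh the certificates,
# then force the join against the existing cluster node.
recover_join() {
    peer="$1"
    run pvecm expected 1
    run pvecm updatecerts
    run pvecm add "$peer" --force 1
}

recover_join yster4
```

Printing first makes it easy to review the sequence before anything touches the cluster; forcing a join is not something to run blindly on a node that holds guests.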

However, I still have an error, "pvesr[2013]: trying to aquire cfs lock 'file-replication_cfg'", and the GUI gives an error when I click yster3:
Code:
Connection error 596: tls_process_server_certificate: certificate verify failed
 
