Tried to add a new server to our cluster - it failed

May 17, 2019
Hi,

Usually I add a new server to our Proxmox cluster via the web UI, but this time it somehow failed. Now I'm stuck in an undefined state; I already tried "pvecm add 192.168.52.76 --force", but it waits for quorum forever.

pvecm status on the new server:
Quorum information
------------------
Date: Fri Jan 24 13:29:55 2020
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000008
Ring ID: 8/16
Quorate: No

Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 1
Quorum: 5 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000008 1 192.168.52.108 (local)

pvecm status on the cluster:
Quorum information
------------------
Date: Fri Jan 24 13:30:52 2020
Quorum provider: corosync_votequorum
Nodes: 7
Node ID: 0x00000001
Ring ID: 1/2020
Quorate: Yes

Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 7
Quorum: 5
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.52.76 (local)
0x00000002 1 192.168.52.77
0x00000003 1 192.168.52.83
0x00000004 1 192.168.52.85
0x00000005 1 192.168.52.87
0x00000006 1 192.168.52.88
0x00000007 1 192.168.52.104

I've set up the new node like the other servers; the only difference is a 2x10Gbit bonded interface (LACP) with VLANs on it. Network connectivity is fine, I can SSH everywhere.

What would be my best bet to fix this without reinstalling?

Edit (Added interfaces):
auto lo
iface lo inet loopback

# bonding interface
auto eno1np0
iface eno1np0 inet manual

auto eno2np1
iface eno2np1 inet manual

auto bond0
iface bond0 inet manual
slaves eno1np0 eno2np1
bond_miimon 100
bond_mode 802.3ad
bond_xmit_hash_policy layer2+3

auto vmbr0
iface vmbr0 inet static
address 192.168.52.108
netmask 255.255.254.0
gateway 192.168.52.1
bridge_ports bond0
bridge_stp off
bridge_fd 0

iface bond0.1 inet manual
vlan-raw-device bond0

auto vmbr1
iface vmbr1 inet manual
bridge-ports bond0.1
bridge-stp off
bridge-fd 0
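As a quick sanity check for a bond like the one above (interface name taken from the config; this is a sketch, not part of the original post), the kernel's bonding status file shows whether LACP actually negotiated:

```shell
# Verify the 802.3ad bond came up correctly: both slaves should be listed
# with MII Status "up" and share the same Aggregator ID.
grep -E 'Bonding Mode|MII Status|Aggregator ID|Slave Interface' \
    /proc/net/bonding/bond0
```

If the slaves show different aggregator IDs, the switch side of the LACP channel is usually misconfigured.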

Thanks & Cheers,
Daniel
 

Can you check /etc/pve/corosync.conf and see if the new node is referenced there correctly (name and IP)?
If not, fix the info (if unsure, post the config here) and make sure to increase config_version in that case.

Then copy it over to the new node, but to another path, namely /etc/corosync/corosync.conf, and restart corosync.service (via the web interface or with systemctl restart corosync).

Then check with pvecm status if all is well; otherwise the authkey may also be missing - copy that file from another node to /etc/corosync/ at the same location on the new node.

If all is well, also restart the cluster filesystem: systemctl restart pve-cluster

That'd be the first objective: getting corosync and the clustered configuration filesystem to run again. We should then also ensure the new node has a valid certificate signed by the cluster CA - simply try to access the PVE web interface of the new node directly after the fixes above; if it doesn't work, that may be the cause. Hope that helps; post specific error messages if you run into any.
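Condensed into commands, those steps look roughly like this (the hostname pm-08 is the node name used later in this thread; adjust to your setup):

```shell
# On a healthy cluster node: push the cluster-wide corosync config
# (and the authkey, in case it is missing) to the stuck node.
scp /etc/pve/corosync.conf root@pm-08:/etc/corosync/corosync.conf
scp /etc/corosync/authkey  root@pm-08:/etc/corosync/authkey

# On the stuck node: restart corosync, check quorum, then restart pmxcfs.
systemctl restart corosync
pvecm status
systemctl restart pve-cluster
```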
 

Thanks for the input. I've checked /etc/pve/corosync.conf on the cluster and the new node, and they have the same checksum. The authkey file also has the correct checksum.

Cluster
root@pm-01:/etc/corosync# md5sum corosync.conf
f5861ed0e241f59b39ded42b1a88ea2b corosync.conf
root@pm-01:/etc/corosync# md5sum authkey
4474bcb18da76330cb80a4385c7c533f authkey

New Node:
root@pm-08:/etc/corosync# md5sum corosync.conf
f5861ed0e241f59b39ded42b1a88ea2b corosync.conf
root@pm-08:/etc/corosync# md5sum authkey
4474bcb18da76330cb80a4385c7c533f authkey

Node reference on new system:
node {
name: pm-08
nodeid: 8
quorum_votes: 1
ring0_addr: 192.168.52.108
}

That information is also correct. I've already tried a complete reboot of the system, but it did not help. Also, the web UI is not reachable anymore on the new node since I tried to join the cluster.

/etc/pve looks completely different on the cluster and the new node.

Cheers,
Daniel

Edit:
● pvesr.service loaded failed failed Proxmox VE replication runner
from /var/log/daemon.log
Jan 27 17:53:21 pm-08 pveproxy[37185]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1687.
Jan 27 17:53:21 pm-08 pveproxy[16917]: worker 37174 finished
Jan 27 17:53:21 pm-08 pveproxy[16917]: starting 1 worker(s)
Jan 27 17:53:21 pm-08 pveproxy[16917]: worker 37186 started
Jan 27 17:53:21 pm-08 pveproxy[37186]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1687.
Jan 27 17:53:26 pm-08 pveproxy[37184]: worker exit
Jan 27 17:53:26 pm-08 pveproxy[16917]: worker 37184 finished
Jan 27 17:53:26 pm-08 pveproxy[16917]: starting 1 worker(s)
Jan 27 17:53:26 pm-08 pveproxy[16917]: worker 37234 started
Jan 27 17:53:26 pm-08 pveproxy[37234]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1687.
Jan 27 17:53:26 pm-08 pveproxy[37185]: worker exit
Jan 27 17:53:26 pm-08 pveproxy[37186]: worker exit
Jan 27 17:53:26 pm-08 pveproxy[16917]: worker 37185 finished
Jan 27 17:53:26 pm-08 pveproxy[16917]: starting 1 worker(s)
Jan 27 17:53:26 pm-08 pveproxy[16917]: worker 37235 started
Jan 27 17:53:26 pm-08 pveproxy[37235]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1687.
Jan 27 17:53:26 pm-08 pveproxy[16917]: worker 37186 finished
Jan 27 17:53:26 pm-08 pveproxy[16917]: starting 1 worker(s)
Jan 27 17:53:26 pm-08 pveproxy[16917]: worker 37236 started
Jan 27 17:53:26 pm-08 pveproxy[37236]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1687.
root@pm-08:/var/log# ls /etc/pve/local/pve-ssl.key
ls: cannot access '/etc/pve/local/pve-ssl.key': No such file or directory

Might this be the reason i can't join the cluster?
 

It is the reason the web interface does not work, but not the reason the node cannot join the cluster itself.

Are the cluster-related services up and running? Check with: systemctl status corosync pve-cluster
If not, please start them (same command, with "status" swapped for "start").

Once that runs, which it should if the corosync config is OK and the authkey is there, you can (re-)generate the certs by issuing the following command on the problematic node: pvecm updatecerts
 

Both services have been up and running for 16 hours (since the reboot).

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-01-27 17:18:06 CET; 16h ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 16669 (corosync)
Tasks: 2 (limit: 9830)
Memory: 41.3M
CPU: 6min 18.923s
CGroup: /system.slice/corosync.service
└─16669 /usr/sbin/corosync -f

Jan 27 17:18:06 pm-08 corosync[16669]: [SERV ] Service engine loaded: corosync watchdog service [7]
Jan 27 17:18:06 pm-08 corosync[16669]: [QUORUM] Using quorum provider corosync_votequorum
Jan 27 17:18:06 pm-08 corosync[16669]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 27 17:18:06 pm-08 corosync[16669]: [QB ] server name: votequorum
Jan 27 17:18:06 pm-08 corosync[16669]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 27 17:18:06 pm-08 corosync[16669]: [QB ] server name: quorum
Jan 27 17:18:06 pm-08 corosync[16669]: [TOTEM ] A new membership (192.168.52.108:32) was formed. Members joined: 8
Jan 27 17:18:06 pm-08 corosync[16669]: [CPG ] downlist left_list: 0 received
Jan 27 17:18:06 pm-08 corosync[16669]: [QUORUM] Members[1]: 8
Jan 27 17:18:06 pm-08 corosync[16669]: [MAIN ] Completed service synchronization, ready to provide service.

● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-01-27 17:23:17 CET; 16h ago
Process: 23940 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 23869 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 23881 (pmxcfs)
Tasks: 7 (limit: 9830)
Memory: 27.2M
CPU: 32.144s
CGroup: /system.slice/pve-cluster.service
└─23881 /usr/bin/pmxcfs

Jan 28 00:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful
Jan 28 01:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful
Jan 28 02:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful
Jan 28 03:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful
Jan 28 04:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful
Jan 28 05:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful
Jan 28 06:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful
Jan 28 07:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful
Jan 28 08:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful
Jan 28 09:23:16 pm-08 pmxcfs[23881]: [dcdb] notice: data verification successful

pvecm status still looks the same
root@pm-08:~# pvecm status
Quorum information
------------------
Date: Tue Jan 28 09:34:26 2020
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000008
Ring ID: 8/32
Quorate: No

Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 1
Quorum: 5 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000008 1 192.168.52.108 (local)

root@pm-01:~# pvecm status
Quorum information
------------------
Date: Tue Jan 28 09:36:03 2020
Quorum provider: corosync_votequorum
Nodes: 7
Node ID: 0x00000001
Ring ID: 1/2020
Quorate: Yes

Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 7
Quorum: 5
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.52.76 (local)
0x00000002 1 192.168.52.77
0x00000003 1 192.168.52.83
0x00000004 1 192.168.52.85
0x00000005 1 192.168.52.87
0x00000006 1 192.168.52.88
0x00000007 1 192.168.52.104

I did a "pvecm delnode pm-08" on my cluster and removed it. I tried to re-add it as it is now, but I got:

"trying to acquire lock...
can't lock file '/var/lock/pvecm.lock' - got timeout"

I'll reinstall the server and try to add it again.

pm-08 is still shown in my web UI; how do I get rid of that?
 
Jan 27 17:18:06 pm-08 corosync[16669]: [TOTEM ] A new membership (192.168.52.108:32) was formed. Members joined: 8
Jan 27 17:18:06 pm-08 corosync[16669]: [CPG ] downlist left_list: 0 received
Jan 27 17:18:06 pm-08 corosync[16669]: [QUORUM] Members[1]: 8

Yes, the node just saw itself on the network. All nodes must be able to connect to each other via UDP ports 5404 and 5405 for corosync to work. This definitely looks like a network issue - everything else seemed OK.
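On corosync 2.x (PVE 5.x) the traffic in question is multicast, and a common way to verify it end-to-end is omping, run simultaneously on all nodes (the hostnames below are assumed from this thread's naming scheme):

```shell
# Quick burst test: each node should report ~0% multicast loss
# from all of the others.
omping -c 10000 -i 0.001 -F -q pm-01 pm-02 pm-08

# Longer (~10 minute) test to catch IGMP snooping timeouts, where
# multicast works at first and then silently drops after ~5 minutes:
omping -c 600 -i 1 -q pm-01 pm-02 pm-08
```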
 

Alright, then it is a problem with the LACP interface. They are already on the same network, and I can also SSH from one Proxmox host to another.
 
We found the issue. Cisco sucks balls. We got new Cisco Nexus switches, and we had to configure the VLAN like this to get multicast working:

vlan configuration 7
ip igmp snooping querier X.X.X.X

Where X.X.X.X is a free IP address in the subnet of your Proxmox hosts.
 
Wait, do you use PVE 5.x? The newer Proxmox VE 6.x series doesn't require multicast traffic anymore.

But yeah, enabling an IGMP querier or disabling IGMP snooping is definitely a common fix - I'm just so used to the new version already that it didn't even come to my mind that you might not be on it yet.

Glad you could solve it, though!
 

We plan to upgrade, and yes, I've seen that unicast instead of multicast is used in the newer versions. However, I can't upgrade at the moment, because first I need a server to migrate VMs to. The cluster has 8 nodes now, and we plan to do a RAM/network upgrade before going to Proxmox 6.x. Migrating over a single 1Gbit interface is just too slow. It will all happen within the next few weeks, I hope.
 
