PVE 3.4 Cluster - failed node recovery

ronsrussell

Renowned Member
Mar 9, 2011
I have a four-node PVE cluster. All nodes are licensed with a PVE Community Subscription.
Node one has failed and must be replaced.
The cluster and all VMs are working fine on the remaining three nodes.
Following suggestions in other posts, I have re-installed Proxmox on new hardware to replace the failed node one.
But when I add it to the cluster, the node shows up in the GUI, yet the CLI on the node keeps timing out when trying to connect.
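
For context, the standard way to replace a dead node on a PVE 3.x cluster is roughly the following (pmc2's address here is just a placeholder for any healthy member):

# on one of the healthy nodes, remove the dead member from the cluster first
pvecm delnode pmc1
# on the freshly re-installed pmc1, join the cluster via a healthy node
pvecm add <IP-of-pmc2>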

Here is the last part of the message log (I could not include all of it due to size) -

Sep 27 21:51:05 pmc1 kernel: DLM (built Sep 12 2015 12:55:41) installed
Sep 27 21:51:05 pmc1 corosync[4028]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Sep 27 21:51:05 pmc1 corosync[4028]: [MAIN ] Corosync built-in features: nss
Sep 27 21:51:05 pmc1 corosync[4028]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Sep 27 21:51:05 pmc1 corosync[4028]: [MAIN ] Successfully parsed cman config
Sep 27 21:51:05 pmc1 corosync[4028]: [MAIN ] Successfully configured openais services to load
Sep 27 21:51:05 pmc1 corosync[4028]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Sep 27 21:51:05 pmc1 corosync[4028]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 27 21:51:05 pmc1 corosync[4028]: [TOTEM ] The network interface is down.
Sep 27 21:51:05 pmc1 corosync[4028]: [QUORUM] Using quorum provider quorum_cman
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 27 21:51:05 pmc1 corosync[4028]: [CMAN ] CMAN 1364188437 (built Mar 25 2013 06:14:01) started
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: openais cluster membership service B.01.01
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: openais event service B.01.01
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: openais message service B.03.01
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: openais distributed locking service B.03.01
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: openais timer service A.01.01
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: corosync configuration service
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: corosync profile loading service
Sep 27 21:51:05 pmc1 corosync[4028]: [QUORUM] Using quorum provider quorum_cman
Sep 27 21:51:05 pmc1 corosync[4028]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 27 21:51:05 pmc1 corosync[4028]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] CLM CONFIGURATION CHANGE
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] New Configuration:
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] Members Left:
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] Members Joined:
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] CLM CONFIGURATION CHANGE
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] New Configuration:
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] #011r(0) ip(127.0.0.1)
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] Members Left:
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] Members Joined:
Sep 27 21:51:05 pmc1 corosync[4028]: [CLM ] #011r(0) ip(127.0.0.1)
Sep 27 21:51:05 pmc1 corosync[4028]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 27 21:51:05 pmc1 corosync[4028]: [QUORUM] Members[1]: 1
Sep 27 21:51:05 pmc1 corosync[4028]: [QUORUM] Members[1]: 1
Sep 27 21:51:05 pmc1 corosync[4028]: [CPG ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:0 left:0)
Sep 27 21:51:05 pmc1 corosync[4028]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 27 21:51:55 pmc1 kernel: Netfilter messages via NETLINK v0.30.
Sep 27 21:51:55 pmc1 kernel: kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
Sep 27 21:51:55 pmc1 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Sep 27 21:51:55 pmc1 kernel: tun: Universal TUN/TAP device driver, 1.6
Sep 27 21:51:55 pmc1 kernel: tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Sep 27 21:51:55 pmc1 kernel: ip6_tables: (C) 2000-2006 Netfilter Core Team
Sep 27 21:51:55 pmc1 kernel: Enabling conntracks and NAT for ve0
Sep 27 21:51:55 pmc1 kernel: nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Sep 27 21:51:55 pmc1 kernel: ploop_dev: module loaded
Sep 27 21:51:56 pmc1 kernel: ip_set: protocol 6
Sep 27 21:51:58 pmc1 pvesh: <root@pam> starting task UPID:pmc1:000010DD:000021E6:56089D3E:startall::root@pam:

Here is the part of the syslog that keeps repeating -

Sep 29 07:26:44 pmc1 pveproxy[225574]: worker exit
Sep 29 07:26:44 pmc1 pveproxy[225575]: worker exit
Sep 29 07:26:44 pmc1 pveproxy[4291]: worker 225574 finished
Sep 29 07:26:44 pmc1 pveproxy[4291]: starting 1 worker(s)
Sep 29 07:26:44 pmc1 pveproxy[4291]: worker 225575 finished
Sep 29 07:26:44 pmc1 pveproxy[4291]: worker 225579 started
Sep 29 07:26:44 pmc1 pveproxy[225579]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/HTTPServer.pm line 1634
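
If it helps with the diagnosis, these are the checks that can be run on pmc1 to see whether the pmxcfs cluster filesystem is mounted and in sync (standard commands on any PVE 3.x node):

mount | grep /etc/pve     # pmxcfs should appear as a fuse mount on /etc/pve
ls -l /etc/pve/local/     # pve-ssl.key and pve-ssl.pem should exist here once the node is in sync
pvecm status              # membership and quorum from this node's point of view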

I'm looking for some guidance to get node one back into the cluster.
 
Here is the output from node one, which is the one we are attempting to get back into the cluster -

root@pmc1:~# pveversion -v
proxmox-ve-2.6.32: 3.4-163 (running kernel: 2.6.32-41-pve)
pve-manager: 3.4-11 (running version: 3.4-11/6502936f)
pve-kernel-2.6.32-41-pve: 2.6.32-163
pve-kernel-2.6.32-37-pve: 2.6.32-150
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-19
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-11
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
root@pmc1:~#

And here is the output from one of the three nodes that are still in the cluster -

root@pmc2:~# pveversion -v
proxmox-ve-2.6.32: 3.4-163 (running kernel: 2.6.32-41-pve)
pve-manager: 3.4-11 (running version: 3.4-11/6502936f)
pve-kernel-2.6.32-40-pve: 2.6.32-160
pve-kernel-2.6.32-39-pve: 2.6.32-157
pve-kernel-2.6.32-41-pve: 2.6.32-163
pve-kernel-2.6.32-37-pve: 2.6.32-150
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-19
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-11
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
root@pmc2:~#
 
Please verify that /etc/pve/cluster.conf is the same on all nodes. If so, please post the file here.
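
For example, a quick way to compare them is to checksum the file on every node (assuming root SSH between the nodes works):

for n in pmc1 pmc2 pmc3 pmc4; do ssh $n md5sum /etc/pve/cluster.conf; done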
 
Yes, /etc/pve/cluster.conf is the same on all four nodes.

<?xml version="1.0"?>
<cluster name="PPC-Office" config_version="6">

<cman keyfile="/var/lib/pve-cluster/corosync.authkey">
</cman>

<clusternodes>

<clusternode name="pmc2" votes="1" nodeid="2"/>
<clusternode name="pmc3" votes="1" nodeid="3"/>
<clusternode name="pmc4" votes="1" nodeid="4"/>
<clusternode name="pmc1" votes="1" nodeid="1"/>
</clusternodes>

</cluster>
 
Also check whether your hostname is resolvable via /etc/hosts. Can you please post the /etc/hosts file?
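
For example, on the new node (hostname taken from your logs):

getent hosts pmc1
hostname --ip-address

Both should return the node's real LAN address, not 127.0.0.1.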
 
The omping test was successful.
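
The test was essentially the standard multicast check, started on all four nodes at the same time, roughly:

omping -c 600 -i 1 -q pmc1 pmc2 pmc3 pmc4
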
But the /etc/hosts file revealed the problem. We had installed Proxmox while this server was on a different network and later changed the IP addresses by editing /etc/network/interfaces, but we never updated the hosts file. Now that the hosts file is correct, cman starts and pvecm status looks good -

root@pmc1:/# pvecm status
Version: 6.2.0
Config Version: 6
Cluster Name: PPC-Office
Cluster Id: 13055
Cluster Member: Yes
Cluster Generation: 1256
Membership state: Cluster-Member
Nodes: 4
Expected votes: 4
Total votes: 4
Node votes: 1
Quorum: 3
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: pmc1
Node ID: 1
Multicast addresses: 239.192.50.50
Node addresses: 192.168.35.231
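
For anyone running into the same thing, the corrected entry on pmc1 looks roughly like this (hostname and address as in the pvecm output above; the domain part is just a placeholder):

127.0.0.1 localhost.localdomain localhost
192.168.35.231 pmc1.yourdomain.tld pmc1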

Thanks much for your support - Ron
 
