[SOLVED] 2-node cluster failed when adding a 3rd node

MRosu

I had a nicely working 2-node cluster (vmhost1 and vmhost2).

I installed a 3rd node (vmhost3) and used the join information to add it, but now the cluster is broken:

vmhost2 syslog:

Code:
Jun 01 07:48:21 vmhost2 pvescheduler[2452666]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jun 01 07:48:21 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 50
Jun 01 07:48:22 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 60
Jun 01 07:48:23 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 70
Jun 01 07:48:24 vmhost2 corosync[1407]:   [QUORUM] Sync members[2]: 1 2
Jun 01 07:48:24 vmhost2 corosync[1407]:   [TOTEM ] A new membership (1.116d) was formed. Members
Jun 01 07:48:24 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 80
Jun 01 07:48:25 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 90
Jun 01 07:48:26 vmhost2 corosync[1407]:   [TOTEM ] Token has not been received in 2250 ms
Jun 01 07:48:26 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 100
Jun 01 07:48:26 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retried 100 times
Jun 01 07:48:26 vmhost2 pmxcfs[1294]: [status] crit: cpg_send_message failed: 6
Jun 01 07:48:26 vmhost2 pve-firewall[1440]: firewall update time (20.056 seconds)
Jun 01 07:48:27 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 10
Jun 01 07:48:28 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 20
Jun 01 07:48:29 vmhost2 corosync[1407]:   [TOTEM ] Token has not been received in 5252 ms
Jun 01 07:48:29 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 30
Jun 01 07:48:30 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 40
Jun 01 07:48:31 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 50
Jun 01 07:48:32 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 60
Jun 01 07:48:32 vmhost2 corosync[1407]:   [QUORUM] Sync members[2]: 1 2
Jun 01 07:48:32 vmhost2 corosync[1407]:   [TOTEM ] A new membership (1.1181) was formed. Members
Jun 01 07:48:33 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 70
Jun 01 07:48:34 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 80
Jun 01 07:48:35 vmhost2 corosync[1407]:   [TOTEM ] Token has not been received in 2251 ms
Jun 01 07:48:35 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 90
Jun 01 07:48:36 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 100
Jun 01 07:48:36 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retried 100 times
Jun 01 07:48:36 vmhost2 pmxcfs[1294]: [status] crit: cpg_send_message failed: 6
Jun 01 07:48:36 vmhost2 pvestatd[1439]: status update time (90.177 seconds)
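
The token timeouts above suggest the corosync membership kept flapping between the two original nodes. For reference, these standard, read-only commands show link and quorum state while this is happening:

Code:
# run on any node to inspect corosync link and quorum state
pvecm status
corosync-cfgtool -s
corosync-quorumtool -s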

vmhost3 syslog:

Code:
Jun  1 08:04:17 vmhost3 pveproxy[6064]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:17 vmhost3 pveproxy[6065]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:19 vmhost3 corosync[860]:   [QUORUM] Sync members[1]: 3
Jun  1 08:04:19 vmhost3 corosync[860]:   [TOTEM ] A new membership (3.19f1) was formed. Members
Jun  1 08:04:19 vmhost3 corosync[860]:   [QUORUM] Members[1]: 3
Jun  1 08:04:19 vmhost3 corosync[860]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun  1 08:04:22 vmhost3 pveproxy[6063]: worker exit
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6063 finished
Jun  1 08:04:22 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6066 started
Jun  1 08:04:22 vmhost3 pveproxy[6066]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:22 vmhost3 pveproxy[6064]: worker exit
Jun  1 08:04:22 vmhost3 pveproxy[6065]: worker exit
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6064 finished
Jun  1 08:04:22 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6067 started
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6065 finished
Jun  1 08:04:22 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6068 started
Jun  1 08:04:22 vmhost3 pveproxy[6067]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:22 vmhost3 pveproxy[6068]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:27 vmhost3 pveproxy[6066]: worker exit
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6066 finished
Jun  1 08:04:27 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6088 started
Jun  1 08:04:27 vmhost3 pveproxy[6088]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:27 vmhost3 pveproxy[6067]: worker exit
Jun  1 08:04:27 vmhost3 pveproxy[6068]: worker exit
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6067 finished
Jun  1 08:04:27 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6089 started
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6068 finished
Jun  1 08:04:27 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6090 started
Jun  1 08:04:27 vmhost3 pveproxy[6089]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:27 vmhost3 pveproxy[6090]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
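
As far as I can tell, those pve-ssl.key failures just mean pveproxy cannot read the node certificate out of /etc/pve, which fits the missing nodes directory noted below. Once /etc/pve is healthy again (i.e. the node has quorum and the filesystem is writable), the certificates can be regenerated with the standard command:

Code:
# regenerate the node certificates, then restart the web proxy
pvecm updatecerts --force
systemctl restart pveproxy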

On vmhost3, the folder /etc/pve/nodes is completely missing.
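
A quick, non-destructive way to confirm the state of the cluster filesystem on vmhost3:

Code:
# is pmxcfs mounted, and what does it currently contain?
mount | grep /etc/pve
ls -la /etc/pve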

corosync.conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vmhost1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.45.10
  }
  node {
    name: vmhost2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.45.11
  }
  node {
    name: vmhost3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.45.12
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: MIHAILAND
  config_version: 9
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
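
If corosync.conf ever needs a manual fix, my understanding of the documented pattern is to edit a copy under /etc/pve (so pmxcfs propagates it to all nodes) and bump config_version, roughly:

Code:
# edit the cluster-wide copy; pmxcfs syncs it out to /etc/corosync/ everywhere
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new    # remember to increment config_version
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf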

systemctl status -l pve-cluster on vmhost3:

Code:
 pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-06-01 07:34:44 CDT; 31min ago
    Process: 732 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 749 (pmxcfs)
      Tasks: 6 (limit: 18974)
     Memory: 32.3M
        CPU: 928ms
     CGroup: /system.slice/pve-cluster.service
             └─749 /usr/bin/pmxcfs

Jun 01 07:34:43 vmhost3 pmxcfs[749]: [dcdb] crit: cpg_initialize failed: 2
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [dcdb] crit: can't initialize service
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [status] crit: cpg_initialize failed: 2
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [status] crit: can't initialize service
Jun 01 07:34:44 vmhost3 systemd[1]: Started The Proxmox VE cluster filesystem.
Jun 01 07:34:49 vmhost3 pmxcfs[749]: [status] notice: update cluster info (cluster name  MIHAILAND, version = 9)
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [dcdb] notice: members: 3/749
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [dcdb] notice: all data is up to date
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [status] notice: members: 3/749
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [status] notice: all data is up to date
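
From what I understand, the cpg_initialize failed: 2 lines just mean pmxcfs came up before corosync was ready; restarting the two services in order rules that out:

Code:
# on vmhost3: bring corosync up first, then the cluster filesystem
systemctl restart corosync
systemctl restart pve-cluster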

vmhost3 hosts file:

Code:
127.0.0.1 localhost.localdomain localhost
192.168.45.12 vmhost3.internal.mihailand.com vmhost3

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

vmhost3 hostname file:

Code:
vmhost3

The network is a single flat 192.168.45.0/24 segment, as can be seen from the corosync.conf above.

The network connections worked perfectly fine prior to this. As I mentioned, I was able to log into vmhost3's GUI to join the cluster, and SSH still works, so I don't see how it could be a network problem.

I have a dumb switch connecting all 3 nodes.

I can ping from each host to every other host.
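
Plain ping doesn't really exercise the corosync traffic, though; the Proxmox docs suggest omping for that (run it simultaneously on all three nodes):

Code:
# quick membership/latency check over the cluster network (package: omping)
omping -c 10000 -i 0.001 -F -q 192.168.45.10 192.168.45.11 192.168.45.12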

All nodes are very close in versions:

vmhost3:

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.1-1
proxmox-backup-file-restore: 2.2.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

vmhost3 network interfaces (/etc/network/interfaces):

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.45.12/24
    gateway 192.168.45.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

I tried to update vmhost1, but it times out when processing triggers for pve-manager, so I think I made things worse there:

Code:
Processing triggers for pve-manager (7.2-4) ...
got timeout
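
That "got timeout" during the pve-manager trigger is, as far as I can tell, the package scripts waiting on the hung cluster filesystem. Once quorum is back, re-running the configure step should finish the interrupted upgrade (a standard dpkg command, nothing Proxmox-specific):

Code:
# on vmhost1, after quorum is restored: finish the interrupted upgrade
dpkg --configure -a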

Restarting vmhost3 did not help.

Is there a way to salvage this cluster? I would prefer not to have to re-create it.
 
Okay, I was able to remove vmhost3 from the cluster using the following steps:

Code:
# stop the cluster stack
systemctl stop pve-cluster corosync
# start pmxcfs in local mode so /etc/pve is accessible without quorum
pmxcfs -l
# remove the corosync configuration
rm /etc/corosync/*
rm /etc/pve/corosync.conf
# stop local-mode pmxcfs and bring the service back up normally
killall pmxcfs
systemctl start pve-cluster

The original cluster is now working again. I will try to remove all traces of vmhost3 from the original cluster so that I can hopefully re-add it.
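
For that cleanup, the standard pvecm procedure on one of the remaining nodes should cover it; a sketch (the rm path is the usual leftover under the cluster filesystem):

Code:
# on vmhost1 or vmhost2: drop vmhost3 from the member list
pvecm delnode vmhost3
# remove the stale node directory from the cluster filesystem
rm -r /etc/pve/nodes/vmhost3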
 
SOLVED: I did not choose the Cluster network when joining the cluster.

The join dialog probably shouldn't let you continue without making that choice, but it's my fault.
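
For anyone hitting this from the CLI instead of the GUI, pvecm lets you pin the cluster-network link explicitly when joining (the addresses below are the ones from my setup):

Code:
# run on the joining node (vmhost3); --link0 sets the local cluster-link address
pvecm add 192.168.45.10 --link0 192.168.45.12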
 