[SOLVED] 2 node cluster failed when adding 3rd node

MRosu

I had a nicely working 2-node cluster (vmhost1 and vmhost2).

I installed a 3rd node (vmhost3) and used the join information to add the node, but the cluster is broken:

vmhost2 syslog:

Code:
Jun 01 07:48:21 vmhost2 pvescheduler[2452666]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jun 01 07:48:21 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 50
Jun 01 07:48:22 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 60
Jun 01 07:48:23 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 70
Jun 01 07:48:24 vmhost2 corosync[1407]:   [QUORUM] Sync members[2]: 1 2
Jun 01 07:48:24 vmhost2 corosync[1407]:   [TOTEM ] A new membership (1.116d) was formed. Members
Jun 01 07:48:24 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 80
Jun 01 07:48:25 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 90
Jun 01 07:48:26 vmhost2 corosync[1407]:   [TOTEM ] Token has not been received in 2250 ms
Jun 01 07:48:26 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 100
Jun 01 07:48:26 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retried 100 times
Jun 01 07:48:26 vmhost2 pmxcfs[1294]: [status] crit: cpg_send_message failed: 6
Jun 01 07:48:26 vmhost2 pve-firewall[1440]: firewall update time (20.056 seconds)
Jun 01 07:48:27 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 10
Jun 01 07:48:28 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 20
Jun 01 07:48:29 vmhost2 corosync[1407]:   [TOTEM ] Token has not been received in 5252 ms
Jun 01 07:48:29 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 30
Jun 01 07:48:30 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 40
Jun 01 07:48:31 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 50
Jun 01 07:48:32 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 60
Jun 01 07:48:32 vmhost2 corosync[1407]:   [QUORUM] Sync members[2]: 1 2
Jun 01 07:48:32 vmhost2 corosync[1407]:   [TOTEM ] A new membership (1.1181) was formed. Members
Jun 01 07:48:33 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 70
Jun 01 07:48:34 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 80
Jun 01 07:48:35 vmhost2 corosync[1407]:   [TOTEM ] Token has not been received in 2251 ms
Jun 01 07:48:35 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 90
Jun 01 07:48:36 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 100
Jun 01 07:48:36 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retried 100 times
Jun 01 07:48:36 vmhost2 pmxcfs[1294]: [status] crit: cpg_send_message failed: 6
Jun 01 07:48:36 vmhost2 pvestatd[1439]: status update time (90.177 seconds)
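
For what it's worth, those repeated cpg_send_message retries mean pmxcfs cannot get its updates delivered through corosync while the membership keeps reforming (error 6 is corosync's CS_ERR_TRY_AGAIN, i.e. "try again later"). Quorum and membership can be watched live with the standard tooling:

Code:
corosync-quorumtool -s    # current quorum state and member list
pvecm status              # Proxmox wrapper showing similar information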

vmhost3 syslog:

Code:
Jun  1 08:04:17 vmhost3 pveproxy[6064]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:17 vmhost3 pveproxy[6065]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:19 vmhost3 corosync[860]:   [QUORUM] Sync members[1]: 3
Jun  1 08:04:19 vmhost3 corosync[860]:   [TOTEM ] A new membership (3.19f1) was formed. Members
Jun  1 08:04:19 vmhost3 corosync[860]:   [QUORUM] Members[1]: 3
Jun  1 08:04:19 vmhost3 corosync[860]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun  1 08:04:22 vmhost3 pveproxy[6063]: worker exit
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6063 finished
Jun  1 08:04:22 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6066 started
Jun  1 08:04:22 vmhost3 pveproxy[6066]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:22 vmhost3 pveproxy[6064]: worker exit
Jun  1 08:04:22 vmhost3 pveproxy[6065]: worker exit
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6064 finished
Jun  1 08:04:22 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6067 started
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6065 finished
Jun  1 08:04:22 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:22 vmhost3 pveproxy[915]: worker 6068 started
Jun  1 08:04:22 vmhost3 pveproxy[6067]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:22 vmhost3 pveproxy[6068]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:27 vmhost3 pveproxy[6066]: worker exit
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6066 finished
Jun  1 08:04:27 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6088 started
Jun  1 08:04:27 vmhost3 pveproxy[6088]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:27 vmhost3 pveproxy[6067]: worker exit
Jun  1 08:04:27 vmhost3 pveproxy[6068]: worker exit
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6067 finished
Jun  1 08:04:27 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6089 started
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6068 finished
Jun  1 08:04:27 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun  1 08:04:27 vmhost3 pveproxy[915]: worker 6090 started
Jun  1 08:04:27 vmhost3 pveproxy[6089]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun  1 08:04:27 vmhost3 pveproxy[6090]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.

On vmhost3, the folder /etc/pve/nodes is completely missing.
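
Both symptoms point at the same thing: /etc/pve is not a regular directory but a FUSE mount served by pmxcfs, and the per-node files (including the pve-ssl.key that pveproxy complains about above) only appear once the node has actually synced with the cluster. A quick check, plus the documented command to regenerate the certificates once the filesystem is writable again:

Code:
findmnt /etc/pve             # should show a /dev/fuse mount served by pmxcfs
pvecm updatecerts --force    # regenerate node certificates (only once /etc/pve works again)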

corosync.conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vmhost1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.45.10
  }
  node {
    name: vmhost2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.45.11
  }
  node {
    name: vmhost3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.45.12
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: MIHAILAND
  config_version: 9
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
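
Worth knowing here: Proxmox keeps this file in two places, the cluster-wide copy on pmxcfs and the local copy corosync actually reads, and a botched join can leave them out of sync:

Code:
# compare the cluster-wide copy with the local one corosync reads at startup
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf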

systemctl status -l pve-cluster (on vmhost3):

Code:
 pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-06-01 07:34:44 CDT; 31min ago
    Process: 732 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 749 (pmxcfs)
      Tasks: 6 (limit: 18974)
     Memory: 32.3M
        CPU: 928ms
     CGroup: /system.slice/pve-cluster.service
             └─749 /usr/bin/pmxcfs

Jun 01 07:34:43 vmhost3 pmxcfs[749]: [dcdb] crit: cpg_initialize failed: 2
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [dcdb] crit: can't initialize service
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [status] crit: cpg_initialize failed: 2
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [status] crit: can't initialize service
Jun 01 07:34:44 vmhost3 systemd[1]: Started The Proxmox VE cluster filesystem.
Jun 01 07:34:49 vmhost3 pmxcfs[749]: [status] notice: update cluster info (cluster name  MIHAILAND, version = 9)
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [dcdb] notice: members: 3/749
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [dcdb] notice: all data is up to date
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [status] notice: members: 3/749
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [status] notice: all data is up to date
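
The cpg_initialize failed: 2 lines just mean pmxcfs started before corosync was reachable (2 is corosync's CS_ERR_LIBRARY), and the later "all data is up to date" only refers to this node's own copy, since it is alone in a membership of one. The startup ordering of the two services can be checked with:

Code:
journalctl -b -u corosync -u pve-cluster    # interleaved logs of both services since boot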

vmhost3 hosts file:

Code:
127.0.0.1 localhost.localdomain localhost
192.168.45.12 vmhost3.internal.mihailand.com vmhost3

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

vmhost3 hostname file:

Code:
vmhost3

The network is a flat network on 192.168.45.0/24, as can be seen from the corosync.conf above.

The network connections worked perfectly fine prior to this. As I mentioned, I was able to log into vmhost3's GUI to join the cluster. SSH still works, so I don't see how it could be a network problem.

I have a dumb switch connecting all 3 nodes.

I can ping from each host to each other host.
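
Ping only proves plain IP reachability, though; corosync talks over its own knet links (UDP port 5405 by default), so the link state corosync itself reports is more telling:

Code:
corosync-cfgtool -s    # per-link status as corosync sees it on this node
corosync-cfgtool -n    # currently known nodes and their link states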

All nodes are on very close versions:

vmhost3:

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.1-1
proxmox-backup-file-restore: 2.2.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

vmhost3 network interfaces (/etc/network/interfaces):

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.45.12/24
    gateway 192.168.45.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

I tried to update vmhost1, but it gets a timeout when processing triggers for pve-manager, so I think I messed something up there:

Code:
Processing triggers for pve-manager (7.2-4) ...
got timeout
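
My guess is the trigger hangs because pve-manager talks to the blocked pmxcfs, not because the package itself is broken; once the cluster is healthy again, the interrupted run should be finishable with the usual dpkg/apt recovery:

Code:
dpkg --configure -a    # finish any half-configured packages
apt-get -f install     # fix up remaining dependencies if needed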

Restarting vmhost3 did not help.

Is there a way to salvage this cluster? I would prefer not to have to re-create it.
 
Okay, I was able to remove vmhost3 from the cluster using the following steps:

Code:
systemctl stop pve-cluster corosync
pmxcfs -l
rm /etc/corosync/*
rm /etc/pve/corosync.conf
killall pmxcfs
systemctl start pve-cluster

The original cluster is now working again. I will try to remove all traces of vmhost3 from the original cluster so that I can hopefully re-add it.
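
For the cleanup on the remaining nodes, the documented route is pvecm delnode plus removing the stale node directory (run on vmhost1 or vmhost2 while they are quorate, not on vmhost3):

Code:
pvecm delnode vmhost3           # drop the node from the cluster configuration
rm -r /etc/pve/nodes/vmhost3    # remove its leftover configuration directory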
 
SOLVED: I did not choose the Cluster network when joining the cluster.

The join dialog probably should not let one continue without making that choice, but it's my fault.
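
For anyone hitting the same thing: the CLI join lets you pass the cluster network link explicitly, which avoids the ambiguity entirely (IP addresses from my setup above):

Code:
# run on vmhost3: join via vmhost1, binding link0 to the cluster network address
pvecm add 192.168.45.10 --link0 192.168.45.12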
 
