I had a nicely working 2-node cluster (vmhost1 and vmhost2).
I installed a 3rd node (vmhost3) and used the join information from the GUI to add it, but now the cluster is broken. Logs from the affected nodes are below.
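For reference, I did the join through the GUI with the copied join information; if I understand correctly, the CLI equivalent would have been roughly the following (the IP here is vmhost1's, I'm not 100% sure which node the join information actually pointed at):
Code:
# run on vmhost3; joins the existing cluster by contacting vmhost1
pvecm add 192.168.45.10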
vmhost2 syslog:
Code:
Jun 01 07:48:21 vmhost2 pvescheduler[2452666]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jun 01 07:48:21 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 50
Jun 01 07:48:22 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 60
Jun 01 07:48:23 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 70
Jun 01 07:48:24 vmhost2 corosync[1407]: [QUORUM] Sync members[2]: 1 2
Jun 01 07:48:24 vmhost2 corosync[1407]: [TOTEM ] A new membership (1.116d) was formed. Members
Jun 01 07:48:24 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 80
Jun 01 07:48:25 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 90
Jun 01 07:48:26 vmhost2 corosync[1407]: [TOTEM ] Token has not been received in 2250 ms
Jun 01 07:48:26 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 100
Jun 01 07:48:26 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retried 100 times
Jun 01 07:48:26 vmhost2 pmxcfs[1294]: [status] crit: cpg_send_message failed: 6
Jun 01 07:48:26 vmhost2 pve-firewall[1440]: firewall update time (20.056 seconds)
Jun 01 07:48:27 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 10
Jun 01 07:48:28 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 20
Jun 01 07:48:29 vmhost2 corosync[1407]: [TOTEM ] Token has not been received in 5252 ms
Jun 01 07:48:29 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 30
Jun 01 07:48:30 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 40
Jun 01 07:48:31 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 50
Jun 01 07:48:32 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 60
Jun 01 07:48:32 vmhost2 corosync[1407]: [QUORUM] Sync members[2]: 1 2
Jun 01 07:48:32 vmhost2 corosync[1407]: [TOTEM ] A new membership (1.1181) was formed. Members
Jun 01 07:48:33 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 70
Jun 01 07:48:34 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 80
Jun 01 07:48:35 vmhost2 corosync[1407]: [TOTEM ] Token has not been received in 2251 ms
Jun 01 07:48:35 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 90
Jun 01 07:48:36 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retry 100
Jun 01 07:48:36 vmhost2 pmxcfs[1294]: [status] notice: cpg_send_message retried 100 times
Jun 01 07:48:36 vmhost2 pmxcfs[1294]: [status] crit: cpg_send_message failed: 6
Jun 01 07:48:36 vmhost2 pvestatd[1439]: status update time (90.177 seconds)
vmhost3 syslog:
Code:
Jun 1 08:04:17 vmhost3 pveproxy[6064]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun 1 08:04:17 vmhost3 pveproxy[6065]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun 1 08:04:19 vmhost3 corosync[860]: [QUORUM] Sync members[1]: 3
Jun 1 08:04:19 vmhost3 corosync[860]: [TOTEM ] A new membership (3.19f1) was formed. Members
Jun 1 08:04:19 vmhost3 corosync[860]: [QUORUM] Members[1]: 3
Jun 1 08:04:19 vmhost3 corosync[860]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 1 08:04:22 vmhost3 pveproxy[6063]: worker exit
Jun 1 08:04:22 vmhost3 pveproxy[915]: worker 6063 finished
Jun 1 08:04:22 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun 1 08:04:22 vmhost3 pveproxy[915]: worker 6066 started
Jun 1 08:04:22 vmhost3 pveproxy[6066]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun 1 08:04:22 vmhost3 pveproxy[6064]: worker exit
Jun 1 08:04:22 vmhost3 pveproxy[6065]: worker exit
Jun 1 08:04:22 vmhost3 pveproxy[915]: worker 6064 finished
Jun 1 08:04:22 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun 1 08:04:22 vmhost3 pveproxy[915]: worker 6067 started
Jun 1 08:04:22 vmhost3 pveproxy[915]: worker 6065 finished
Jun 1 08:04:22 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun 1 08:04:22 vmhost3 pveproxy[915]: worker 6068 started
Jun 1 08:04:22 vmhost3 pveproxy[6067]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun 1 08:04:22 vmhost3 pveproxy[6068]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun 1 08:04:27 vmhost3 pveproxy[6066]: worker exit
Jun 1 08:04:27 vmhost3 pveproxy[915]: worker 6066 finished
Jun 1 08:04:27 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun 1 08:04:27 vmhost3 pveproxy[915]: worker 6088 started
Jun 1 08:04:27 vmhost3 pveproxy[6088]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun 1 08:04:27 vmhost3 pveproxy[6067]: worker exit
Jun 1 08:04:27 vmhost3 pveproxy[6068]: worker exit
Jun 1 08:04:27 vmhost3 pveproxy[915]: worker 6067 finished
Jun 1 08:04:27 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun 1 08:04:27 vmhost3 pveproxy[915]: worker 6089 started
Jun 1 08:04:27 vmhost3 pveproxy[915]: worker 6068 finished
Jun 1 08:04:27 vmhost3 pveproxy[915]: starting 1 worker(s)
Jun 1 08:04:27 vmhost3 pveproxy[915]: worker 6090 started
Jun 1 08:04:27 vmhost3 pveproxy[6089]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
Jun 1 08:04:27 vmhost3 pveproxy[6090]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1917.
On vmhost3, the folder /etc/pve/nodes is completely missing, which presumably also explains the pve-ssl.key errors above, since /etc/pve/local points into /etc/pve/nodes/vmhost3.
corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vmhost1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.45.10
  }
  node {
    name: vmhost2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.45.11
  }
  node {
    name: vmhost3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.45.12
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: MIHAILAND
  config_version: 9
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
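If it helps with diagnosis, I can gather the corosync/membership view from each node. These are the commands I would run to collect it (I have not pasted their output here):
Code:
# knet link status as corosync sees it
corosync-cfgtool -s
# quorum/membership view from the Proxmox side
pvecm status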
systemctl status -l pve-cluster on vmhost3:
Code:
pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2022-06-01 07:34:44 CDT; 31min ago
Process: 732 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 749 (pmxcfs)
Tasks: 6 (limit: 18974)
Memory: 32.3M
CPU: 928ms
CGroup: /system.slice/pve-cluster.service
└─749 /usr/bin/pmxcfs
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [dcdb] crit: cpg_initialize failed: 2
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [dcdb] crit: can't initialize service
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [status] crit: cpg_initialize failed: 2
Jun 01 07:34:43 vmhost3 pmxcfs[749]: [status] crit: can't initialize service
Jun 01 07:34:44 vmhost3 systemd[1]: Started The Proxmox VE cluster filesystem.
Jun 01 07:34:49 vmhost3 pmxcfs[749]: [status] notice: update cluster info (cluster name MIHAILAND, version = 9)
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [dcdb] notice: members: 3/749
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [dcdb] notice: all data is up to date
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [status] notice: members: 3/749
Jun 01 07:34:57 vmhost3 pmxcfs[749]: [status] notice: all data is up to date
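From the cpg_initialize failures above, my understanding is that pmxcfs came up on vmhost3 before corosync was ready, although it seems to have recovered locally afterwards. If restarting just the services in order is a sensible next step, this is what I would try (please correct me if the order is wrong):
Code:
# on vmhost3: restart corosync first, then the cluster filesystem
systemctl restart corosync
systemctl restart pve-cluster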
vmhost3 hosts file:
Code:
127.0.0.1 localhost.localdomain localhost
192.168.45.12 vmhost3.internal.mihailand.com vmhost3
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
vmhost3 hostname file:
Code:
vmhost3
The network is a flat network on 192.168.45.0/24, as can be seen from the corosync.conf above.
The network connections worked perfectly fine prior to this. As I mentioned, I was able to log into vmhost3's GUI to join the cluster, and SSH still works, so I don't see how it could be a network problem.
I have a dumb switch connecting all 3 nodes.
I can ping from each host to each other host.
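Beyond plain ping, I can also check whether corosync traffic itself is getting through. My assumption is that it uses UDP port 5405 on this link, so I would test along these lines:
Code:
# confirm corosync is listening on the expected UDP port (5405 by default, as far as I know)
ss -ulpn | grep corosync
# latency / packet loss on the corosync link between the nodes
ping -c 20 192.168.45.10
ping -c 20 192.168.45.11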
All nodes are running very similar versions:
vmhost3:
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.1-1
proxmox-backup-file-restore: 2.2.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
vmhost3 network interface:
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.45.12/24
    gateway 192.168.45.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
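In case it matters, I can also verify that the bridge and the physical NIC agree on link state, MTU and addressing, for example:
Code:
# quick link/MTU and address overview for the bridge and its port
ip -br link show eno1 vmbr0
ip -br addr show vmbr0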
I tried to update vmhost1, but it gets a timeout when processing triggers for pve-manager, so I think I made things worse there:
Code:
Processing triggers for pve-manager (7.2-4) ...
got timeout
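My assumption is that this timeout comes from the pve-manager trigger waiting on the hung cluster filesystem rather than from the package itself, so once the cluster is healthy again I would simply re-run the pending configuration step on vmhost1:
Code:
# re-run any half-finished package configuration on vmhost1
dpkg --configure -a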
Restarting vmhost3 did not help.
Is there a way to salvage this cluster? I would prefer not to have to re-create it.