I have a 5-node Proxmox cluster, co-located in a data center, with ~100 KVM guests. It has been running happily for more than a year.
The ISP needed to move the servers to another building (sigh).
Everything came back online, but two of the nodes, node2 and node5, are not connecting to the cluster and give this error in the syslog:
Code:
Cluster not quorate - extending auth key lifetime!
Every node has identical hardware, with separate Ethernet ports and switches for WAN, Corosync 1, Corosync 2, Migration, and Ceph. The Ceph cluster is healthy. I can ssh to every node, and every node can ssh to every other node on every interface (for example over Corosync 1 or over the main interface). Every node can also ping every other node on all interfaces and through all switches, so the cabling appears to be correct.
If I log in to the web GUI on node1, it shows node3 and node4 as connected. Those three seem to be happy together (green check marks next to them).
Nodes 2 and 5 both show red X's next to all other nodes and a green check mark for themselves. When I log in to those nodes' web interfaces directly, both say "Standalone node - no cluster defined" under Summary. But on the same node2 and node5, Datacenter -> Cluster shows "Number of nodes: 5" and lists all five nodes. So each one says "no cluster defined", yet still lists all the other nodes.
corosync.conf on all five nodes is identical (confirmed with sha1sum):
Code:
logging {
debug: off
to_syslog: yes
}
nodelist {
node {
name: nh1
nodeid: 1
quorum_votes: 1
ring0_addr: 10.22.22.1
ring1_addr: 10.33.33.1
}
node {
name: nh2
nodeid: 2
quorum_votes: 1
ring0_addr: 10.22.22.2
ring1_addr: 10.33.33.2
}
node {
name: nh3
nodeid: 3
quorum_votes: 1
ring0_addr: 10.22.22.3
ring1_addr: 10.33.33.3
}
node {
name: nh4
nodeid: 4
quorum_votes: 1
ring0_addr: 10.22.22.4
ring1_addr: 10.33.33.4
}
node {
name: nh5
nodeid: 5
quorum_votes: 1
ring0_addr: 10.22.22.5
ring1_addr: 10.33.33.5
}
}
quorum {
provider: corosync_votequorum
}
totem {
cluster_name: nh
config_version: 5
interface {
linknumber: 0
}
interface {
linknumber: 1
}
ip_version: ipv4-6
link_mode: passive
secauth: on
version: 2
}
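In case anyone wants to repeat the checksum comparison, here is a minimal Python sketch of it. It assumes the contents of each node's corosync.conf have already been copied to one machine and read as bytes. The function names `group_by_sha1` and `all_identical` are made up for illustration and are not part of any Proxmox tooling.

```python
import hashlib
from collections import defaultdict


def group_by_sha1(contents):
    """Map each SHA-1 hex digest to the sorted node names whose file produced it.

    `contents` is a dict of node name -> file content as bytes
    (how the files are fetched is left out of this sketch).
    """
    groups = defaultdict(list)
    for node, data in contents.items():
        groups[hashlib.sha1(data).hexdigest()].append(node)
    return {digest: sorted(nodes) for digest, nodes in groups.items()}


def all_identical(contents):
    """True only when every node's file hashes to the same digest."""
    return len(group_by_sha1(contents)) == 1
```

If `all_identical` returns False, the groups returned by `group_by_sha1` show which nodes share a version of the file and which ones differ.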
On both node1 and node2, I tried `pvecm expected 3`, but each time I got `Unable to set expected votes: CS_ERR_INVALID_PARAM`.
pvecm status is the same on all nodes (except where they say "local"):
Code:
root@nh2:~# pvecm status
Cluster information
-------------------
Name: nh
Config Version: 5
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed May 20 21:25:40 2026
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000002
Ring ID: 1.2a9
Quorate: Yes
Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.22.22.1
0x00000002 1 10.22.22.2 (local)
0x00000003 1 10.22.22.3
0x00000004 1 10.22.22.4
0x00000005 1 10.22.22.5
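The vote numbers in this output are at least self-consistent: with 5 expected votes, a strict majority is 5 // 2 + 1 = 3, which matches the reported "Quorum: 3". Below is a small Python sketch of that check. It assumes default votequorum settings (no two_node or last_man_standing), and the helper names `quorum_threshold`, `parse_status`, and `check_votes` are hypothetical.

```python
def quorum_threshold(expected_votes):
    """Votes needed for quorum: a strict majority of the expected votes."""
    return expected_votes // 2 + 1


def parse_status(text):
    """Collect 'Key: value' pairs from pasted `pvecm status` output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def check_votes(fields):
    """Compare the reported quorum with the majority rule and the vote total."""
    expected = int(fields["Expected votes"])
    total = int(fields["Total votes"])
    # Take only the leading number in case extra text follows it.
    quorum = int(fields["Quorum"].split()[0])
    return {
        "threshold_matches": quorum == quorum_threshold(expected),
        "has_quorum": total >= quorum,
    }
```

For the output above, `quorum_threshold(5)` is 3, and both checks come back True. That suggests corosync itself considers the cluster quorate on node2, even though the GUI disagrees.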
I tried rebooting the broken nodes, node2 and node5, but that didn't help.
All nodes have identical versions of all software installed:
Code:
root@nh2:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-14-pve)
pve-manager: 8.4.12 (running version: 8.4.12/c2ea8261d32a5020)
proxmox-kernel-helper: 8.1.4
proxmox-kernel-6.8.12-14-pve-signed: 6.8.12-14
proxmox-kernel-6.8: 6.8.12-14
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
amd64-microcode: 3.20240820.1~deb12u1
ceph: 18.2.7-pve1
ceph-fuse: 18.2.7-pve1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.2
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.2
libpve-cluster-perl: 8.1.2
libpve-common-perl: 8.3.4
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.7
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.6-1
proxmox-backup-file-restore: 3.4.6-1
proxmox-backup-restore-image: 0.7.0
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.4
proxmox-mail-forward: 0.3.3
proxmox-mini-journalreader: 1.5
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.13
pve-cluster: 8.1.2
pve-container: 5.3.0
pve-docs: 8.4.1
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.2
pve-firmware: 3.16-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.5
pve-qemu-kvm: 9.2.0-7
pve-xtermjs: 5.5.0-2
qemu-server: 8.4.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.8-pve1
Status on node2 (a bad one):
Code:
pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Wed 2026-05-20 21:01:53 MDT; 31min ago
Main PID: 3183 (pmxcfs)
Tasks: 6 (limit: 309003)
Memory: 52.4M
CPU: 1.650s
CGroup: /system.slice/pve-cluster.service
└─3183 /usr/bin/pmxcfs
Jun 20 21:11:00 nh2 pmxcfs[3183]: [confdb] crit: cmap_initialize failed: 2
Jun 20 21:11:00 nh2 pmxcfs[3183]: [confdb] crit: can't initialize service
Jun 20 21:11:00 nh2 pmxcfs[3183]: [dcdb] crit: cpg_initialize failed: 2
Jun 20 21:11:00 nh2 pmxcfs[3183]: [dcdb] crit: can't initialize service
Jun 20 21:11:00 nh2 pmxcfs[3183]: [status] crit: cpg_initialize failed: 2
Jun 20 21:11:00 nh2 pmxcfs[3183]: [status] crit: can't initialize service
May 20 21:01:53 nh2 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
May 20 21:03:16 nh2 pmxcfs[3183]: [main] notice: ignore insert of duplicate cluster log
May 20 21:03:33 nh2 pmxcfs[3183]: [main] notice: ignore insert of duplicate cluster log
May 20 21:18:34 nh2 pmxcfs[3183]: [main] notice: ignore insert of duplicate cluster log
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Wed 2026-05-20 21:01:53 MDT; 31min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3257 (corosync)
Tasks: 9 (limit: 309003)
Memory: 125.0M
CPU: 12.723s
CGroup: /system.slice/corosync.service
└─3257 /usr/sbin/corosync -f
May 20 21:01:56 nh2 corosync[3257]: [MAIN ] Completed service synchronization, ready to provide service.
May 20 21:01:56 nh2 corosync[3257]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]: [KNET ] pmtud: PMTUD link change for host: 5 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]: [KNET ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]: [KNET ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]: [KNET ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]: [KNET ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]: [KNET ] pmtud: Global data MTU changed to: 1397
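To compare a bad node with a good one, it helps to pull out only the crit-level pmxcfs messages from output like the above. This is a rough Python sketch, assuming log lines in the same layout as shown here. `critical_lines` is a hypothetical helper, not an existing tool.

```python
import re

# Matches e.g. "... pmxcfs[3183]: [confdb] crit: cmap_initialize failed: 2"
LOG_RE = re.compile(
    r"(?P<unit>[\w-]+)\[\d+\]: \[(?P<subsys>[^\]]+)\] (?P<level>\w+): (?P<msg>.*)"
)


def critical_lines(log_text, unit="pmxcfs"):
    """Return (subsystem, message) pairs for crit-level lines logged by `unit`."""
    found = []
    for line in log_text.splitlines():
        match = LOG_RE.search(line)
        if match and match["unit"] == unit and match["level"] == "crit":
            found.append((match["subsys"].strip(), match["msg"]))
    return found
```

Run against the node2 status text, this keeps only the `cmap_initialize`, `cpg_initialize`, and "can't initialize service" entries and drops the notice lines and the corosync KNET lines.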
Status on node1 (a good one):
Code:
pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Wed 2026-05-20 17:26:23 MDT; 4h 6min ago
Main PID: 3198 (pmxcfs)
Tasks: 8 (limit: 309001)
Memory: 67.2M
CPU: 26.807s
CGroup: /system.slice/pve-cluster.service
└─3198 /usr/bin/pmxcfs
May 20 21:21:17 nh1 pmxcfs[3198]: [status] notice: received log
May 20 21:24:00 nh1 pmxcfs[3198]: [status] notice: received log
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-q26pgV/qb): Unknown error -1 (-1)
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-EsdRlo/qb): Unknown error -1 (-1)
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-MzH56K/qb): Unknown error -1 (-1)
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-GFSi3H/qb): Unknown error -1 (-1)
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Wed 2026-05-20 17:26:23 MDT; 4h 5min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3264 (corosync)
Tasks: 9 (limit: 309001)
Memory: 141.6M
CPU: 1min 49.356s
CGroup: /system.slice/corosync.service
└─3264 /usr/sbin/corosync -f
May 20 21:01:56 nh1 corosync[3264]: [QUORUM] Sync members[5]: 1 2 3 4 5
May 20 21:01:56 nh1 corosync[3264]: [QUORUM] Sync joined[1]: 2
May 20 21:01:56 nh1 corosync[3264]: [TOTEM ] A new membership (1.2a9) was formed. Members joined: 2
May 20 21:01:56 nh1 corosync[3264]: [QUORUM] Members[5]: 1 2 3 4 5
May 20 21:01:56 nh1 corosync[3264]: [MAIN ] Completed service synchronization, ready to provide service.
May 20 21:01:56 nh1 corosync[3264]: [KNET ] pmtud: Global data MTU changed to: 1397
May 20 21:01:56 nh1 corosync[3264]: [KNET ] rx: host: 2 link: 1 is up
May 20 21:01:56 nh1 corosync[3264]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
May 20 21:01:56 nh1 corosync[3264]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 20 21:01:56 nh1 corosync[3264]: [KNET ] pmtud: Global data MTU changed to: 1397
The following may be a red herring (I probably ran a command earlier as my regular user, UID 1000, rather than with sudo):
Code:
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
Any hints or advice most welcome.
Thanks,
-Jeff