Cluster not quorate - extending auth key lifetime!

jebbam

I have a 5-node Proxmox cluster with ~100 KVMs, co-located in a data center, that has been running happily for the last year+.

The ISP needed to move the servers to another building (sigh).

Everything came back online, but two of the nodes, node2 and node5, are not connecting to the cluster and give this error in the syslog:

Code:
Cluster not quorate - extending auth key lifetime!


Each of the nodes has identical hardware. They have separate ethernet jacks/switches for WAN, Corosync 1, Corosync 2, Migration, and Ceph. The Ceph cluster is healthy. I can ssh to every node, and every node can ssh to the other nodes on every interface (e.g. via Corosync 1 or the main interface). Every node can ping every other node via all interfaces and all switches. So it appears everything is fine with the way the network is plugged in.

If I log in to the web GUI on node1, it shows it connects fine with node3 and node4. Those three seem to be happy together (i.e. green check mark next to them).

On nodes 2 and 5, they both show red X's next to all other nodes, and green check marks for themselves. Logging into those nodes' web interfaces directly, they both say under Summary "Standalone node - no cluster defined". But on those same nodes, node2 and node5, if I go to Datacenter -> Cluster, it shows "Number of nodes: 5" and lists all five nodes. So it says "no cluster defined", yet it does list all the other nodes.

corosync.conf on all five nodes is identical (confirmed with sha1sum):
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: nh1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.22.22.1
    ring1_addr: 10.33.33.1
  }
  node {
    name: nh2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.22.22.2
    ring1_addr: 10.33.33.2
  }
  node {
    name: nh3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.22.22.3
    ring1_addr: 10.33.33.3
  }
  node {
    name: nh4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.22.22.4
    ring1_addr: 10.33.33.4
  }
  node {
    name: nh5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.22.22.5
    ring1_addr: 10.33.33.5
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: nh
  config_version: 5
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
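
For anyone repeating the checksum comparison mentioned above, here is a minimal sketch of how it could be scripted. It assumes root SSH between the nodes and the usual path /etc/corosync/corosync.conf; the helper name `all_same` is made up for illustration.

```shell
#!/bin/sh
# Reads "HASH  NAME" lines (sha1sum output) on stdin.
# Succeeds only if exactly one distinct hash was seen.
all_same() {
    awk '{ seen[$1] = 1 }
         END { n = 0; for (h in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Hypothetical usage, run from any node (node names taken from corosync.conf):
#   for n in nh1 nh2 nh3 nh4 nh5; do
#       ssh "root@$n" sha1sum /etc/corosync/corosync.conf
#   done | all_same && echo "identical" || echo "MISMATCH"
```

An empty input counts as a mismatch, so a failed SSH loop does not report "identical" by accident.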

On node1 and node2, I tried `pvecm expected 3` but for both I got `Unable to set expected votes: CS_ERR_INVALID_PARAM`.

pvecm status is the same on all nodes (except where they say "local"):
Code:
root@nh2:~# pvecm status
Cluster information
-------------------
Name:             nh
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 20 21:25:40 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000002
Ring ID:          1.2a9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.22.22.1
0x00000002          1 10.22.22.2 (local)
0x00000003          1 10.22.22.3
0x00000004          1 10.22.22.4
0x00000005          1 10.22.22.5
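
As a side note on the numbers in this output: "Quorum: 3" is what a simple majority of 5 expected votes works out to. A sketch of that arithmetic, assuming one vote per node and no special votequorum options such as two_node or last_man_standing (the function name is hypothetical):

```shell
#!/bin/sh
# Simple majority: more than half of the expected votes.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

quorum_for 5   # prints 3
```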

I tried rebooting the broken nodes, node2 and node5, but that didn't help.

All nodes have the identical versions of all software installed:
Code:
root@nh2:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-14-pve)
pve-manager: 8.4.12 (running version: 8.4.12/c2ea8261d32a5020)
proxmox-kernel-helper: 8.1.4
proxmox-kernel-6.8.12-14-pve-signed: 6.8.12-14
proxmox-kernel-6.8: 6.8.12-14
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
amd64-microcode: 3.20240820.1~deb12u1
ceph: 18.2.7-pve1
ceph-fuse: 18.2.7-pve1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.2
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.2
libpve-cluster-perl: 8.1.2
libpve-common-perl: 8.3.4
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.7
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.6-1
proxmox-backup-file-restore: 3.4.6-1
proxmox-backup-restore-image: 0.7.0
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.4
proxmox-mail-forward: 0.3.3
proxmox-mini-journalreader: 1.5
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.13
pve-cluster: 8.1.2
pve-container: 5.3.0
pve-docs: 8.4.1
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.2
pve-firmware: 3.16-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.5
pve-qemu-kvm: 9.2.0-7
pve-xtermjs: 5.5.0-2
qemu-server: 8.4.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.8-pve1

Status node2 (bad one):
Code:
pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-05-20 21:01:53 MDT; 31min ago
   Main PID: 3183 (pmxcfs)
      Tasks: 6 (limit: 309003)
     Memory: 52.4M
        CPU: 1.650s
     CGroup: /system.slice/pve-cluster.service
             └─3183 /usr/bin/pmxcfs

Jun 20 21:11:00 nh2 pmxcfs[3183]: [confdb] crit: cmap_initialize failed: 2
Jun 20 21:11:00 nh2 pmxcfs[3183]: [confdb] crit: can't initialize service
Jun 20 21:11:00 nh2 pmxcfs[3183]: [dcdb] crit: cpg_initialize failed: 2
Jun 20 21:11:00 nh2 pmxcfs[3183]: [dcdb] crit: can't initialize service
Jun 20 21:11:00 nh2 pmxcfs[3183]: [status] crit: cpg_initialize failed: 2
Jun 20 21:11:00 nh2 pmxcfs[3183]: [status] crit: can't initialize service
May 20 21:01:53 nh2 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
May 20 21:03:16 nh2 pmxcfs[3183]: [main] notice: ignore insert of duplicate cluster log
May 20 21:03:33 nh2 pmxcfs[3183]: [main] notice: ignore insert of duplicate cluster log
May 20 21:18:34 nh2 pmxcfs[3183]: [main] notice: ignore insert of duplicate cluster log

● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-05-20 21:01:53 MDT; 31min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 3257 (corosync)
      Tasks: 9 (limit: 309003)
     Memory: 125.0M
        CPU: 12.723s
     CGroup: /system.slice/corosync.service
             └─3257 /usr/sbin/corosync -f

May 20 21:01:56 nh2 corosync[3257]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: Global data MTU changed to: 1397

Status node1 (good one):
Code:
pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-05-20 17:26:23 MDT; 4h 6min ago
   Main PID: 3198 (pmxcfs)
      Tasks: 8 (limit: 309001)
     Memory: 67.2M
        CPU: 26.807s
     CGroup: /system.slice/pve-cluster.service
             └─3198 /usr/bin/pmxcfs

May 20 21:21:17 nh1 pmxcfs[3198]: [status] notice: received log
May 20 21:24:00 nh1 pmxcfs[3198]: [status] notice: received log
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-q26pgV/qb): Unknown error -1 (-1)
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-EsdRlo/qb): Unknown error -1 (-1)
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-MzH56K/qb): Unknown error -1 (-1)
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-GFSi3H/qb): Unknown error -1 (-1)

● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-05-20 17:26:23 MDT; 4h 5min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 3264 (corosync)
      Tasks: 9 (limit: 309001)
     Memory: 141.6M
        CPU: 1min 49.356s
     CGroup: /system.slice/corosync.service
             └─3264 /usr/sbin/corosync -f

May 20 21:01:56 nh1 corosync[3264]:   [QUORUM] Sync members[5]: 1 2 3 4 5
May 20 21:01:56 nh1 corosync[3264]:   [QUORUM] Sync joined[1]: 2
May 20 21:01:56 nh1 corosync[3264]:   [TOTEM ] A new membership (1.2a9) was formed. Members joined: 2
May 20 21:01:56 nh1 corosync[3264]:   [QUORUM] Members[5]: 1 2 3 4 5
May 20 21:01:56 nh1 corosync[3264]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] rx: host: 2 link: 1 is up
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] link: Resetting MTU for link 1 because host 2 joined
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] pmtud: Global data MTU changed to: 1397

This may be a bit of a red herring (perhaps I earlier ran a command as my regular user, UID 1000, rather than with sudo):
Code:
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected

Any hints or advice most welcome.

Thanks,

-Jeff
 
please provide the journal of all 5 nodes covering the bootup, and the full journal for the corosync and pve-cluster units on all 5 nodes for the same boot.
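
One possible way to collect what is requested here, as a sketch to run as root on each node. The output file names are invented for illustration, and the files are gzipped because full boot journals can be large.

```shell
#!/bin/sh
# Hypothetical naming scheme: journal-<node>-<kind>.txt
journal_file() {
    printf 'journal-%s-%s.txt\n' "$1" "$2"
}

# Dump the current boot's full journal, plus only the corosync and
# pve-cluster units for the same boot, then compress both files.
collect_journals() {
    node="$(hostname)"
    boot="$(journal_file "$node" boot)"
    units="$(journal_file "$node" cluster)"
    journalctl -b --no-pager > "$boot"
    journalctl -b --no-pager -u corosync -u pve-cluster > "$units"
    gzip -f "$boot" "$units"
}

# collect_journals   # uncomment to run on a node
```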
 
please provide the journal of all 5 nodes covering the bootup, and the full journal for the corosync and pve-cluster units on all 5 nodes for the same boot.
See attached.

The node 2 and node 5 boot logs were "too large for the server to process" on upload. I truncated those as they just repeat the same things over and over, so they are small enough to upload.

Thanks for your consideration. :)
 


could you double check and post your network configuration/setup, including the switch config? in particular of the two "problematic" nodes? this looks like a network misconfiguration problem, though the logs don't give a clear indication *what* is going wrong..

you could also try the following:
- start all nodes
- stop PVE services and corosync on nodes 2 and 5
- nodes 1, 3, 4 should work correctly now, as far as I understand
- start corosync on node 2
- check with "corosync-quorumtool -s" that corosync is happy on all 4 nodes where it is running
- start "pve-cluster" on node 2
- post the logs up to that point of the 4 nodes
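
The steps above could be sketched roughly as follows. This is only an illustration: which units count as "PVE services" is an assumption here, and the script defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# DRY_RUN=1 (default) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On nodes 2 and 5: stop the PVE services (assumed list), then pmxcfs, then corosync.
stop_cluster_stack() {
    run systemctl stop pve-ha-lrm pve-ha-crm pvestatd pvedaemon pveproxy
    run systemctl stop pve-cluster
    run systemctl stop corosync
}

# On node 2 only: bring corosync back and look at the quorum state.
start_corosync_and_check() {
    run systemctl start corosync
    run corosync-quorumtool -s
}

# On node 2 only: start pmxcfs, then dump the logs for this boot.
start_pmxcfs_and_collect() {
    run systemctl start pve-cluster
    run journalctl -b --no-pager -u corosync -u pve-cluster
}

# Example: stop_cluster_stack
```

`corosync-quorumtool -s` would also need to be run on nodes 1, 3 and 4 to confirm all four agree on the membership.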
 
The network configs were all last written in January 2025, so none of them have been touched in over a year. They are all the exact same size. So if it is a network issue, it could be a flaky switch or something like that (?). Maybe MTU? But this network config has run perfectly fine for a year+. They all follow this pattern (identical hardware):
Code:
/etc/network/interfaces.d/*

auto lo
iface lo inet loopback

auto enp129s0f0np0
iface enp129s0f0np0 inet manual
    dns-nameservers 10.10.10.251
    dns-search libre.is
#Public Interface

auto eno1np0
iface eno1np0 inet static
    address 10.22.22.1/24
#Corosync 1

auto eno2np1
iface eno2np1 inet static
    address 10.33.33.1/24
#Corosync 2

auto enp129s0f1np1
iface enp129s0f1np1 inet manual
#Migrate Interface

iface enp129s0f2np2 inet manual

iface enp129s0f3np3 inet manual

auto enp197s0np0
iface enp197s0np0 inet static
    address 10.99.99.1/24
#Ceph

iface enxbe3af2b6059f inet manual

auto vmbr0
iface vmbr0 inet static
    address 70.39.73.131/25
    gateway 70.39.73.254
    bridge-ports enp129s0f0np0
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet static
    address 10.68.68.1/24
    bridge-ports enp129s0f1np1
    bridge-stp off
    bridge-fd 0
#Migrate Bridge
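
Since MTU was floated above as a possible culprit: plain ping and ssh use small packets, so they can succeed even when full-size frames are dropped somewhere on the path. A minimal sketch for probing both corosync networks with non-fragmentable pings follows. The addresses come from corosync.conf; the MTU of 1500 in the usage line is an assumption, and the function names are made up.

```shell
#!/bin/sh
# ICMP payload that exactly fills an IPv4 packet of the given MTU
# (20-byte IP header + 8-byte ICMP header).
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Send one ping with the don't-fragment bit set; fails if the path
# cannot carry a packet of the given MTU.
probe_mtu() {   # usage: probe_mtu HOST MTU
    ping -c 1 -W 1 -M do -s "$(icmp_payload "$2")" "$1" >/dev/null 2>&1
}

# Probe every node on both corosync networks.
probe_all() {   # usage: probe_all MTU
    for net in 10.22.22 10.33.33; do
        for i in 1 2 3 4 5; do
            if probe_mtu "$net.$i" "$1"; then
                echo "$net.$i ok"
            else
                echo "$net.$i FAILED"
            fi
        done
    done
}

# probe_all 1500   # assumed MTU; use the value configured on the links
```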

- start all nodes

Ok, they have all been running for 24 hours+.

- stop PVE services and corosync on nodes 2 and 5

OK.

Code:
systemctl stop corosync.service pvebanner.service pve-cluster.service pvedaemon.service pve-daily-update.timer pve-firewall-commit.service pve-firewall.service pvefw-logger.service pve-ha-crm.service pve-ha-lrm.service pve-lxc-syscalld.service pvenetcommit.service pveproxy.service pve-query-machine-capabilities.service pve-sdn-commit.service pvestatd.service pve-storage.target

- nodes 1, 3, 4 should work correctly now, as far as I understand

Nodes 1, 3, 4 all have green check marks next to them when I log into any of them (i.e. they are all happy with one another). Ceph is working with all 5 nodes.

- start corosync on node 2

Code:
systemctl start corosync.service

Code:
root@nh2:~# systemctl status corosync.service 
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Fri 2026-05-22 09:19:58 MDT; 20s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 571688 (corosync)
      Tasks: 9 (limit: 309003)
     Memory: 121.0M
        CPU: 219ms
     CGroup: /system.slice/corosync.service
             └─571688 /usr/sbin/corosync -f

May 22 09:20:01 nh2 corosync[571688]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 1 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: Global data MTU changed to: 1397

- check with "corosync-quorumtool -s" that corosync is happy on all 4 nodes where it is running

Code:
root@nh2:~# corosync-quorumtool -s
Quorum information
------------------
Date:             Fri May 22 09:21:24 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          2
Ring ID:          1.2b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 nh1
         2          1 nh2 (local)
         3          1 nh3
         4          1 nh4
         5          1 nh5

Code:
root@nh1:~# corosync-quorumtool -s
Quorum information
------------------
Date:             Fri May 22 09:22:00 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          1
Ring ID:          1.2b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 nh1 (local)
         2          1 nh2
         3          1 nh3
         4          1 nh4
         5          1 nh5

Code:
root@nh3:~# corosync-quorumtool -s
Quorum information
------------------
Date:             Fri May 22 09:22:26 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          3
Ring ID:          1.2b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 nh1
         2          1 nh2
         3          1 nh3 (local)
         4          1 nh4
         5          1 nh5

Code:
root@nh4:~# corosync-quorumtool -s
Quorum information
------------------
Date:             Fri May 22 09:22:55 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          4
Ring ID:          1.2b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 nh1
         2          1 nh2
         3          1 nh3
         4          1 nh4 (local)
         5          1 nh5

- start "pve-cluster" on node 2

Code:
root@nh2:~# systemctl start pve-cluster

Code:
root@nh2:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Fri 2026-05-22 09:24:37 MDT; 12s ago
    Process: 571930 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 571931 (pmxcfs)
      Tasks: 5 (limit: 309003)
     Memory: 7.9M
        CPU: 50ms
     CGroup: /system.slice/pve-cluster.service
             └─571931 /usr/bin/pmxcfs

May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: received all states
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: leader is 1/3198
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: synced members: 1/3198, 3/3171, 4/3149
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: waiting for updates from leader
May 22 09:24:36 nh2 pmxcfs[571931]: [status] notice: received sync request (epoch 1/3198/00000002)
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: update complete - trying to commit (got 7 inode updates)
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: all data is up to date
May 22 09:24:36 nh2 pmxcfs[571931]: [status] notice: received all states
May 22 09:24:36 nh2 pmxcfs[571931]: [status] notice: all data is up to date
May 22 09:24:37 nh2 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.

Now, when logged into the nh1 web GUI, instead of a red X next to nh2, it has a grey "?".

See attached logs.

Thanks again!

-Jeff
 


that looks good so far, the question mark is probably because pvestatd is not running yet on that node. could you try starting it and see if it goes "green" then? if it does, you can start the other services as well on that node.

please then try starting pve-cluster on the last remaining node and post the logs once more.