Cluster not quorate - extending auth key lifetime!

jebbam · May 21, 2026

I have a 5-node Proxmox cluster co-located in a data center with ~100 KVMs that has been running happily for the last year+.

The ISP needed to move the servers to another building (sigh).

Everything came back online, but two of the nodes, node2 and node5, are not connecting to the cluster and give this error in the syslog:

Code:

Cluster not quorate - extending auth key lifetime!

Each of the nodes has identical hardware. They have separate ethernet jacks/switches for WAN, Corosync 1, Corosync 2, Migration, and Ceph. The Ceph cluster is healthy. I can ssh to every node, and every node can ssh to the other nodes on every interface (e.g. via Corosync 1 or the main interface). Every node can ping every other node via all interfaces and all switches. So it appears everything is fine with the way the network is plugged in.

If I log in to the web GUI on node1, it shows it connects fine with node3 and node4. Those three seem to be happy together (i.e. green check marks next to them).

Nodes 2 and 5 both show red X's next to all other nodes, and green check marks for themselves. Logging into those nodes' web interfaces directly, they both say under Summary "Standalone node - no cluster defined". But on those same nodes, if I go to Datacenter -> Cluster, it shows "Number of nodes: 5" and lists all five nodes. So it says "no cluster defined" yet it does list all the other nodes.

corosync.conf on all five nodes is identical (confirmed with sha1sum):

Code:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: nh1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.22.22.1
    ring1_addr: 10.33.33.1
  }
  node {
    name: nh2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.22.22.2
    ring1_addr: 10.33.33.2
  }
  node {
    name: nh3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.22.22.3
    ring1_addr: 10.33.33.3
  }
  node {
    name: nh4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.22.22.4
    ring1_addr: 10.33.33.4
  }
  node {
    name: nh5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.22.22.5
    ring1_addr: 10.33.33.5
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: nh
  config_version: 5
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
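
A minimal sketch for re-checking the Corosync links after a move like this one: the nodelist above can be flattened into a list of ring addresses to test. The function name `ring_addrs` is hypothetical (it is not a Proxmox or Corosync tool), and it assumes the one-entry-per-line `ringN_addr: <address>` layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print every ringN_addr value found in a corosync.conf.
# Assumes one "ringN_addr: <address>" entry per line, as in the config above.
ring_addrs() {
    awk '$1 ~ /^ring[0-9]+_addr:$/ { print $2 }' "$1"
}

# Example use on a node:
#   for addr in $(ring_addrs /etc/corosync/corosync.conf); do
#       ping -c 1 -W 1 "$addr" >/dev/null && echo "$addr ok" || echo "$addr FAILED"
#   done
```

Note that a successful ping only confirms IP reachability on each link; it does not by itself show that Corosync traffic is passing.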

On node1 and node2, I tried `pvecm expected 3` but for both I got `Unable to set expected votes: CS_ERR_INVALID_PARAM`.

pvecm status is the same on all nodes (except where they say "local"):

Code:

root@nh2:~# pvecm status
Cluster information
-------------------
Name:             nh
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 20 21:25:40 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000002
Ring ID:          1.2a9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.22.22.1
0x00000002          1 10.22.22.2 (local)
0x00000003          1 10.22.22.3
0x00000004          1 10.22.22.4
0x00000005          1 10.22.22.5
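
For reference, the "Expected votes: 5" and "Quorum: 3" figures above are consistent with a simple majority rule. A minimal sketch of that arithmetic, assuming one vote per node and no special votequorum options (such as two_node or a QDevice); `quorum_needed` is a hypothetical name used only for illustration:

```shell
#!/bin/sh
# Majority quorum for a given number of expected votes: floor(votes / 2) + 1.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

quorum_needed 5   # prints 3, matching "Quorum: 3" in the output above
```

So with all five nodes listed as members, Corosync itself considers the cluster quorate, which is what makes the "not quorate" message on node2 and node5 surprising.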

I tried rebooting the broken nodes, node2 and node5, but that didn't help.

All nodes have the identical versions of all software installed:

Code:

root@nh2:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-14-pve)
pve-manager: 8.4.12 (running version: 8.4.12/c2ea8261d32a5020)
proxmox-kernel-helper: 8.1.4
proxmox-kernel-6.8.12-14-pve-signed: 6.8.12-14
proxmox-kernel-6.8: 6.8.12-14
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
amd64-microcode: 3.20240820.1~deb12u1
ceph: 18.2.7-pve1
ceph-fuse: 18.2.7-pve1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.2
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.2
libpve-cluster-perl: 8.1.2
libpve-common-perl: 8.3.4
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.7
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.6-1
proxmox-backup-file-restore: 3.4.6-1
proxmox-backup-restore-image: 0.7.0
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.4
proxmox-mail-forward: 0.3.3
proxmox-mini-journalreader: 1.5
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.13
pve-cluster: 8.1.2
pve-container: 5.3.0
pve-docs: 8.4.1
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.2
pve-firmware: 3.16-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.5
pve-qemu-kvm: 9.2.0-7
pve-xtermjs: 5.5.0-2
qemu-server: 8.4.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.8-pve1

Status node2 (bad one):

Code:

pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-05-20 21:01:53 MDT; 31min ago
   Main PID: 3183 (pmxcfs)
      Tasks: 6 (limit: 309003)
     Memory: 52.4M
        CPU: 1.650s
     CGroup: /system.slice/pve-cluster.service
             └─3183 /usr/bin/pmxcfs

May 20 21:11:00 nh2 pmxcfs[3183]: [confdb] crit: cmap_initialize failed: 2
May 20 21:11:00 nh2 pmxcfs[3183]: [confdb] crit: can't initialize service
May 20 21:11:00 nh2 pmxcfs[3183]: [dcdb] crit: cpg_initialize failed: 2
May 20 21:11:00 nh2 pmxcfs[3183]: [dcdb] crit: can't initialize service
May 20 21:11:00 nh2 pmxcfs[3183]: [status] crit: cpg_initialize failed: 2
May 20 21:11:00 nh2 pmxcfs[3183]: [status] crit: can't initialize service
May 20 21:01:53 nh2 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
May 20 21:03:16 nh2 pmxcfs[3183]: [main] notice: ignore insert of duplicate cluster log
May 20 21:03:33 nh2 pmxcfs[3183]: [main] notice: ignore insert of duplicate cluster log
May 20 21:18:34 nh2 pmxcfs[3183]: [main] notice: ignore insert of duplicate cluster log

● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-05-20 21:01:53 MDT; 31min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 3257 (corosync)
      Tasks: 9 (limit: 309003)
     Memory: 125.0M
        CPU: 12.723s
     CGroup: /system.slice/corosync.service
             └─3257 /usr/sbin/corosync -f

May 20 21:01:56 nh2 corosync[3257]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
May 20 21:01:56 nh2 corosync[3257]:   [KNET  ] pmtud: Global data MTU changed to: 1397

Status node1 (good one):

Code:

pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-05-20 17:26:23 MDT; 4h 6min ago
   Main PID: 3198 (pmxcfs)
      Tasks: 8 (limit: 309001)
     Memory: 67.2M
        CPU: 26.807s
     CGroup: /system.slice/pve-cluster.service
             └─3198 /usr/bin/pmxcfs

May 20 21:21:17 nh1 pmxcfs[3198]: [status] notice: received log
May 20 21:24:00 nh1 pmxcfs[3198]: [status] notice: received log
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-q26pgV/qb): Unknown error -1 (-1)
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-EsdRlo/qb): Unknown error -1 (-1)
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-MzH56K/qb): Unknown error -1 (-1)
May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected
May 20 21:24:30 nh1 pmxcfs[3198]: [libqb] error: Error in connection setup (/dev/shm/qb-3198-101768-34-GFSi3H/qb): Unknown error -1 (-1)

● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-05-20 17:26:23 MDT; 4h 5min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 3264 (corosync)
      Tasks: 9 (limit: 309001)
     Memory: 141.6M
        CPU: 1min 49.356s
     CGroup: /system.slice/corosync.service
             └─3264 /usr/sbin/corosync -f

May 20 21:01:56 nh1 corosync[3264]:   [QUORUM] Sync members[5]: 1 2 3 4 5
May 20 21:01:56 nh1 corosync[3264]:   [QUORUM] Sync joined[1]: 2
May 20 21:01:56 nh1 corosync[3264]:   [TOTEM ] A new membership (1.2a9) was formed. Members joined: 2
May 20 21:01:56 nh1 corosync[3264]:   [QUORUM] Members[5]: 1 2 3 4 5
May 20 21:01:56 nh1 corosync[3264]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] rx: host: 2 link: 1 is up
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] link: Resetting MTU for link 1 because host 2 joined
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 20 21:01:56 nh1 corosync[3264]:   [KNET  ] pmtud: Global data MTU changed to: 1397

This may be a bit of a red herring (perhaps I earlier ran the command as my regular user as UID 1000 and not with sudo):

Code:

May 20 21:24:30 nh1 pmxcfs[3198]: [ipcs] crit: connection from bad user 1000! - rejected

Any hints or advice most welcome.

Thanks,

-Jeff

fabian · May 21, 2026

please provide the journal of all 5 nodes covering the bootup, and the full journal for the corosync and pve-cluster units on all 5 nodes for the same boot.

jebbam · May 21, 2026

fabian said:
please provide the journal of all 5 nodes covering the bootup, and the full journal for the corosync and pve-cluster units on all 5 nodes for the same boot.

See attached.

The node 2 and node 5 boot logs were "too large for the server to process" on upload. I truncated those as they just repeat the same things over and over, so they are small enough to upload.

Thanks for your consideration.

fabian · May 22, 2026

could you double check and post your network configuration/setup, including the switch config? in particular of the two "problematic" nodes? this looks like a network misconfiguration problem, though the logs don't give a clear indication *what* is going wrong..

you could also try the following:
- start all nodes
- stop PVE services and corosync on nodes 2 and 5
- nodes 1, 3, 4 should work correctly now, as far as I understand
- start corosync on node 2
- check with "corosync-quorumtool -s" that corosync is happy on all 4 nodes where it is running
- start "pve-cluster" on node 2
- post the logs up to that point of the 4 nodes
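
The staged start above could be sketched as a small script for one problematic node. This is only an illustration under assumptions: the `run` and `staged_start` names are hypothetical, the list of PVE units to stop is abbreviated, and the script defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the staged restart described above, for one problematic node.
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to execute.
# The list of PVE units to stop is abbreviated and is an assumption.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

staged_start() {
    # 1. Stop the PVE services and corosync on the affected node.
    run systemctl stop pveproxy pvedaemon pvestatd pve-cluster corosync
    # 2. Start corosync alone and inspect quorum before going further.
    run systemctl start corosync
    run corosync-quorumtool -s
    # 3. Only then start the cluster filesystem (pmxcfs).
    run systemctl start pve-cluster
}

staged_start
```

Keeping the dry run as the default makes it easy to review the order of operations before touching a live node.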

jebbam · May 22, 2026

The network configs were all last written in January 2025, so none of them have been touched in over a year. They are all the exact same size. So if it is a network issue, it could maybe be a flaky switch or something like that (?). Maybe MTU? But this network config has run perfectly fine for a year+. They all follow this pattern (identical hardware):

Code:

/etc/network/interfaces.d/*

auto lo
iface lo inet loopback

auto enp129s0f0np0
iface enp129s0f0np0 inet manual
    dns-nameservers 10.10.10.251
    dns-search libre.is
#Public Interface

auto eno1np0
iface eno1np0 inet static
    address 10.22.22.1/24
#Corosync 1

auto eno2np1
iface eno2np1 inet static
    address 10.33.33.1/24
#Corosync 2

auto enp129s0f1np1
iface enp129s0f1np1 inet manual
#Migrate Interface

iface enp129s0f2np2 inet manual

iface enp129s0f3np3 inet manual

auto enp197s0np0
iface enp197s0np0 inet static
    address 10.99.99.1/24
#Ceph

iface enxbe3af2b6059f inet manual

auto vmbr0
iface vmbr0 inet static
    address 70.39.73.131/25
    gateway 70.39.73.254
    bridge-ports enp129s0f0np0
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet static
    address 10.68.68.1/24
    bridge-ports enp129s0f1np1
    bridge-stp off
    bridge-fd 0
#Migrate Bridge

- start all nodes

Ok, they have all been running for 24 hours+.

- stop PVE services and corosync on nodes 2 and 5

OK.

Code:

systemctl stop corosync.service pvebanner.service pve-cluster.service pvedaemon.service pve-daily-update.timer pve-firewall-commit.service pve-firewall.service pvefw-logger.service pve-ha-crm.service pve-ha-lrm.service pve-lxc-syscalld.service pvenetcommit.service pveproxy.service pve-query-machine-capabilities.service pve-sdn-commit.service pvestatd.service pve-storage.target

- nodes 1, 3, 4 should work correctly now, as far as I understand

Nodes 1, 3, 4 all have green check marks next to them when I log into any of them (i.e. they are all happy with one another). Ceph is working with all 5 nodes.

- start corosync on node 2

Code:

systemctl start corosync.service

Code:

root@nh2:~# systemctl status corosync.service 
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Fri 2026-05-22 09:19:58 MDT; 20s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 571688 (corosync)
      Tasks: 9 (limit: 309003)
     Memory: 121.0M
        CPU: 219ms
     CGroup: /system.slice/corosync.service
             └─571688 /usr/sbin/corosync -f

May 22 09:20:01 nh2 corosync[571688]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 1 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
May 22 09:20:01 nh2 corosync[571688]:   [KNET  ] pmtud: Global data MTU changed to: 1397

- check with "corosync-quorumtool -s" that corosync is happy on all 4 nodes where it is running

Code:

root@nh2:~# corosync-quorumtool -s
Quorum information
------------------
Date:             Fri May 22 09:21:24 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          2
Ring ID:          1.2b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 nh1
         2          1 nh2 (local)
         3          1 nh3
         4          1 nh4
         5          1 nh5

Code:

root@nh1:~# corosync-quorumtool -s
Quorum information
------------------
Date:             Fri May 22 09:22:00 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          1
Ring ID:          1.2b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 nh1 (local)
         2          1 nh2
         3          1 nh3
         4          1 nh4
         5          1 nh5

Code:

root@nh3:~# corosync-quorumtool -s
Quorum information
------------------
Date:             Fri May 22 09:22:26 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          3
Ring ID:          1.2b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 nh1
         2          1 nh2
         3          1 nh3 (local)
         4          1 nh4
         5          1 nh5

Code:

root@nh4:~# corosync-quorumtool -s
Quorum information
------------------
Date:             Fri May 22 09:22:55 2026
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          4
Ring ID:          1.2b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 nh1
         2          1 nh2
         3          1 nh3
         4          1 nh4 (local)
         5          1 nh5

- start "pve-cluster" on node 2

Code:

root@nh2:~# systemctl start pve-cluster

Code:

root@nh2:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Fri 2026-05-22 09:24:37 MDT; 12s ago
    Process: 571930 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 571931 (pmxcfs)
      Tasks: 5 (limit: 309003)
     Memory: 7.9M
        CPU: 50ms
     CGroup: /system.slice/pve-cluster.service
             └─571931 /usr/bin/pmxcfs

May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: received all states
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: leader is 1/3198
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: synced members: 1/3198, 3/3171, 4/3149
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: waiting for updates from leader
May 22 09:24:36 nh2 pmxcfs[571931]: [status] notice: received sync request (epoch 1/3198/00000002)
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: update complete - trying to commit (got 7 inode updates)
May 22 09:24:36 nh2 pmxcfs[571931]: [dcdb] notice: all data is up to date
May 22 09:24:36 nh2 pmxcfs[571931]: [status] notice: received all states
May 22 09:24:36 nh2 pmxcfs[571931]: [status] notice: all data is up to date
May 22 09:24:37 nh2 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.

Now, when logged into the nh1 web GUI, instead of a red X next to nh2, it has a grey "?".

See attached logs.

Thanks again!

-Jeff

fabian · May 26, 2026

that looks good so far, the question mark is probably because pvestatd is not running yet on that node. could you try starting it and see if it goes "green" then? if it does, you can start the other services as well on that node.

please then try starting pve-cluster on the last remaining node and post the logs once more.

jebbam · May 26, 2026

fabian said:
that looks good so far, the question mark is probably because pvestatd is not running yet on that node. could you try starting it and see if it goes "green" then? if it does, you can start the other services as well on that node.

Awesome, that order did the trick. All looks good. I did the same for nh5 too, so they are all up. Thanks 100x for all your help!

Do you know why this may have happened? I've rebooted Proxmox nodes many times, including entire clusters that were powered off, and I never had this issue before.

Thanks again,

-Jeff

jebbam · May 27, 2026

@fabian

So everything was fine for a couple of days, then I thought I would reboot nh5 to make sure things came back up ok (as a check before doing upgrades).

But when it rebooted, it had the same issue. I am able to recover it by doing this:

Code:

systemctl stop corosync.service pvebanner.service pve-cluster.service pvedaemon.service pve-daily-update.timer pve-firewall-commit.service pve-firewall.service pvefw-logger.service pve-ha-crm.service pve-ha-lrm.service pve-lxc-syscalld.service pvenetcommit.service pveproxy.service pve-query-machine-capabilities.service pve-sdn-commit.service pvestatd.service pve-storage.target

systemctl start corosync.service

systemctl start pvestatd.service

systemctl start pve-cluster.service

systemctl start pvebanner.service pvedaemon.service pve-daily-update.timer pve-firewall-commit.service pve-firewall.service pvefw-logger.service pve-ha-crm.service pve-ha-lrm.service pve-lxc-syscalld.service pvenetcommit.service pveproxy.service pve-query-machine-capabilities.service pve-sdn-commit.service pve-storage.target

But somehow something is still wonky since it doesn't come back online happy after a reboot. Any more cluebats for me? I'm hesitant to do upgrades while this is an issue.

Thanks again,

-Jeff

fabian · May 28, 2026

did you just reboot nh5?

jebbam · May 28, 2026

fabian said:
did you just reboot nh5?

Yes, only nh5 was rebooted. It didn't come up correctly when it booted up, but running the above command sequence got it going correctly again.

fabian · May 28, 2026

could you post the journal once more, from the reboot up until you started the services again after stopping them?

jebbam · May 28, 2026

fabian said:
could you post the journal once more, from the reboot up until you started the services again after stopping them?

Ok.

I rebooted nh5, then created the nh5-journal-boot-fresh.txt file. Note: the PVE services start on boot, so they try to run.

Then I stopped the services; this is the nh5-journal-boot-stopped.txt file.

Then I started the services again, in the order in comment 8, and that is the nh5-journal-boot-started.txt file.

So the order is:

nh5-journal-boot-fresh.txt

nh5-journal-boot-stopped.txt

nh5-journal-boot-started.txt

fabian · May 29, 2026

I think I found the likely culprit

Code:

May 28 07:35:16 nh5 nh-ntpdate[3084]: CLOCK: time stepped by -934.341954

could you try to get that sorted and see if the problem goes away then?
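
A hedged sketch of how such clock steps could be spotted in a boot journal: the filter below prints any line whose "time stepped by <seconds>" value exceeds a threshold. The function name `large_clock_steps` is hypothetical, and the message format is taken only from the log line quoted above; other time clients log differently, so the pattern is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: read journal lines on stdin and print those whose
# "time stepped by <seconds>" value exceeds a threshold (default 1 second).
# The message format is taken from the log line quoted above.
large_clock_steps() {
    awk -v limit="${1:-1}" '
        {
            for (i = 1; i <= NF - 3; i++) {
                if ($i == "time" && $(i + 1) == "stepped" && $(i + 2) == "by") {
                    step = $(i + 3) + 0
                    if (step < 0) step = -step
                    if (step > limit) print
                }
            }
        }'
}

# Example use on a node (journalctl -b shows the current boot):
#   journalctl -b --no-pager | large_clock_steps 1
```

A step of roughly 934 seconds, as in the quoted line, means the clock was about 15 minutes off when the cluster services started.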

jebbam · May 29, 2026

fabian said:
Code:

May 28 07:35:16 nh5 nh-ntpdate[3084]: CLOCK: time stepped by -934.341954

could you try to get that sorted and see if the problem goes away then?

I think that may have solved it. The machines had their times synced after boot (i.e. after running for a minute), but there is a very small window in the first few seconds, presumably while services are starting, where the clock was out of sync. I guess the services don't care if the clocks get corrected after that, only what the time is at their initial startup. I'm still a bit perplexed why this issue never showed up before, since I have updated kernels etc. and rebooted these systems over the last couple of years without problems. Maybe it's the difference between a reboot and a cold boot, but even that resets the BIOS clock. Anyway, I set the clocks more accurately in the nh2/nh5 BIOS and rebooted them, and they came up happy. Reflecting now, I think those two were the ones that didn't have their WAN connections correct when initially moved, so perhaps that somehow factored in.

Thanks again!

-Jeff

janus57 · May 29, 2026

Hi,

Why do you have "ntpdate" in your log?

If I remember correctly, PVE uses chrony as its NTP client?

Best regards,
