Hi.
I have a homelab cluster with 8 nodes and Ceph storage, nothing special about the config. Yesterday I took two nodes out (removing their OSDs first, of course) because I need them for other tasks. During some tests I clicked on the Replication tab in the admin frontend, and that node became unresponsive. Later I saw that my whole cluster was kind of broken: the nodes could not see each other, although each node kept working on its own. No network changes were made. After quite some time and a few attempts at restarting services (without touching the running VMs), most nodes came back up and can see each other again, but one node still cannot.
Now I can start and stop VMs on all nodes, but I cannot open a console and cannot migrate between nodes.
Also, since the time of the problem, the history of CPU usage etc. is gone on all nodes.
All nodes had this in the log:
Code:
Sep 09 21:06:08 pve01 pvesr[2703526]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 09 21:06:09 pve01 pvesr[2703526]: error during cfs-locked 'file-replication_cfg' operation: no quorum!
Sep 09 21:06:09 pve01 systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Sep 09 21:06:09 pve01 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Sep 09 21:06:09 pve01 systemd[1]: Failed to start Proxmox VE replication runner.
Sep 09 21:07:00 pve01 systemd[1]: Starting Proxmox VE replication runner...
Sep 09 21:09:39 pve01 pvesr[2703976]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 09 21:09:40 pve01 pvesr[2703976]: trying to acquire cfs lock 'file-replication_cfg' ...
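For context: as far as I understand, the replication runner is just pvesr triggered by its systemd timer once a minute, and the job definitions live in /etc/pve/replication.cfg (which is empty here, see the directory listing further down). The state can be checked by hand with the standard tools (output omitted):
Code:
# pvesr is normally started by pvesr.timer once a minute
systemctl list-timers pvesr.timer
# show the state of any configured replication jobs
pvesr status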
Quorum seems ok:
Code:
# pvecm status
Cluster information
-------------------
Name: Cluster02
Config Version: 9
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sat Sep 10 09:24:53 2022
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000001
Ring ID: 1.386f
Quorate: Yes
Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 6
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.164 (local)
0x00000002 1 192.168.1.166
0x00000003 1 192.168.1.165
0x00000005 1 192.168.1.160
0x00000006 1 192.168.1.161
0x00000008 1 192.168.1.162
The most problematic node (.162) seems to be part of the quorum too.
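One thing I notice in the pvecm output above: expected votes is still 8 although only 6 nodes are left, so maybe the two removed nodes were never fully deleted from the cluster configuration. As far as I know the documented way to do that is pvecm delnode (the node names below are only placeholders for the two machines I pulled out), but I have not run it yet:
Code:
# remove the two retired nodes from the cluster configuration
# (placeholder names; run from a quorate node, with the removed nodes kept powered off)
pvecm delnode pve07
pvecm delnode pve08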
The failing node cannot start pveproxy and pvesr.
Code:
# systemctl list-units --type=service | grep pve
ceph-mgr@pve03.service loaded active running Ceph cluster manager daemon
ceph-mon@pve03.service loaded active running Ceph cluster monitor daemon
pve-cluster.service loaded active running The Proxmox VE cluster filesystem
pve-firewall.service loaded active running Proxmox VE firewall
pve-guests.service loaded inactive dead start PVE guests
pve-ha-crm.service loaded active running PVE Cluster HA Resource Manager Daemon
pve-ha-lrm.service loaded active running PVE Local HA Resource Manager Daemon
pve-lxc-syscalld.service loaded active running Proxmox VE LXC Syscall Daemon
pvebanner.service loaded active exited Proxmox VE Login Banner
pvedaemon.service loaded active running PVE API Daemon
pvefw-logger.service loaded active running Proxmox VE firewall logger
pvenetcommit.service loaded active exited Commit Proxmox VE network changes
pveproxy.service loaded activating start-pre start PVE API Proxy Server
pvesr.service loaded activating start start Proxmox VE replication runner
pvestatd.service loaded active running PVE Status Daemon
Code:
# journalctl -u pveproxy
-- Logs begin at Sat 2022-09-10 09:27:01 CEST, end at Sat 2022-09-10 09:34:58 CEST. --
Sep 10 09:27:27 pve03 systemd[1]: Starting PVE API Proxy Server...
Sep 10 09:27:57 pve03 pvecm[1671]: got timeout
Sep 10 09:28:57 pve03 systemd[1]: pveproxy.service: Start-pre operation timed out. Terminating.
Sep 10 09:30:27 pve03 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
Sep 10 09:30:27 pve03 systemd[1]: pveproxy.service: Killing process 1671 (pvecm) with signal SIGKILL.
Sep 10 09:30:27 pve03 systemd[1]: pveproxy.service: Killing process 1676 (pvecm) with signal SIGKILL.
Sep 10 09:30:27 pve03 systemd[1]: pveproxy.service: Control process exited, code=killed, status=9/KILL
Sep 10 09:31:57 pve03 systemd[1]: pveproxy.service: State 'stop-final-sigterm' timed out. Killing.
Sep 10 09:31:57 pve03 systemd[1]: pveproxy.service: Killing process 1676 (pvecm) with signal SIGKILL.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Processes still around after final SIGKILL. Entering failed mode.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Failed with result 'timeout'.
Sep 10 09:33:28 pve03 systemd[1]: Failed to start PVE API Proxy Server.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Service RestartSec=100ms expired, scheduling restart.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Scheduled restart job, restart counter is at 1.
Sep 10 09:33:28 pve03 systemd[1]: Stopped PVE API Proxy Server.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Found left-over process 1676 (pvecm) in control group while starting unit. Ignoring.
Sep 10 09:33:28 pve03 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 10 09:33:28 pve03 systemd[1]: Starting PVE API Proxy Server...
Sep 10 09:33:58 pve03 pvecm[2747]: got timeout
Sep 10 09:34:58 pve03 systemd[1]: pveproxy.service: Start-pre operation timed out. Terminating.
Code:
# systemctl status pveproxy.service
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
Active: deactivating (final-sigterm) (Result: timeout)
Process: 2747 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=killed, signal=KILL)
Tasks: 2 (limit: 4915)
Memory: 84.6M
CGroup: /system.slice/pveproxy.service
├─1676 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
└─2750 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
Sep 10 09:33:28 pve03 systemd[1]: Starting PVE API Proxy Server...
Sep 10 09:33:58 pve03 pvecm[2747]: got timeout
Sep 10 09:34:58 pve03 systemd[1]: pveproxy.service: Start-pre operation timed out. Terminating.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: Killing process 2747 (pvecm) with signal SIGKILL.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: Killing process 1676 (pvecm) with signal SIGKILL.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: Killing process 2750 (pvecm) with signal SIGKILL.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: Control process exited, code=killed, status=9/KILL
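The start-pre step that keeps timing out is /usr/bin/pvecm updatecerts --silent (visible in the status output above). I guess the next thing to try is running the same command by hand in the foreground to see whether and where it hangs:
Code:
# the same command pveproxy.service runs as ExecStartPre, but run manually
# (and without --silent) so a hang is visible in the foreground
pvecm updatecerts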
EDIT:
Code:
root@pve02:/etc/pve# ls -l
total 7
-rw-r----- 1 root www-data 451 Sep 9 17:26 authkey.pub
-rw-r----- 1 root www-data 451 Sep 9 17:26 authkey.pub.old
-rw-r----- 1 root www-data 2212 Jan 3 2021 ceph.conf
-rw-r----- 1 root www-data 2061 Feb 8 2019 ceph.conf.backup
-rw-r----- 1 root www-data 2121 Jan 3 2021 ceph.conf.bak
-rw-r----- 1 root www-data 1015 Feb 8 2019 corosync.conf
-rw-r----- 1 root www-data 1015 Feb 8 2019 corosync.conf.bak
-rw-r----- 1 root www-data 48 Aug 21 2019 datacenter.cfg
drwxr-xr-x 2 root www-data 0 Apr 15 2021 firewall
drwxr-xr-x 2 root www-data 0 Jan 2 2021 ha
lrwxr-xr-x 1 root www-data 0 Jan 1 1970 local -> nodes/pve02
lrwxr-xr-x 1 root www-data 0 Jan 1 1970 lxc -> nodes/pve02/lxc
drwxr-xr-x 2 root www-data 0 Dec 23 2018 nodes
lrwxr-xr-x 1 root www-data 0 Jan 1 1970 openvz -> nodes/pve02/openvz
drwx------ 2 root www-data 0 Dec 23 2018 priv
-rw-r----- 1 root www-data 2057 Dec 23 2018 pve-root-ca.pem
-rw-r----- 1 root www-data 1675 Dec 23 2018 pve-www.key
lrwxr-xr-x 1 root www-data 0 Jan 1 1970 qemu-server -> nodes/pve02/qemu-server
-rw-r----- 1 root www-data 0 Sep 9 13:02 replication.cfg
drwxr-xr-x 2 root www-data 0 Jan 2 2021 sdn
-rw-r----- 1 root www-data 629 Sep 8 17:36 storage.cfg
-rw-r----- 1 root www-data 498 Jan 11 2019 tmpfile
-rw-r----- 1 root www-data 51 Dec 23 2018 user.cfg
drwxr-xr-x 2 root www-data 0 Jan 2 2021 virtual-guest
-rw-r----- 1 root www-data 330 Sep 9 13:02 vzdump.cron
root@pve02:/etc/pve#
I see that the authkey.pub file changed at about the time I removed the two nodes. Could that be the issue?
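To see whether all nodes actually agree on that key, I suppose I can compare checksums on every node (plain coreutils; since /etc/pve is the shared cluster filesystem, the hashes should be identical everywhere):
Code:
# run on each node and compare the output
sha256sum /etc/pve/authkey.pub /etc/pve/authkey.pub.old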
Also, I cannot really browse the /etc/pve directories; it hangs when I try to list files:
Code:
root@pve01:/etc/pve# ls
authkey.pub ceph.conf.bak firewall nodes pve-www.key storage.cfg vzdump.cron
authkey.pub.old corosync.conf ha openvz qemu-server tmpfile
ceph.conf corosync.conf.bak local priv replication.cfg user.cfg
ceph.conf.backup datacenter.cfg lxc pve-root-ca.pem sdn virtual-guest
root@pve01:/etc/pve# cd local
root@pve01:/etc/pve/local# ls
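Since /etc/pve is the pmxcfs cluster filesystem provided by pve-cluster, I assume the hanging ls points at that service rather than at the disk. These are the checks I would look at next on the affected node (standard systemd/journal commands):
Code:
# /etc/pve is served by pmxcfs (pve-cluster); check the daemon and its recent log
systemctl status pve-cluster corosync
journalctl -b -u pve-cluster -u corosync --no-pager | tail -n 50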
Any ideas how to get out of this problem?
Many thx.