Hi.
I have a homelab cluster with 8 nodes and Ceph storage, nothing special about the config. Yesterday I took two nodes out (removing their OSDs first, of course) because I need them for other tasks. During some tests I clicked on the Replication tab in the admin frontend, and that node became unresponsive. Later I saw that my whole cluster was kind of broken: the nodes could not see each other, although each node kept working on its own. No network changes were made. After quite some time and a few attempts at restarting services (without touching the running VMs), most nodes came back up and can see each other again, but one node still cannot.
Now I can start and stop VMs on all nodes, but I cannot open a console and cannot migrate between nodes.
Also, since the time of the problem, the history of CPU usage etc. is gone on all nodes.
All nodes had this in the log:
Code:
Sep 09 21:06:08 pve01 pvesr[2703526]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 09 21:06:09 pve01 pvesr[2703526]: error during cfs-locked 'file-replication_cfg' operation: no quorum!
Sep 09 21:06:09 pve01 systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Sep 09 21:06:09 pve01 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Sep 09 21:06:09 pve01 systemd[1]: Failed to start Proxmox VE replication runner.
Sep 09 21:07:00 pve01 systemd[1]: Starting Proxmox VE replication runner...
Sep 09 21:09:39 pve01 pvesr[2703976]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 09 21:09:40 pve01 pvesr[2703976]: trying to acquire cfs lock 'file-replication_cfg' ...
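For context: as far as I understand, the replication runner is just pvesr triggered by its systemd timer once a minute, and the job definitions live in /etc/pve/replication.cfg (which is empty here, see the directory listing further down). The state can be checked by hand with the standard tools (output omitted):
Code:
# pvesr is normally started by pvesr.timer once a minute
systemctl list-timers pvesr.timer
# show the state of any configured replication jobs
pvesr status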
Quorum seems ok:
Code:
# pvecm status
Cluster information
-------------------
Name: Cluster02
Config Version: 9
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sat Sep 10 09:24:53 2022
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000001
Ring ID: 1.386f
Quorate: Yes
Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 6
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.164 (local)
0x00000002 1 192.168.1.166
0x00000003 1 192.168.1.165
0x00000005 1 192.168.1.160
0x00000006 1 192.168.1.161
0x00000008 1 192.168.1.162
The most problematic node (.162) seems to be part of the quorum too.
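One thing I notice in the pvecm output above: expected votes is still 8 although only 6 nodes are left, so maybe the two removed nodes were never fully deleted from the cluster configuration. As far as I know the documented way to do that is pvecm delnode (the node names below are only placeholders for the two machines I pulled out), but I have not run it yet:
Code:
# remove the two retired nodes from the cluster configuration
# (placeholder names; run from a quorate node, with the removed nodes kept powered off)
pvecm delnode pve07
pvecm delnode pve08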
The failing node cannot start pveproxy and pvesr.
Code:
# systemctl list-units --type=service | grep pve
ceph-mgr@pve03.service loaded active running Ceph cluster manager daemon
ceph-mon@pve03.service loaded active running Ceph cluster monitor daemon
pve-cluster.service loaded active running The Proxmox VE cluster filesystem
pve-firewall.service loaded active running Proxmox VE firewall
pve-guests.service loaded inactive dead start PVE guests
pve-ha-crm.service loaded active running PVE Cluster HA Resource Manager Daemon
pve-ha-lrm.service loaded active running PVE Local HA Resource Manager Daemon
pve-lxc-syscalld.service loaded active running Proxmox VE LXC Syscall Daemon
pvebanner.service loaded active exited Proxmox VE Login Banner
pvedaemon.service loaded active running PVE API Daemon
pvefw-logger.service loaded active running Proxmox VE firewall logger
pvenetcommit.service loaded active exited Commit Proxmox VE network changes
pveproxy.service loaded activating start-pre start PVE API Proxy Server
pvesr.service loaded activating start start Proxmox VE replication runner
pvestatd.service loaded active running PVE Status Daemon
Code:
# journalctl -u pveproxy
-- Logs begin at Sat 2022-09-10 09:27:01 CEST, end at Sat 2022-09-10 09:34:58 CEST. --
Sep 10 09:27:27 pve03 systemd[1]: Starting PVE API Proxy Server...
Sep 10 09:27:57 pve03 pvecm[1671]: got timeout
Sep 10 09:28:57 pve03 systemd[1]: pveproxy.service: Start-pre operation timed out. Terminating.
Sep 10 09:30:27 pve03 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
Sep 10 09:30:27 pve03 systemd[1]: pveproxy.service: Killing process 1671 (pvecm) with signal SIGKILL.
Sep 10 09:30:27 pve03 systemd[1]: pveproxy.service: Killing process 1676 (pvecm) with signal SIGKILL.
Sep 10 09:30:27 pve03 systemd[1]: pveproxy.service: Control process exited, code=killed, status=9/KILL
Sep 10 09:31:57 pve03 systemd[1]: pveproxy.service: State 'stop-final-sigterm' timed out. Killing.
Sep 10 09:31:57 pve03 systemd[1]: pveproxy.service: Killing process 1676 (pvecm) with signal SIGKILL.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Processes still around after final SIGKILL. Entering failed mode.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Failed with result 'timeout'.
Sep 10 09:33:28 pve03 systemd[1]: Failed to start PVE API Proxy Server.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Service RestartSec=100ms expired, scheduling restart.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Scheduled restart job, restart counter is at 1.
Sep 10 09:33:28 pve03 systemd[1]: Stopped PVE API Proxy Server.
Sep 10 09:33:28 pve03 systemd[1]: pveproxy.service: Found left-over process 1676 (pvecm) in control group while starting unit. Ignoring.
Sep 10 09:33:28 pve03 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 10 09:33:28 pve03 systemd[1]: Starting PVE API Proxy Server...
Sep 10 09:33:58 pve03 pvecm[2747]: got timeout
Sep 10 09:34:58 pve03 systemd[1]: pveproxy.service: Start-pre operation timed out. Terminating.
Code:
# systemctl status pveproxy.service
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
Active: deactivating (final-sigterm) (Result: timeout)
Process: 2747 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=killed, signal=KILL)
Tasks: 2 (limit: 4915)
Memory: 84.6M
CGroup: /system.slice/pveproxy.service
├─1676 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
└─2750 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
Sep 10 09:33:28 pve03 systemd[1]: Starting PVE API Proxy Server...
Sep 10 09:33:58 pve03 pvecm[2747]: got timeout
Sep 10 09:34:58 pve03 systemd[1]: pveproxy.service: Start-pre operation timed out. Terminating.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: Killing process 2747 (pvecm) with signal SIGKILL.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: Killing process 1676 (pvecm) with signal SIGKILL.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: Killing process 2750 (pvecm) with signal SIGKILL.
Sep 10 09:36:28 pve03 systemd[1]: pveproxy.service: Control process exited, code=killed, status=9/KILL
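The start-pre step that keeps timing out is /usr/bin/pvecm updatecerts --silent (visible in the status output above). I guess the next thing to try is running the same command by hand in the foreground to see whether and where it hangs:
Code:
# the same command pveproxy.service runs as ExecStartPre, but run manually
# (and without --silent) so a hang is visible in the foreground
pvecm updatecerts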
EDIT:
Code:
root@pve02:/etc/pve# ls -l
total 7
-rw-r----- 1 root www-data 451 Sep 9 17:26 authkey.pub
-rw-r----- 1 root www-data 451 Sep 9 17:26 authkey.pub.old
-rw-r----- 1 root www-data 2212 Jan 3 2021 ceph.conf
-rw-r----- 1 root www-data 2061 Feb 8 2019 ceph.conf.backup
-rw-r----- 1 root www-data 2121 Jan 3 2021 ceph.conf.bak
-rw-r----- 1 root www-data 1015 Feb 8 2019 corosync.conf
-rw-r----- 1 root www-data 1015 Feb 8 2019 corosync.conf.bak
-rw-r----- 1 root www-data 48 Aug 21 2019 datacenter.cfg
drwxr-xr-x 2 root www-data 0 Apr 15 2021 firewall
drwxr-xr-x 2 root www-data 0 Jan 2 2021 ha
lrwxr-xr-x 1 root www-data 0 Jan 1 1970 local -> nodes/pve02
lrwxr-xr-x 1 root www-data 0 Jan 1 1970 lxc -> nodes/pve02/lxc
drwxr-xr-x 2 root www-data 0 Dec 23 2018 nodes
lrwxr-xr-x 1 root www-data 0 Jan 1 1970 openvz -> nodes/pve02/openvz
drwx------ 2 root www-data 0 Dec 23 2018 priv
-rw-r----- 1 root www-data 2057 Dec 23 2018 pve-root-ca.pem
-rw-r----- 1 root www-data 1675 Dec 23 2018 pve-www.key
lrwxr-xr-x 1 root www-data 0 Jan 1 1970 qemu-server -> nodes/pve02/qemu-server
-rw-r----- 1 root www-data 0 Sep 9 13:02 replication.cfg
drwxr-xr-x 2 root www-data 0 Jan 2 2021 sdn
-rw-r----- 1 root www-data 629 Sep 8 17:36 storage.cfg
-rw-r----- 1 root www-data 498 Jan 11 2019 tmpfile
-rw-r----- 1 root www-data 51 Dec 23 2018 user.cfg
drwxr-xr-x 2 root www-data 0 Jan 2 2021 virtual-guest
-rw-r----- 1 root www-data 330 Sep 9 13:02 vzdump.cron
root@pve02:/etc/pve#
I see that the authkey.pub file changed at about the time I removed the two nodes. Could that be the issue?
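To see whether all nodes actually agree on that key, I suppose I can compare checksums on every node (plain coreutils; since /etc/pve is the shared cluster filesystem, the hashes should be identical everywhere):
Code:
# run on each node and compare the output
sha256sum /etc/pve/authkey.pub /etc/pve/authkey.pub.old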
Also, I cannot really browse the /etc/pve directories; it hangs when I try to list files:
Code:
root@pve01:/etc/pve# ls
authkey.pub ceph.conf.bak firewall nodes pve-www.key storage.cfg vzdump.cron
authkey.pub.old corosync.conf ha openvz qemu-server tmpfile
ceph.conf corosync.conf.bak local priv replication.cfg user.cfg
ceph.conf.backup datacenter.cfg lxc pve-root-ca.pem sdn virtual-guest
root@pve01:/etc/pve# cd local
root@pve01:/etc/pve/local# ls
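Since /etc/pve is the pmxcfs cluster filesystem provided by pve-cluster, I assume the hanging ls points at that service rather than at the disk. These are the checks I would look at next on the affected node (standard systemd/journal commands):
Code:
# /etc/pve is served by pmxcfs (pve-cluster); check the daemon and its recent log
systemctl status pve-cluster corosync
journalctl -b -u pve-cluster -u corosync --no-pager | tail -n 50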
Any ideas how to get out of this problem?
Many thx.