cluster issues: pvestatd and pvedaemon timeout

erwinvank

New Member
Aug 30, 2023
Hi everyone,

I have been going through a lot of posts on this forum looking for a solution, but none of them completely resolves the issue I'm facing.

Since a power failure, our Proxmox cluster (3 nodes) has been having issues. pve02 and pve03 work fine after a reboot, but pve01 fails to start pvestatd and pvedaemon. They hang for a long time and eventually, after killing "pmxcfs", they show:

Code:
~# journalctl -r -u pvedaemon.service
Dec 02 14:41:33 proxmox01 systemd[1]: pvedaemon.service: start operation timed out. Terminating.
Dec 02 14:40:03 proxmox01 systemd[1]: Starting pvedaemon.service - PVE API Daemon...
Dec 02 14:40:03 proxmox01 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Dec 02 14:40:03 proxmox01 systemd[1]: pvedaemon.service: Found left-over process 1262143 (pvedaemon) in control group while starting unit. Ignoring.
Dec 02 14:40:03 proxmox01 systemd[1]: pvedaemon.service: Consumed 6.576s CPU time.
Dec 02 14:40:03 proxmox01 systemd[1]: Stopped pvedaemon.service - PVE API Daemon.
Dec 02 14:40:03 proxmox01 systemd[1]: pvedaemon.service: Scheduled restart job, restart counter is at 1.
Dec 02 14:40:02 proxmox01 systemd[1]: pvedaemon.service: Consumed 6.576s CPU time.

Code:
# journalctl -r -u pvestatd.service
Dec 02 14:40:55 proxmox01 systemd[1]: pvestatd.service: Consumed 6.171s CPU time.
Dec 02 14:40:55 proxmox01 systemd[1]: Failed to start pvestatd.service - PVE Status Daemon.
Dec 02 14:40:55 proxmox01 systemd[1]: pvestatd.service: Unit process 1262343 (pvestatd) remains running after unit stopped.
Dec 02 14:40:55 proxmox01 systemd[1]: pvestatd.service: Failed with result 'timeout'.
Dec 02 14:40:55 proxmox01 systemd[1]: pvestatd.service: Processes still around after final SIGKILL. Entering failed mode.
Dec 02 14:39:25 proxmox01 systemd[1]: pvestatd.service: Killing process 1262343 (pvestatd) with signal SIGKILL.

I turned on debugging, but nothing gives me a clear explanation of what is wrong.
The only finding so far is that certain directories under "/etc/pve" hang when accessed, /etc/pve/local for example. So I killed "pmxcfs" and ran it in "local mode", which shows nothing wrong with the directory structure.
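
For anyone who wants to check the same symptom without locking up their shell, here is a minimal sketch of how a hanging path could be probed. The function name `probe_path` and the 5 second default are my own choices, not anything from Proxmox, and it assumes GNU coreutils `timeout` is available. Note that a process stuck in uninterruptible sleep may not die even on SIGKILL, so the probe itself can still block in the worst case.

```shell
#!/bin/sh
# probe_path PATH [SECONDS]
# Hypothetical helper: try to list PATH, give up after SECONDS (default 5).
# Prints "ok" if the listing worked, "hung" if it timed out, "error" otherwise.
probe_path() {
    probe_target=$1
    probe_secs=${2:-5}

    # -k 2: send SIGKILL two seconds after the initial SIGTERM if ls is still alive
    timeout -k 2 "$probe_secs" ls -A -- "$probe_target" >/dev/null 2>&1
    probe_rc=$?

    case $probe_rc in
        0)       echo "ok" ;;
        124|137) echo "hung" ;;   # 124 = timed out, 137 = had to be SIGKILLed
        *)       echo "error" ;;  # e.g. path does not exist
    esac
}
```

Calling `probe_path /etc/pve/local` on the affected node should then print "hung" instead of blocking the terminal. The "local mode" mentioned above is, as far as I know, `pmxcfs -l`, started after pve-cluster has been stopped.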

Has anyone encountered this before? Thanks!
 
The cluster is OK again.

In short, it turned out to be related to the pve-ha-crm / pve-ha-lrm services. I was unable to get a directory listing of /etc/pve/local, and the pmxcfs service prevented me from restarting these services. It hung.

So I copied and pasted the following commands into my terminal so that they would run in quick succession (example for LRM):

Code:
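# Stop the cluster stack so pmxcfs is no longer kept running by pve-cluster
systemctl stop corosync.service pve-cluster.service
# Force-kill any pmxcfs process that survived the stop
pkill -9 -x pmxcfs
# Restart the HA local resource manager (use pve-ha-crm for the CRM)
systemctl restart pve-ha-lrm.service
# Bring the cluster stack back up
systemctl start corosync.service pve-cluster.service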
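
To confirm that everything came back afterwards, something like the following sketch could be used. The function name `inactive_units` is hypothetical, and the list of units to pass in is an assumption based on the services mentioned in this thread; it only wraps `systemctl is-active`.

```shell
#!/bin/sh
# inactive_units UNIT...
# Hypothetical helper: print every given systemd unit that is not active.
# Prints nothing if all units are active.
inactive_units() {
    for unit in "$@"; do
        # is-active --quiet returns non-zero for failed, inactive or activating units
        if ! systemctl is-active --quiet "$unit"; then
            echo "$unit"
        fi
    done
}
```

For example, `inactive_units corosync.service pve-cluster.service pve-ha-lrm.service pve-ha-crm.service pvedaemon.service pvestatd.service` should print nothing on a healthy node, and `pvecm status` can then be used to check quorum.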
 
