cluster issues: pvestatd and pvedaemon timeout

erwinvank

New Member
Aug 30, 2023
Hi everyone,

I have been going through a lot of posts on this forum for a solution, but none of them seem to resolve the issue I'm facing completely...

Since a power failure, our Proxmox cluster (3 nodes) has been having issues... pve02 and pve03 work fine after a reboot, but pve01 fails to start pvestatd and pvedaemon: they hang for a long time and eventually, after killing "pmxcfs", they show:

Code:
~# journalctl -r -u pvedaemon.service
Dec 02 14:41:33 proxmox01 systemd[1]: pvedaemon.service: start operation timed out. Terminating.
Dec 02 14:40:03 proxmox01 systemd[1]: Starting pvedaemon.service - PVE API Daemon...
Dec 02 14:40:03 proxmox01 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Dec 02 14:40:03 proxmox01 systemd[1]: pvedaemon.service: Found left-over process 1262143 (pvedaemon) in control group while starting unit. Ignoring.
Dec 02 14:40:03 proxmox01 systemd[1]: pvedaemon.service: Consumed 6.576s CPU time.
Dec 02 14:40:03 proxmox01 systemd[1]: Stopped pvedaemon.service - PVE API Daemon.
Dec 02 14:40:03 proxmox01 systemd[1]: pvedaemon.service: Scheduled restart job, restart counter is at 1.
Dec 02 14:40:02 proxmox01 systemd[1]: pvedaemon.service: Consumed 6.576s CPU time.

Code:
# journalctl -r -u pvestatd.service
Dec 02 14:40:55 proxmox01 systemd[1]: pvestatd.service: Consumed 6.171s CPU time.
Dec 02 14:40:55 proxmox01 systemd[1]: Failed to start pvestatd.service - PVE Status Daemon.
Dec 02 14:40:55 proxmox01 systemd[1]: pvestatd.service: Unit process 1262343 (pvestatd) remains running after unit stopped.
Dec 02 14:40:55 proxmox01 systemd[1]: pvestatd.service: Failed with result 'timeout'.
Dec 02 14:40:55 proxmox01 systemd[1]: pvestatd.service: Processes still around after final SIGKILL. Entering failed mode.
Dec 02 14:39:25 proxmox01 systemd[1]: pvestatd.service: Killing process 1262343 (pvestatd) with signal SIGKILL.

I turned on debugging, but nothing gives me a clear explanation of what is wrong.
The only analysis so far is that certain directories under "/etc/pve" hang... /etc/pve/local, for example. However, killing "pmxcfs" and running it in "local mode" shows nothing wrong with the directory structure.
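For completeness, the check was roughly along these lines (a rough sketch; the 5-second timeout is just an arbitrary value, and pmxcfs -l is the documented local-mode flag):

Code:
# check whether listing the cluster filesystem hangs (5s is an arbitrary timeout)
timeout 5 ls -la /etc/pve/local || echo "listing hung or timed out"
# stop the cluster filesystem service, kill any leftover pmxcfs, then
# start pmxcfs in local mode to inspect the directory structure offline
systemctl stop pve-cluster.service
killall -9 pmxcfs
pmxcfs -l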

Has anyone encountered this before? Thanks!
 
The cluster is OK again.

In short, it turned out to be related to the pve-ha-crm / pve-ha-lrm services... I was unable to get a directory listing of /etc/pve/local, and the hanging pmxcfs prevented me from restarting these services.

So, I copy/pasted the following commands into my terminal so they would execute in quick succession (example for the LRM):

Code:
# stop corosync and the cluster filesystem
systemctl stop corosync.service pve-cluster.service
# force-kill any leftover pmxcfs processes (PID is the first column of the ps ax output)
ps ax | grep pmx | cut -d" " -f1 | xargs kill -9
# restart the HA local resource manager while the cluster filesystem is down
service pve-ha-lrm restart
# bring corosync and the cluster filesystem back up
systemctl start corosync.service pve-cluster.service
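Afterwards, checking that everything came back up is just the standard status commands (nothing special, adjust to your own node names):

Code:
# verify the daemons and HA services are running again
systemctl status pvedaemon.service pvestatd.service pve-ha-lrm.service pve-ha-crm.service
# check cluster membership and quorum
pvecm status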
 
This also saved the situation for me after I had mistakenly filled the OS drive with a bulk migration. Thanks!
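If the root filesystem is full, something like this generic sketch (not specific to my setup) shows where the space went before cleaning up and restarting the services:

Code:
# check free space on the root filesystem
df -h /
# list the largest directories under /var (just a common place for migration leftovers)
du -xh --max-depth=2 /var | sort -h | tail -n 15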