[SOLVED] pvestatd keeps failing

iSotopeOfAdmiralty

Apr 26, 2024
Hello. I have a 7-node PVE cluster, and every once in a while one or more nodes, along with all VMs and storage on those nodes, show a question mark. However, the VMs, storage, and the nodes themselves keep running without any errors.

I did my own research and found threads from other people with similar issues; they all resolved it by restarting a few services suggested in the replies. I did the same, and the issue went away for an hour or so, but then it happened again on a different node. I then ran the commands on all nodes, and the question marks turned into green checkmarks for another hour or two before the problem came back. This time, 3 nodes turned into question marks at the same time, so I started investigating myself.

I checked the status of the services mentioned in the other threads one by one to see which restart actually resolved my problem, and soon figured out that the only problematic service for me is "pvestatd". That explains why all statuses related to the affected nodes turn into question marks while the nodes themselves run just fine. I restarted that service, and only that service, and sure enough, it worked for another hour or so before the question marks came back. Here's the output of service pvestatd status:

○ pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
Active: inactive (dead) since Sat 2024-04-27 06:47:11 CDT; 37min ago
Duration: 2h 19min 42.186s
Process: 84812 ExecStop=/usr/bin/pvestatd stop (code=exited, status=0/SUCCESS)
Main PID: 2820 (code=exited, status=0/SUCCESS)
CPU: 8min 44.782s

Apr 27 06:47:10 WAVM02 pvestatd[84780]: ipcc_send_rec[2] failed: Too many open files
Apr 27 06:47:10 WAVM02 pvestatd[84780]: ipcc_send_rec[3] failed: Too many open files
Apr 27 06:47:10 WAVM02 pvestatd[84780]: got inotify poll request in wrong process - disabling inotify
Apr 27 06:47:10 WAVM02 pvestatd[84780]: ipcc_send_rec[4] failed: Too many open files
Apr 27 06:47:10 WAVM02 pvestatd[84780]: status update error: Too many open files
Apr 27 06:47:11 WAVM02 pvestatd[84780]: received signal TERM
Apr 27 06:47:11 WAVM02 pvestatd[84780]: server closing
Apr 27 06:47:11 WAVM02 pvestatd[84780]: server stopped
Apr 27 06:47:11 WAVM02 systemd[1]: pvestatd.service: Deactivated successfully.
Apr 27 06:47:11 WAVM02 systemd[1]: pvestatd.service: Consumed 8min 44.782s CPU time.
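
For anyone hitting the same "Too many open files" message: one way to check whether the daemon is running into its file descriptor limit is to compare how many descriptors it has open with the limit the kernel reports for it. The Python sketch below is only an illustration and not something posted in this thread. It assumes a Linux /proc filesystem and enough privileges to read another process's entries, and the function names are made up for the example.

```python
import os
import re


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text.

    A value of None stands for 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns are separated by runs of spaces: name, soft, hard, units.
            fields = re.split(r"\s{2,}", line.strip())
            soft, hard = fields[1], fields[2]

            def to_int(value):
                return None if value == "unlimited" else int(value)

            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for a running process (Linux only)."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        soft, _hard = parse_max_open_files(fh.read())
    return open_fds, soft
```

Calling fd_usage() with the Main PID of a running pvestatd a few minutes apart should show whether the open descriptor count keeps climbing toward the soft limit, which would point to a leak rather than a limit that is simply set too low.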
 
Take a look at the discussion here: https://forum.proxmox.com/threads/too-many-open-files.81187/

While that user's issue is marked resolved, you may have arrived at the same result differently. I'd suggest providing the same outputs that were requested from the OP in that thread.

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Thanks, but I resolved the issue.
Apparently it was because my metric server was down; the issue never occurred again after I deleted the metric server from my Proxmox configs.

I don't yet know why the two issues are related, but for now I will highlight the solution for anyone with similar issues and mark this thread as resolved.

Solution:

Step 1:
Log into the PVE web portal
Go to Datacenter -> Metric Server
Click on any inactive metric server and click "Remove"

Step 2:
Click the node(s) marked with a question mark
Click Shell
Run command:
service pvestatd restart
Wait approximately 30 seconds, then reload the webpage


I'll look into the logs and come back here to post the root cause if I find it.
Thanks for your reply.
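
To see which metric servers are configured without clicking through the web portal, the definitions can also be read from the cluster configuration. They should be stored in /etc/pve/status.cfg in the usual Proxmox section layout (a "type: id" header followed by indented "key value" lines), but treat both the path and the layout as assumptions to verify on your own system. The parser below is a rough sketch under those assumptions, and the function name is hypothetical.

```python
def parse_status_cfg(text):
    """Parse section-style config text into {(type, id): {key: value}}.

    Assumed layout (verify against your own file):
        influxdb: some-id
                server 192.0.2.10
                port 8089
    """
    servers = {}
    current = None
    for raw in text.splitlines():
        # Skip blank lines and comments.
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        if not raw[0].isspace():
            # Unindented line: a new "type: id" section header.
            kind, _, ident = raw.partition(":")
            current = servers.setdefault((kind.strip(), ident.strip()), {})
        elif current is not None:
            # Indented line: a "key value" property of the current section.
            key, _, value = raw.strip().partition(" ")
            current[key] = value.strip()
    return servers
```

Running it over the file contents gives one entry per configured metric server, with its server and port, which makes it easier to spot a target that no longer exists before removing it in the web portal.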
 
After running service pvestatd restart, the nodes get a green check mark, but after some time the same errors occur again and the question mark comes back on the nodes. My metric servers are all active, but these log entries still appear.

Please check the logs.
@bbgeek17 @fabian

Aug 19 10:25:39 cdachci pveproxy[445209]: Clearing outdated entries from certificate cache
Aug 19 10:27:20 cdachci pvestatd[448507]: lxc status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:27:27 cdachci pveproxy[444484]: worker exit
Aug 19 10:27:27 cdachci pveproxy[2645]: worker 444484 finished
Aug 19 10:27:27 cdachci pveproxy[2645]: starting 1 worker(s)
Aug 19 10:27:27 cdachci pveproxy[2645]: worker 449347 started
Aug 19 10:27:30 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:27:30 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:28:50 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:30:01 cdachci CRON[450072]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 19 10:30:01 cdachci CRON[450073]: (root) CMD ([ -x /etc/init.d/anacron ] && if [ ! -d /run/systemd/system ]; then /usr/sbin/invoke-rc.d anacron start >/dev/null; fi)
Aug 19 10:30:01 cdachci CRON[450072]: pam_unix(cron:session): session closed for user root
Aug 19 10:34:20 cdachci systemd[1]: Started anacron.service - Run anacron jobs.
Aug 19 10:34:20 cdachci anacron[451512]: Anacron 2.3 started on 2024-08-19
Aug 19 10:34:20 cdachci anacron[451512]: Normal exit (0 jobs run)
Aug 19 10:34:20 cdachci systemd[1]: anacron.service: Deactivated successfully.
Aug 19 10:34:30 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:34:30 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:35:00 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:35:00 cdachci pvestatd[448507]: lxc status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:35:01 cdachci CRON[451759]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 19 10:35:01 cdachci CRON[451760]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 19 10:35:01 cdachci CRON[451759]: pam_unix(cron:session): session closed for user root
Aug 19 10:35:10 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:35:10 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:35:10 cdachci pvestatd[448507]: lxc status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:35:20 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:36:38 cdachci pvedaemon[2461]: <root@pam> starting task UPID:cdachci:0006E6E2:0168FEA1:66C2D2DE:vncshell::root@pam:
Aug 19 10:36:38 cdachci pvedaemon[452322]: starting termproxy UPID:cdachci:0006E6E2:0168FEA1:66C2D2DE:vncshell::root@pam:
Aug 19 10:36:38 cdachci pvedaemon[2462]: <root@pam> successful auth for user 'root@pam'
Aug 19 10:36:38 cdachci login[452325]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)
Aug 19 10:36:38 cdachci systemd[1]: Created slice user-0.slice - User Slice of UID 0.
Aug 19 10:36:38 cdachci systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Aug 19 10:36:38 cdachci systemd-logind[1186]: New session 533 of user root.
Aug 19 10:36:38 cdachci systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Aug 19 10:36:38 cdachci systemd[1]: Starting user@0.service - User Manager for UID 0...
Aug 19 10:36:38 cdachci (systemd)[452331]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
Aug 19 10:36:38 cdachci systemd-xdg-autostart-generator[452345]: Exec binary 'python' does not exist: No such file or directory
Aug 19 10:36:38 cdachci systemd-xdg-autostart-generator[452345]: /etc/xdg/autostart/about-boss.desktop: not generating unit, executable specified in Exec= does not exist.
Aug 19 10:36:38 cdachci systemd[452331]: Queued start job for default target default.target.
Aug 19 10:36:38 cdachci systemd[452331]: Created slice app.slice - User Application Slice.
Aug 19 10:36:38 cdachci systemd[452331]: Created slice session.slice - User Core Session Slice.
Aug 19 10:36:38 cdachci systemd[452331]: Created slice background.slice - User Background Tasks Slice.
Aug 19 10:36:38 cdachci systemd[452331]: Reached target paths.target - Paths.
Aug 19 10:36:38 cdachci systemd[452331]: Listening on dbus.socket - D-Bus User Message Bus Socket.
Aug 19 10:36:38 cdachci systemd[452331]: Reached target sockets.target - Sockets.
Aug 19 10:36:38 cdachci systemd[452331]: Reached target basic.target - Basic System.
Aug 19 10:36:38 cdachci systemd[1]: Started user@0.service - User Manager for UID 0.
Aug 19 10:36:38 cdachci systemd[452331]: Started nss-tlsd.service - NSS TLS Daemon.
Aug 19 10:36:38 cdachci systemd[1]: Started session-533.scope - Session 533 of User root.
Aug 19 10:36:38 cdachci systemd[452331]: pipewire.service - PipeWire Multimedia Service was skipped because of an unmet condition check (ConditionUser=!root).
Aug 19 10:36:38 cdachci systemd[452331]: tracker-extract-3.service - Tracker metadata extractor was skipped because of an unmet condition check (ConditionUser=!root).
Aug 19 10:36:38 cdachci systemd[452331]: wireplumber.service: Bound to unit pipewire.service, but unit isn't active.
Aug 19 10:36:38 cdachci systemd[452331]: Dependency failed for wireplumber.service - Multimedia Service Session Manager.
Aug 19 10:36:38 cdachci systemd[452331]: wireplumber.service: Job wireplumber.service/start failed with result 'dependency'.
Aug 19 10:36:38 cdachci systemd[452331]: pipewire-pulse.service - PipeWire PulseAudio was skipped because of an unmet condition check (ConditionUser=!root).
Aug 19 10:36:38 cdachci systemd[452331]: Reached target default.target - Main User Target.
Aug 19 10:36:38 cdachci systemd[452331]: Startup finished in 113ms.
Aug 19 10:36:38 cdachci login[452348]: ROOT LOGIN on '/dev/pts/0'
Aug 19 10:37:11 cdachci pveproxy[449347]: Clearing outdated entries from certificate cache
Aug 19 10:40:10 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:40:10 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:40:10 cdachci pvestatd[448507]: lxc status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:40:20 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:40:20 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:42:20 cdachci pvestatd[448507]: lxc status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:42:30 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:42:30 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:42:30 cdachci pvestatd[448507]: lxc status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:42:35 cdachci pvedaemon[2460]: <root@pam> successful auth for user 'root@pam'
Aug 19 10:43:14 cdachci systemd-logind[1186]: Session 533 logged out. Waiting for processes to exit.
Aug 19 10:43:14 cdachci systemd[1]: session-533.scope: Deactivated successfully.
Aug 19 10:43:14 cdachci systemd-logind[1186]: Removed session 533.
Aug 19 10:43:14 cdachci pvedaemon[2461]: <root@pam> end task UPID:cdachci:0006E6E2:0168FEA1:66C2D2DE:vncshell::root@pam: OK
Aug 19 10:43:20 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:43:20 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:43:20 cdachci pvestatd[448507]: lxc status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:43:24 cdachci systemd[1]: Stopping user@0.service - User Manager for UID 0...
Aug 19 10:43:24 cdachci systemd[452331]: Activating special unit exit.target...
Aug 19 10:43:24 cdachci systemd[452331]: Removed slice session.slice - User Core Session Slice.
Aug 19 10:43:24 cdachci systemd[452331]: Removed slice background.slice - User Background Tasks Slice.
Aug 19 10:43:24 cdachci systemd[452331]: Stopped target default.target - Main User Target.
Aug 19 10:43:24 cdachci systemd[452331]: Stopping nss-tlsd.service - NSS TLS Daemon...
Aug 19 10:43:24 cdachci systemd[452331]: Stopped nss-tlsd.service - NSS TLS Daemon.
Aug 19 10:43:24 cdachci systemd[452331]: Stopped target basic.target - Basic System.
Aug 19 10:43:24 cdachci systemd[452331]: Stopped target paths.target - Paths.
Aug 19 10:43:24 cdachci systemd[452331]: Stopped target sockets.target - Sockets.
Aug 19 10:43:24 cdachci systemd[452331]: Stopped target timers.target - Timers.
Aug 19 10:43:24 cdachci systemd[452331]: Closed dbus.socket - D-Bus User Message Bus Socket.
Aug 19 10:43:24 cdachci systemd[452331]: Closed dirmngr.socket - GnuPG network certificate management daemon.
Aug 19 10:43:24 cdachci systemd[452331]: Closed gcr-ssh-agent.socket - GCR ssh-agent wrapper.
Aug 19 10:43:24 cdachci systemd[452331]: Closed gnome-keyring-daemon.socket - GNOME Keyring daemon.
Aug 19 10:43:24 cdachci systemd[452331]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 19 10:43:24 cdachci systemd[452331]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Aug 19 10:43:24 cdachci systemd[452331]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Aug 19 10:43:24 cdachci systemd[452331]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Aug 19 10:43:24 cdachci systemd[452331]: Removed slice app.slice - User Application Slice.
Aug 19 10:43:24 cdachci systemd[452331]: Reached target shutdown.target - Shutdown.
Aug 19 10:43:24 cdachci systemd[452331]: Finished systemd-exit.service - Exit the Session.
Aug 19 10:43:24 cdachci systemd[452331]: Reached target exit.target - Exit the Session.
Aug 19 10:43:24 cdachci systemd[1]: user@0.service: Deactivated successfully.
Aug 19 10:43:24 cdachci systemd[1]: Stopped user@0.service - User Manager for UID 0.
Aug 19 10:43:24 cdachci systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Aug 19 10:43:24 cdachci systemd[1]: run-user-0.mount: Deactivated successfully.
Aug 19 10:43:24 cdachci systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Aug 19 10:43:24 cdachci systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Aug 19 10:43:24 cdachci systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Aug 19 10:43:30 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:43:30 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:43:50 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:43:50 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:43:50 cdachci pvestatd[448507]: lxc status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:43:53 cdachci ceph-crash[1171]: WARNING:ceph-crash:unable to read crash path /var/lib/ceph/crash/2024-06-25T10:38:21.905500Z_d08ea75a-124d-4c58-a454-9025f4425186
Aug 19 10:44:00 cdachci pvestatd[448507]: node status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:44:00 cdachci pvestatd[448507]: qemu status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
Aug 19 10:44:00 cdachci pvestatd[448507]: lxc status update error: metrics send error 'cdachci': failed to send metrics: Connection refused
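
With this much unrelated noise in the journal, it can help to count only the pvestatd status update errors, grouped by subsystem and reason. The sketch below is an illustration rather than an official tool; the regular expression is derived only from the log lines quoted in this thread, and the function name is made up.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... pvestatd[448507]: qemu status update error: metrics send error
#   'cdachci': failed to send metrics: Connection refused
# and the prefix-less form:
#   ... pvestatd[84780]: status update error: Too many open files
PATTERN = re.compile(
    r"pvestatd\[\d+\]: (?:(?P<kind>\w+) )?status update error: (?P<msg>.*)$"
)


def tally_pvestatd_errors(lines):
    """Count pvestatd status update errors per (subsystem, final reason)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            kind = match.group("kind") or "general"
            # Keep only the last colon-separated part, e.g. "Connection refused".
            reason = match.group("msg").rsplit(": ", 1)[-1]
            counts[(kind, reason)] += 1
    return counts
```

Feeding it the lines of a journal excerpt like the one above shows at a glance whether every failure is a refused connection to the metric server or whether other errors, such as "Too many open files", are mixed in.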
 
Hi @d_singh, it's possible you have something misconfigured, or perhaps a firewall is interfering.
The "connection refused" message indicates a TCP layer issue, and you should use the usual Linux network tools and methods to troubleshoot it.
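
As a starting point, a plain TCP connect attempt from the affected node to the metric server can tell a refused connection (nothing listening on that port, or an active reject) apart from a timeout (packets silently dropped, for example by a firewall). The sketch below assumes the metric server is reached over TCP; if yours is configured for UDP, this probe will not tell you much. The function name is hypothetical, and the host and port are whatever your metric server entry uses.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and report 'open', 'refused', 'timeout', or an error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The target answered, but nothing accepts connections on this port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout; often a firewall dropping packets.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar problems.
        return f"error: {exc}"
```

A "refused" result from the node while the same probe succeeds from another machine would suggest a host-specific firewall rule or a wrong address in the metric server entry.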

Good luck

