[bug] Cluster not displaying all nodes as up

Singman

Hi,

My cluster is made up of 4 members. Sometimes the display is bugged and does not show one member as up. The node is still active (I can SSH to it, VMs are running, some stats are working).
I've tried restarting the pveproxy service, with no effect.
1741422301168.png
 
Side note: I noticed that the IO delay value on this node is pretty high compared to the other hosts (<1%).
Storage is a Ceph pool on all-NVMe SSDs, with a dedicated network on a 2.5 Gbps switch.
 
Code:
root@pve10:~# systemctl status pvestatd.service
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
     Active: active (running) since Thu 2025-03-06 22:06:20 CET; 1 day 15h ago
   Main PID: 934 (pvestatd)
      Tasks: 2 (limit: 38116)
     Memory: 177.6M
        CPU: 55min 6.602s
     CGroup: /system.slice/pvestatd.service
             ├─    934 pvestatd
             └─1205310 /usr/bin/python3 /usr/bin/ceph --version

Mar 08 06:43:15 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:43:24 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:43:34 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:43:45 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:43:54 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:04 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:14 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:24 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:34 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:44 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Normal behavior, my PBS only starts during backups :)

Restarting the pvestatd service doesn't fix the problem. But now, how can I track down why this service crashed while still showing as normal with the "status" command?
 
Alas! The bug comes back after 1 or 2 minutes, and the service still shows as running.
 
The problem is, most likely, the unreachable storage(s)...
Disable it/them in the datacenter to check.
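
For anyone preferring the shell over the Datacenter GUI, here is a minimal sketch of the same test. It assumes the storage ID is PBS_pool (taken from the pvestatd log lines above) and that `pvesm set` accepts the `--disable` flag on your PVE version; the `run` helper and the `DRY_RUN` variable are just illustrative names, not part of PVE.

```shell
#!/bin/sh
# Sketch: temporarily disable a storage so pvestatd stops polling it.
# STORAGE_ID is an assumption taken from the log lines (PBS_pool).
STORAGE_ID="${STORAGE_ID:-PBS_pool}"

# run CMD...: print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Flip the "disable" flag on the storage definition.
disable_storage() { run pvesm set "$STORAGE_ID" --disable 1; }
enable_storage()  { run pvesm set "$STORAGE_ID" --disable 0; }
```

With `DRY_RUN=1` set, `disable_storage` only prints the command so it can be reviewed first. Without it, run it as root on one node; the storage configuration is shared across the cluster, so the change applies to every member. If the node shows up again after a few minutes, that points at the storage; `enable_storage` reverts the change.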
 
How? This storage is the backup server, and that machine is only powered on when I need to run a backup...
Or do you know a command I could run from cron to mount / unmount that volume?
 
I meant disabling it/them temporarily, to confirm that this is the actual (only) culprit.
If it is, there are posts here in the forum from people who work around that situation with hookscripts, IIRC.
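
Not a hookscript, but along the lines of the cron idea raised earlier: a rough sketch of a script that enables the PBS storage only while the backup server answers on its port. The host `pbs`, port `8007` and storage ID `PBS_pool` come from the log lines in this thread; the function names, the `--apply` argument and the `DRY_RUN` variable are made up for illustration. Treat it as a starting point, not a tested solution.

```shell
#!/bin/sh
# Sketch of a cron-driven toggle for a storage that is only online sometimes.
# Host, port and storage ID are assumptions taken from the pvestatd log lines.
PBS_HOST="${PBS_HOST:-pbs}"
PBS_PORT="${PBS_PORT:-8007}"
STORAGE_ID="${STORAGE_ID:-PBS_pool}"

# pbs_reachable: succeed if a TCP connection opens within 3 seconds.
# Relies on bash's /dev/tcp feature, so bash must be installed.
pbs_reachable() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$PBS_HOST" "$PBS_PORT" 2>/dev/null
}

# desired_flag STATE: map "up" to disable=0, anything else to disable=1.
desired_flag() {
    case "$1" in
        up) echo 0 ;;
        *)  echo 1 ;;
    esac
}

# sync_storage: set the disable flag to match reachability.
# With DRY_RUN set, only print the command that would run.
sync_storage() {
    if pbs_reachable; then state=up; else state=down; fi
    flag=$(desired_flag "$state")
    if [ -n "${DRY_RUN:-}" ]; then
        echo "pvesm set $STORAGE_ID --disable $flag"
    else
        pvesm set "$STORAGE_ID" --disable "$flag"
    fi
}

# Only act when called with --apply, e.g. from a cron entry.
if [ "${1:-}" = "--apply" ]; then
    sync_storage
fi
```

Saved under a path of your choice and called with `--apply` from a root cron entry every few minutes, this would keep the storage disabled while the PBS machine is powered off. Note that it writes the flag on every run; a more careful version would compare against the current state first.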
 
Leaving my PBS server online doesn't resolve anything. The 1st node (pve10) still shows as down and the service still crashes.
Where could I find a log or a debug option?
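
On the log side, pvestatd writes to the systemd journal, so `journalctl -u pvestatd` is the first place to look. As a small sketch, the helper below (a hypothetical name, not a PVE tool) counts pvestatd messages per storage ID, assuming the messages follow the `pvestatd[PID]: <storage>: ...` shape seen in the output earlier in this thread.

```shell
#!/bin/sh
# count_pvestatd_errors: read journal lines on stdin and count pvestatd
# messages per "<name>:" prefix (the storage ID in this thread's logs).
count_pvestatd_errors() {
    awk '{
        for (i = 1; i < NF; i++)
            # find the "pvestatd[PID]:" field followed by a "<name>:" field
            if ($i ~ /^pvestatd\[[0-9]+\]:$/ && $(i + 1) ~ /:$/) {
                id = $(i + 1)
                sub(/:$/, "", id)
                count[id]++
                break
            }
    }
    END { for (id in count) print count[id], id }' | sort -rn
}

# Typical use on the affected node:
#   journalctl -u pvestatd --since "1 hour ago" --no-pager | count_pvestatd_errors
#   journalctl -u pvestatd -f        # follow new messages live
```

For a debug option, I believe `pvestatd start` accepts `--debug 1` to stay in the foreground (stop the service first), but please check `man pvestatd` on your version before relying on that.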
 
Code:
Mar 10 13:40:51 pve10 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Mar 10 13:40:51 pve10 pvestatd[1863634]: received signal TERM
Mar 10 13:40:51 pve10 pvestatd[1863634]: server closing
Mar 10 13:40:51 pve10 pvestatd[1863634]: server stopped
Mar 10 13:40:52 pve10 systemd[1]: pvestatd.service: Deactivated successfully.
Mar 10 13:40:52 pve10 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Mar 10 13:40:52 pve10 systemd[1]: pvestatd.service: Consumed 1.679s CPU time.
Mar 10 13:40:52 pve10 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Mar 10 13:40:52 pve10 pvestatd[1893341]: starting server
Mar 10 13:40:52 pve10 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Mar 10 13:41:32 pve10 pmxcfs[789]: [status] notice: received log
Mar 10 13:56:33 pve10 pmxcfs[789]: [status] notice: received log
Mar 10 14:00:34 pve10 systemd[1]: Starting apt-daily.service - Daily apt download activities...
Mar 10 14:00:34 pve10 systemd[1]: apt-daily.service: Deactivated successfully.
Mar 10 14:00:34 pve10 systemd[1]: Finished apt-daily.service - Daily apt download activities.
Mar 10 14:06:18 pve10 pmxcfs[789]: [dcdb] notice: data verification successful
Mar 10 14:11:34 pve10 pmxcfs[789]: [status] notice: received log
Mar 10 14:17:01 pve10 CRON[1900090]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Mar 10 14:17:01 pve10 CRON[1900091]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Mar 10 14:17:01 pve10 CRON[1900090]: pam_unix(cron:session): session closed for user root
Mar 10 14:27:28 pve10 pmxcfs[789]: [status] notice: received log

Still nothing noticeable...