[bug] Cluster not displaying all nodes as up

Singman · Mar 8, 2025

Hi,

My cluster is made of 4 members. Sometimes the display is bugged and doesn't show one member as up. That member is still active (I can SSH to it, VMs are running, some stats are working).
I've tried restarting the pveproxy service, with no effect.

Singman · Mar 8, 2025

Side note: I noticed that the IO delay value is pretty high compared to the other hosts (<1%).
Storage is a Ceph pool, all NVMe SSD, on a dedicated network with a 2.5 Gbps switch.

UdoB · Mar 8, 2025

Singman said:
Sometimes the display is bugged and doesn't show one member as up.

Status information is managed by "pvestatd", which runs on each node. ~# systemctl status pvestatd.service may confirm the absence of that service, while ~# systemctl start pvestatd.service will... start it.

This does not eliminate the reason why it died...

Edit: see also https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_important_service_daemons
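
To run that check for more than one daemon at a time, a minimal sketch could look like the following. It only relies on "systemctl is-active"; the function name check_units and the STATUS_CMD override are hypothetical helpers introduced here, not part of Proxmox.

```shell
#!/bin/sh
# Sketch: print the systemd state of each unit passed as an argument.
# STATUS_CMD is a hypothetical override so the logic can be tried
# on a machine without systemd; by default "systemctl is-active" is used.
check_units() {
    status_cmd="${STATUS_CMD:-systemctl is-active}"
    rc=0
    for unit in "$@"; do
        # is-active prints the state (active, inactive, failed, ...)
        state="$($status_cmd "$unit" 2>/dev/null)"
        [ -n "$state" ] || state="unknown"
        printf '%s: %s\n' "$unit" "$state"
        # Remember that at least one unit was not active
        [ "$state" = "active" ] || rc=1
    done
    return "$rc"
}
```

On a node it could be called as: check_units pvestatd.service pveproxy.service. Note that this only tells whether systemd considers the unit active; it says nothing about whether the daemon is actually delivering status updates.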

Singman · Mar 8, 2025

Code:

root@pve10:~# systemctl status pvestatd.service
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
     Active: active (running) since Thu 2025-03-06 22:06:20 CET; 1 day 15h ago
   Main PID: 934 (pvestatd)
      Tasks: 2 (limit: 38116)
     Memory: 177.6M
        CPU: 55min 6.602s
     CGroup: /system.slice/pvestatd.service
             ├─    934 pvestatd
             └─1205310 /usr/bin/python3 /usr/bin/ceph --version

Mar 08 06:43:15 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:43:24 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:43:34 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:43:45 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:43:54 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:04 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:14 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:24 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:34 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)
Mar 08 06:44:44 pve10 pvestatd[934]: PBS_pool: error fetching datastores - 500 Can't connect to pbs:8007 (No route to host)

This is normal behavior: my PBS only starts during backups.

Restarting the pvestatd service fixes the problem. But now, how can I track down why this service crashed while still showing as normal with the "status" command?

Singman · Mar 8, 2025

Alas! The bug comes back after 1 or 2 minutes, and the service still shows as running.

Neobin · Mar 8, 2025

The problem is, most likely, the unreachable storage(s)...
Disable it/them in the datacenter to check.

Singman · Mar 9, 2025

Neobin said:
The problem is, most likely, the unreachable storage(s)...
Disable it/them in the datacenter to check.

How? This storage is the backup server, and that machine is only powered on when I need to run a backup...
Or do you know a command I could run from cron to mount/unmount that volume?

Neobin · Mar 9, 2025

I meant disabling it/them temporarily to confirm that this is the actual (and only) culprit.
If it is, there are posts here in the forum from people who work around that situation with hookscripts, iirc.
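
For the temporary test, a storage can also be disabled from the shell with "pvesm set". The sketch below assumes the storage ID is PBS_pool (the name that appears in the pvestatd log lines); the function name set_storage_disabled and the APPLY switch are hypothetical. Without APPLY=1 it only prints the command it would run.

```shell
#!/bin/sh
# Sketch: toggle the "disable" flag of one storage entry.
# $1 = storage ID (PBS_pool is assumed from the log lines in this thread)
# $2 = 1 to disable, 0 to re-enable
# APPLY=1 actually runs pvesm; otherwise the command is only printed.
set_storage_disabled() {
    storage_id="$1"
    flag="$2"
    case "$flag" in
        0|1) ;;
        *) echo "flag must be 0 or 1" >&2; return 2 ;;
    esac
    if [ "${APPLY:-0}" = "1" ]; then
        pvesm set "$storage_id" --disable "$flag"
    else
        # Dry run: show what would be executed
        printf 'pvesm set %s --disable %s\n' "$storage_id" "$flag"
    fi
}
```

For example, set_storage_disabled PBS_pool 1 prints the disable command, and APPLY=1 set_storage_disabled PBS_pool 0 would re-enable the storage after the test.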

Singman · Mar 10, 2025

Leaving my PBS server online doesn't resolve anything. The first node (pve10) still shows as down and the service still crashes.
Where could I find a log or a debug option?

Gilberto Ferreira · Mar 10, 2025

Singman said:
Leaving my PBS server online doesn't resolve anything. The first node (pve10) still shows as down and the service still crashes.
Where could I find a log or a debug option?

Try

Code:

journalctl -f

to debug.
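
Since the full journal is noisy, it may help to look only at the pvestatd entries and collapse the repeated ones. This is a rough sketch, assuming the default short journal format (timestamp, host, "pvestatd[PID]:", message); the function name summarize_pvestatd is made up for this example.

```shell
#!/bin/sh
# Sketch: read journal lines on stdin, keep only pvestatd entries,
# strip timestamp/host/PID, and count how often each message occurs.
summarize_pvestatd() {
    grep -F ' pvestatd[' \
        | sed 's/^.* pvestatd\[[0-9]*\]: //' \
        | sort | uniq -c | sort -rn
}
```

It could be fed like this: journalctl -u pvestatd.service --since "1 hour ago" --no-pager | summarize_pvestatd. The most frequent message ends up at the top of the output.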

Singman · Mar 10, 2025

Code:

Mar 10 13:40:51 pve10 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Mar 10 13:40:51 pve10 pvestatd[1863634]: received signal TERM
Mar 10 13:40:51 pve10 pvestatd[1863634]: server closing
Mar 10 13:40:51 pve10 pvestatd[1863634]: server stopped
Mar 10 13:40:52 pve10 systemd[1]: pvestatd.service: Deactivated successfully.
Mar 10 13:40:52 pve10 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Mar 10 13:40:52 pve10 systemd[1]: pvestatd.service: Consumed 1.679s CPU time.
Mar 10 13:40:52 pve10 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Mar 10 13:40:52 pve10 pvestatd[1893341]: starting server
Mar 10 13:40:52 pve10 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Mar 10 13:41:32 pve10 pmxcfs[789]: [status] notice: received log
Mar 10 13:56:33 pve10 pmxcfs[789]: [status] notice: received log
Mar 10 14:00:34 pve10 systemd[1]: Starting apt-daily.service - Daily apt download activities...
Mar 10 14:00:34 pve10 systemd[1]: apt-daily.service: Deactivated successfully.
Mar 10 14:00:34 pve10 systemd[1]: Finished apt-daily.service - Daily apt download activities.
Mar 10 14:06:18 pve10 pmxcfs[789]: [dcdb] notice: data verification successful
Mar 10 14:11:34 pve10 pmxcfs[789]: [status] notice: received log
Mar 10 14:17:01 pve10 CRON[1900090]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Mar 10 14:17:01 pve10 CRON[1900091]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Mar 10 14:17:01 pve10 CRON[1900090]: pam_unix(cron:session): session closed for user root
Mar 10 14:27:28 pve10 pmxcfs[789]: [status] notice: received log

Still nothing noticeable...
