pvestatd timeout continuously logged

neroita

Hi all, I have a PVE cluster with Ceph that works perfectly, but in the logs on all nodes I get a lot of:

Code:
pvestatd[PID]: got timeout

I've tried to search for the problem, but I don't know where to look for more info to debug this.

Any ideas?
 
Hi,
what does pvesm status say? Do you have any slow/unreachable storages in your /etc/pve/storage.cfg that are not disabled?
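For reference, a quick way to cross-check this from a shell (assuming the default config path) is:
Code:
# Show all configured storages and whether they are currently reachable
pvesm status

# Review the storage configuration; any entry without a "disable" line
# is queried by pvestatd on every status cycle
cat /etc/pve/storage.cfg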
 
This is the pvesm status output:
Code:
Name            Type     Status           Total            Used       Available        %
ARCHIVIO         nfs     active      2929886208       530334720      2398990336   18.10%
CEPH             rbd     active      9853533381      1028032709      8825500672   10.43%
CEPHFS        cephfs     active      9196146688       370647040      8825499648    4.03%
EC            cephfs     active     23839543296      6188544000     17650999296   25.96%
STORE           esxi   disabled               0               0               0      N/A
local            dir   disabled               0               0               0      N/A
pbs              pbs     active      1007855616       620127360       387728256   61.53%

My storages are all working; I've tried writing some files and they all write at around 300-400 MB/s.
 
Are there any other messages in the system logs/journal around the time of the pvestatd timeout message? Do pct list and qm list return without error?
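For example, something along these lines (adjust the time window to match the timestamps of your timeout messages):
Code:
# Show everything the journal recorded around a timeout, not only pvestatd
journalctl --since "12:00" --until "12:10"

# Verify that the guest listing commands finish cleanly
qm list; echo "qm exit code: $?"
pct list; echo "pct exit code: $?"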
 
Are there any other messages in the system logs/journal around the time of the pvestatd timeout message?

They are all like this:
Code:
Mar 18 12:00:22 fame pvestatd[1350]: got timeout
Mar 18 12:01:21 fame pvestatd[1350]: got timeout
Mar 18 12:01:23 fame pvestatd[1350]: got timeout
Mar 18 12:02:01 fame pvestatd[1350]: got timeout
Mar 18 12:02:40 fame pmxcfs[1042]: [status] notice: received log
Mar 18 12:02:52 fame pvestatd[1350]: got timeout
Mar 18 12:03:01 fame pvestatd[1350]: got timeout
Mar 18 12:03:03 fame pvestatd[1350]: got timeout
Mar 18 12:06:16 fame pvestatd[1350]: got timeout
Mar 18 12:06:18 fame pvestatd[1350]: got timeout
Mar 18 12:06:18 fame pvestatd[1350]: status update time (9.448 seconds)
Mar 18 12:06:22 fame pvestatd[1350]: got timeout
Mar 18 12:06:24 fame pvestatd[1350]: got timeout
Mar 18 12:06:25 fame pvestatd[1350]: status update time (5.095 seconds)
Mar 18 12:07:43 fame pvestatd[1350]: got timeout
Mar 18 12:07:44 fame pvestatd[1350]: status update time (5.739 seconds)
Mar 18 12:07:51 fame pvestatd[1350]: got timeout



Do pct list and qm list return without error?

I don't see any errors (only VMs, no CTs here); all commands complete and exit with 0.
 
Most likely, one of the network storages sometimes takes too long to respond to the status daemon. You can try to monitor/check the latency to the network storages to find out which one it is. It might also just be general load on the network; do all storages use the same interfaces?
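One rough way to narrow it down (a sketch, not an official debug switch; storage names taken from the pvesm status output above) is to time the status query for each storage separately:
Code:
# Time the status query per storage; the one causing the timeouts
# should take noticeably longer than the others
for s in ARCHIVIO CEPH CEPHFS EC pbs; do
    echo "== $s =="
    time pvesm status --storage "$s" >/dev/null
done
Since the slowdown seems intermittent, repeating this a few times makes it more likely to catch the storage that occasionally stalls.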
 
There are:
* one Ceph storage for VMs
* one CephFS, replica 3/2
* one CephFS, EC 6/4
* one PBS
* one NFS for the VM template archive and ISOs

The network is two bonded 10Gb links for Ceph and two bonded 2.5Gb links for the VMs.

All the storages seem to work OK. Is there a way to raise the debug level to print which storage takes too long to respond?

The I/O performance is really good on all storages; I read and write at more than 200 MB/s on all of them, and IOPS are good since I also run some databases.
 
PS: The only storage that can be slow is the boot drive, since it's USB (but a good one that reads/writes at 450/180 MB/s), and I use it only for the Proxmox OS.
 
Why do all the threads about this issue just end without a solution? I've been having this exact same issue for months; I've even upgraded the kernel to the latest version and still have no solution.