Getting client timeouts in PDM system log

deebsr

Member
Nov 17, 2025
In the system log I'm seeing the following entries for all nodes in one of my clusters (25 nodes):

marking client node01 as unreachable
client timed out on request /api2/extjs/cluster/metrics/export?history=1&local%2Donly=0&start%2Dtime=0, trying another remote
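
For readability: the request path in that log line is URL-encoded, and %2D is simply a hyphen. A minimal sketch that decodes it and lists the query parameters (it only handles the %2D escape seen in this log line, not URL decoding in general):

```shell
#!/bin/sh
# Request path copied from the log entry above.
url='/api2/extjs/cluster/metrics/export?history=1&local%2Donly=0&start%2Dtime=0'

# Replace the only escape that occurs here (%2D is "-").
decoded=$(printf '%s\n' "$url" | sed 's/%2D/-/g')
echo "$decoded"

# Print one query parameter per line.
printf '%s\n' "${decoded#*\?}" | tr '&' '\n'
```

So the timed-out request asks for history=1, local-only=0 and start-time=0.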


I have another cluster that does not seem to have this issue.
I have tried removing and re-adding the cluster, which does not change anything.
I notice that on the affected cluster I'm unable to see any metric data in PDM. On the cluster itself it's OK.

PDM version: 1.1.4
PVE version: 9.2.3
 
This sounds like the request takes a very long time and eventually fails. What does the following return:

Code:
time pvesh get /cluster/metrics/export --history 1 --local-only 0 --start-time 0 > /dev/null

on a node of the affected cluster?
 
Here is the output on the affected cluster (it took a while to actually return these results, btw):

real 2m43.476s
user 0m21.497s
sys 0m14.301s


On a similar cluster (10 nodes instead of 25, and only about 100 VMs vs 2700 VMs) on the same network (same firewall access, same switches, etc.) I get the following:

real 0m6.748s
user 0m2.281s
sys 0m0.811s
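
To put the two runs side by side, here is a small sketch that converts the "real" values above into seconds and computes the ratio. The helper name to_seconds is just for illustration:

```shell
#!/bin/sh
# Convert a `time`-style duration such as "2m43.476s" into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

slow=$(to_seconds 2m43.476s)   # "real" on the affected 25-node cluster
fast=$(to_seconds 0m6.748s)    # "real" on the 10-node cluster

# Ratio of the two wall-clock times.
ratio=$(awk -v a="$slow" -v b="$fast" 'BEGIN { printf "%.1f", a / b }')
echo "slow=${slow}s fast=${fast}s ratio=${ratio}x"
# prints: slow=163.476s fast=6.748s ratio=24.2x
```

That is in the same ballpark as the roughly 27x difference in VM count between the two clusters, although two data points alone don't prove the export time scales with the number of VMs.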
 
Also, just for context: the reason I started investigating this was that I realized no metrics at all were being pulled into PDM for this cluster.
This applies to both node and VM metrics.

[Attachment: 1781732534242.png]

I'm still able to get metrics via InfluxDB without any problems.
 
Yeah, it sounds like the metrics collection task is struggling with the initial import here. There is so much data on that remote that the request takes almost three minutes to complete, and that runs into a timeout on the PDM side. We'll need to increase the timeout here or make the mechanism more flexible for bigger data sets and slower remotes. We'll look into it.

Thanks for the information!
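
For anyone who wants to check whether their own cluster would hit a similar limit, a rough sketch is to run the export under a wall-clock limit on a PVE node. The helper name within_limit and the 60-second value are only illustrative, since the actual PDM timeout isn't stated in this thread. It assumes `timeout` from GNU coreutils, which exits with status 124 when the limit is reached:

```shell
#!/bin/sh
# Run a command under a wall-clock limit and report whether it finished in time.
# Usage: within_limit SECONDS COMMAND [ARGS...]
within_limit() {
    limit=$1
    shift
    if timeout "$limit" "$@" > /dev/null 2>&1; then
        echo "finished within ${limit}s"
    else
        status=$?
        if [ "$status" -eq 124 ]; then
            # GNU timeout reports 124 when it had to stop the command.
            echo "exceeded ${limit}s"
        else
            echo "failed with exit status $status"
        fi
    fi
}

# Example on a PVE node (60 is a hypothetical limit, not the real PDM timeout):
# within_limit 60 pvesh get /cluster/metrics/export --history 1 --local-only 0 --start-time 0
```

With the timings posted above, the affected cluster would report "exceeded 60s", while the smaller one would finish well within that limit.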
 
OK, thanks for looking into this. Let me know if you need any other information from this cluster.

As a note, this one cluster has 25 nodes and about 2800 VMs. Sending metrics using the cluster setting (sending to InfluxDB) seems to be working with no issues.

Also, should I file a bug report for this, or are you going to track it internally?
 
Feel free to open a bug report for this. It might be sensible, and it gives other users a heads-up that we are already tracking this. However, I did already add this to our internal issue tracking as a point for improvement, so hopefully it won't get lost either way.


To my knowledge, InfluxDB uses an entirely different code path here. So yes, it makes sense that InfluxDB can still work in this scenario even if PDM is struggling.
 
OK, bug report filed:

https://bugzilla.proxmox.com/show_bug.cgi?id=7731

Please add any other relevant information.



Hopefully this can be fixed soon. I feel PDM has great potential to be used at some point as a centralized location for logs and metrics, similar to how vCenter is used today.
Even better if PDM could also gain functionality similar to Aria Operations/Logs.