Timeouts when calling a node's qemu endpoint

Oct 6, 2025
Hi,

after upgrading to PVE 9 we realised that on some cluster nodes our calls to the node's qemu endpoint (api2/json/nodes/<node>/qemu) started to fail (HTTP 592/593). The failures look like timeouts because they are consistent: the calls terminate after roughly 30 s. Some calls succeed but need well over 10 s to complete.

We added some monitoring and saw some interesting figures: after a reboot the response time is typically fine (well below 10 s). After starting all VMs (about 30-50) it was still fine, but after some time it grew worse. We overcommit intentionally but keep the load as well as CPU and memory usage in a viable range. Most nodes have a swap partition on an Optane disk (~700 GB), some use a dedicated NVMe disk. KSM is at most 50 GB, typically much lower.
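For anyone who wants to take comparable measurements, here is a minimal Python sketch of how a single call to the endpoint could be timed. It is not the monitoring we actually run. The helper names (`qemu_list_url`, `fetch`, `timed_call`), the 10 s "slow" threshold and the 35 s client timeout are assumptions chosen to match the figures above. Authentication uses an API token in the `Authorization: PVEAPIToken=...` header, and the host certificate is assumed to be trusted by the client.

```python
import time
import urllib.request
from typing import Callable, Optional


def qemu_list_url(host: str, node: str, port: int = 8006) -> str:
    """Build the URL of a node's qemu endpoint (8006 is the default API port)."""
    return f"https://{host}:{port}/api2/json/nodes/{node}/qemu"


def fetch(url: str, token: str, timeout: float = 35.0) -> int:
    """GET the URL with an API token and return the HTTP status code.

    `token` has the form USER@REALM!TOKENID=SECRET. The client timeout is set
    slightly above the ~30 s after which the server side gives up (assumption).
    """
    req = urllib.request.Request(url, headers={"Authorization": f"PVEAPIToken={token}"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def timed_call(do_request: Callable[[], int],
               slow_after: float = 10.0,
               clock: Callable[[], float] = time.monotonic) -> dict:
    """Run one request and record its status, duration and any error."""
    start = clock()
    status: Optional[int] = None
    error: Optional[str] = None
    try:
        status = do_request()
    except Exception as exc:  # urllib raises HTTPError for 4xx/5xx responses
        status = getattr(exc, "code", None)  # HTTPError carries the status code
        error = str(exc)
    elapsed = clock() - start
    return {"status": status, "seconds": elapsed,
            "slow": elapsed > slow_after, "error": error}
```

A call such as `timed_call(lambda: fetch(qemu_list_url("pve1.example.com", "pve1"), token))` (host and node names are placeholders) returns one sample. Running it periodically and storing the `seconds` value gives a curve like the one in the attached graph.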

There was no issue with PVE 8.

Our rough node cluster specs are:
* 13 Nodes
* 64 physical cores on each node (2x EPYC 7502 or EPYC 7H12 or EPYC 9354)
* 1TB RAM each
* All-NVMe
* Ceph + small local ZFS pool
* Frontend 10 GbE
* Ceph 100 GbE

With no change to the spec it got worse just after upgrading to PVE 9. Any idea what to look for or check?

Thanks

phb
 

Attachments

  • qemu-call-degradation.png
Hi,
Proxmox VE 9 collects more stats about virtual machines, which requires a bit more time:

General improvements for virtual guests

  • Enhanced metrics for virtual guests for a more detailed overview over resource usage. The Memory Usage graph in the VM/CT summary panel now additionally reports the host memory usage of the guest's cgroup (issue 6068). This is useful because VMs usually consume a higher amount of memory on the host than the amount that is reported from inside the guest. New graphs show CPU, IO, and memory pressure stall information of the guest's cgroup to facilitate troubleshooting.
  • Increase the RRD guest metrics aggregation window to provide greater temporal granularity. The following resolutions are now available: one point per minute for a day, one point every 30 minutes for a month, one point every six hours for a year, and one point per week for a decade. These options now match the Proxmox Backup Server metric aggregation.

However, there was a recent improvement in qemu-server >= 9.0.23, currently available in the pve-test repository, that can help in certain situations (from apt changelog qemu-server):
* vm status: also queue query-proxmox-support QMP commands to avoid stacking
timeouts.
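
To check whether a node already runs a qemu-server version with this change, the output of `pveversion -v` can be compared against 9.0.23. The following Python sketch is only an illustration: the function names are made up for this post, and the comparison looks only at the leading dotted numeric part of the version, so Debian suffixes such as `~` or `+` are ignored (an assumption that is fine for a quick check, not a full Debian version comparison).

```python
import re
from typing import Optional, Tuple


def parse_pkg_version(pveversion_output: str, package: str = "qemu-server") -> Optional[str]:
    """Extract the version from a 'package: version' line of `pveversion -v` output."""
    pattern = rf"^{re.escape(package)}:\s*(\S+)"
    match = re.search(pattern, pveversion_output, re.MULTILINE)
    return match.group(1) if match else None


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn the leading dotted numeric part ('9.0.23') into (9, 0, 23).

    Anything after the numeric prefix (for example '~1' or '+pve1') is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if not match:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def has_fix(pveversion_output: str, minimum: str = "9.0.23") -> bool:
    """True if the installed qemu-server is at least `minimum`."""
    installed = parse_pkg_version(pveversion_output)
    if installed is None:
        return False
    # Compare numerically so that 9.0.9 sorts before 9.0.23.
    return version_tuple(installed) >= version_tuple(minimum)
```

On a node, the input could come from `subprocess.run(["pveversion", "-v"], capture_output=True, text=True).stdout`. The numeric tuple comparison is used instead of a plain string comparison because `"9.0.9" >= "9.0.23"` would be true for strings.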