Timeouts when calling a node's qemu endpoint

phabaer · Oct 7, 2025

Hi,

After upgrading to PVE 9 we noticed that, on some cluster nodes, our calls to a node's qemu endpoint (api2/json/nodes/<node>/qemu) started to fail (HTTP 592/593). It looks like a timeout, since the failing calls consistently terminate after roughly 30s. Some calls succeed but need well over 10s to complete.

We added some monitoring and saw some interesting figures: right after a reboot, response times are typically fine (well below 10s). After starting up all VMs (about 30-50) they were still fine, but after some time they grew worse. We overcommit intentionally but keep the load as well as CPU and memory usage in a viable range. Most nodes have a swap partition on an Optane disk (~700 GB), some use a dedicated NVMe disk. KSM is max 50 GB, typically much lower.
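For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of timing a single call to the endpoint. It is not the monitoring used above. The helper names `timed` and `fetch_qemu_list` are hypothetical, and the sketch assumes the API is reachable on the default port 8006, that an API token is used for authentication, and that the node's certificate is trusted by the client.

```python
import time
import urllib.request


def timed(fetch):
    """Run fetch() and return (elapsed_seconds, result).

    If fetch() raises, the exception object is returned as the result,
    so slow failures (e.g. timeouts) are timed as well.
    """
    start = time.perf_counter()
    try:
        result = fetch()
    except Exception as exc:
        result = exc
    return time.perf_counter() - start, result


def fetch_qemu_list(host, node, token, timeout=60):
    """Fetch api2/json/nodes/<node>/qemu from a PVE host.

    Assumptions: port 8006, token given as 'USER@REALM!TOKENID=SECRET',
    and a certificate the client trusts.
    """
    url = f"https://{host}:8006/api2/json/nodes/{node}/qemu"
    req = urllib.request.Request(
        url, headers={"Authorization": f"PVEAPIToken={token}"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()
```

A call such as `timed(lambda: fetch_qemu_list("pve1.example", "pve1", token))` (host and node names are placeholders) returns the elapsed seconds together with either the response or the exception. Setting the client timeout above 30s makes it possible to tell a slow success from a server-side abort.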

There was no issue with PVE 8.

Our rough node cluster specs are:
* 13 Nodes
* 64 physical cores on each node (2x EPYC 7502 or EPYC 7H12 or EPYC 9354)
* 1TB RAM each
* All-NVMe
* Ceph + small local ZFS pool
* Frontend 10 GbE
* Ceph 100 GbE

With no change to the spec it got worse just after upgrading to PVE 9. Any idea what to look for or check?

Thanks

phb

fiona · Oct 7, 2025

Hi,
Proxmox VE 9 collects more stats about virtual machines, which requires a bit more time:

General improvements for virtual guests

Enhanced metrics for virtual guests for a more detailed overview over resource usage. The Memory Usage graph in the VM/CT summary panel now additionally reports the host memory usage of the guest's cgroup (issue 6068). This is useful because VMs usually consume a higher amount of memory on the host than the amount that is reported from inside the guest. New graphs show CPU, IO, and memory pressure stall information of the guest's cgroup to facilitate troubleshooting.

Increase the RRD guest metrics aggregation window to provide greater temporal granularity. The following resolutions are now available: one point per minute for a day, one point every 30 minutes for a month, one point every six hours for a year, and one point per week for a decade. These options now match the Proxmox Backup Server metric aggregation.

However, there was a recent improvement in qemu-server >= 9.0.23, currently available in the pve-test repository, that can help in certain situations (from apt changelog qemu-server):

* vm status: also queue query-proxmox-support QMP commands to avoid stacking
timeouts.

phabaer · Oct 7, 2025

Great, thanks for the hint. Will try the test version and report back!

phabaer · Oct 13, 2025

I installed 9.0.23 from test, but unfortunately it didn't really solve it. We got fewer timeouts, but requests are still back to ~30s response times.

Is there anything we can do about it? Are there configuration options available? For the time being, we will implement a workaround for all datapoints we need.

Thanks

phb

fiona · Oct 13, 2025

phabaer said:
Is there anything we can do about it? Are there configuration options available? For the time being, we will implement a workaround for all datapoints we need.

There is currently no way to opt out of collecting pressure stall information. Those stats will be used for planned features like dynamic resource scheduling (DRS). I haven't looked in detail into whether things could be optimized or whether configuration options might be sensible, so feel free to open a feature request to better keep track of the issue: https://bugzilla.proxmox.com/
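For readers unfamiliar with pressure stall information (PSI): the Linux kernel exposes it as small text files (for example `cpu.pressure`, `io.pressure`, and `memory.pressure` in a cgroup v2 directory) with one `some` and usually one `full` line each. The sketch below only illustrates that file format. How PVE itself reads these values is not described in this thread, and the function name `parse_psi` is hypothetical.

```python
def parse_psi(text):
    """Parse the contents of a PSI file into a nested dict.

    Input lines look like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    The avg* values are percentages, total is stall time in microseconds.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        kind, fields = parts[0], parts[1:]
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result
```

Reading one such file per resource and per guest is cheap on its own, but the cost adds up with the number of running guests.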
