Disk migration speed is capped by disk read/write limits in PVE 9

jnkraft

Member
Mar 30, 2021
I recently upgraded my cluster from PVE 8 to PVE 9 and noticed that offline and online disk migration (qcow2) between storage backends (NFS) has become SLOW. After a few experiments, I observed that the speed is constant and, surprisingly, matches the read and write bandwidth limits set on the disk being migrated. On PVE 8, on my 10Gb network, I had a decent migration speed (close to saturating the link), but now it is 100 MB/s, which exactly matches the limits configured for that disk. After removing the limits, the speed returns to what it should be. If I set the limits to 30 MB/s, I get a migration speed of 30 MB/s. I didn't check the impact of the IOPS limits or the burst limits for either limit type.

Since I have a lot of VMs in the cluster (it’s a work cluster), it’s a highly competitive virtualization environment in terms of IOPS and bandwidth, so most VMs have some kind of limits configured.
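
For context, this is roughly how such per-disk limits are configured (a hedged sketch with qm; the VM ID 100, the scsi0 disk, and the storage/volume names are placeholders, not my actual config):

qm set 100 --scsi0 nfs-a:100/vm-100-disk-0.qcow2,mbps_rd=100,mbps_wr=100

With values like these in place, moving that disk to another storage now tops out at about 100 MB/s after the upgrade, no matter how fast the network and backends are.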

I can say with complete certainty that neither PVE7 nor PVE8 showed this behavior. A very unexpected new "feature". What did I miss in the changelog?
 
Hi @jnkraft, this thread may be somewhat helpful:



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi, thanks for the link — I read through that thread. My limits are set to default/none, and, as I mentioned, I have a reproducible correlation between migration speed and the disk limits. And I upgraded from PVE 8 to PVE 9 last weekend, so the contrast is stark and very recent.
 
Hi,
thank you for the report! I can reproduce the issue here and will look into it. It's a regression from the switch to the QEMU -blockdev command line with machine versions >= 10.0. We accidentally use the throttled block node as the source of the mirror. Mirroring in QEMU currently only allows using top-level nodes as the source and those include the throttling, so fixing this will require patching QEMU itself too.
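
To illustrate the node layering (a simplified sketch, not the exact command line PVE generates; the node and group names are made up):

-object throttle-group,id=throttle-drive-scsi0,x-bps-read=104857600,x-bps-write=104857600
-blockdev driver=file,filename=/mnt/pve/nfs/images/100/vm-100-disk-0.qcow2,node-name=file0
-blockdev driver=qcow2,file=file0,node-name=fmt0
-blockdev driver=throttle,throttle-group=throttle-drive-scsi0,file=fmt0,node-name=drive-scsi0

The mirror used for disk moves is currently started on the top-level drive-scsi0 node, so every read for the copy passes through the throttle filter and is capped at the configured limits. Starting it below the filter (on fmt0) is what we want, but that needs the mentioned QEMU changes first.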
 
Hello!
Thanks for confirming the bug. Will wait for a fix.
 
The plan to fix this is to go back to using the limits only on the VM-facing front-end devices rather than the block nodes where the mirror job operates. I sent a preparatory patch upstream.

Two possible workarounds for now:
1. use machine version 9.2 for the VMs (this cannot be hot-applied)
2. remove the IO limits before the clone/move and reapply them afterwards (this can be done while the VM is running; see the example below)
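
A hedged sketch of the second workaround with qm (VM ID 100 and the scsi0 volume spec are placeholders; qm set replaces the whole drive string, so re-specify exactly the options you already have, minus the limits):

# note the current drive string including its limits
qm config 100 | grep scsi0
# set the same drive string without the mbps_rd/mbps_wr options (unset means unlimited)
qm set 100 --scsi0 nfs-a:100/vm-100-disk-0.qcow2
# ...move/clone the disk...
# reapply the original limits
qm set 100 --scsi0 nfs-a:100/vm-100-disk-0.qcow2,mbps_rd=100,mbps_wr=100

For the first workaround, the machine version can be pinned under the VM's Hardware -> Machine settings (or the machine: line in the VM config); the pinned version only takes effect after a full stop and start.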
 
I am confused: is this fixed?

I have two Proxmox v9 clusters, both fully updated via apt update / upgrade, and only one of the two shows this behavior where the migration is limited to the VM disk bandwidth limits.


Neither workaround is feasible for me.

Edit: I think the reason one cluster works and the other doesn't is that the VMs on the newer cluster were deployed with machine version 10+.
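
For reference, this is how I'd check which machine version a VM is on (the VM ID is a placeholder, and the running-machine/running-qemu field names are what recent PVE versions appear to report in the verbose status):

qm config 100 | grep machine
qm status 100 --verbose | grep -E 'running-(machine|qemu)'

If there is no machine: line in the config, the VM starts with the newest machine version available on the node, which on PVE 9 means >= 10.0 and therefore the -blockdev code path affected by this bug.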
 