[TUTORIAL] Proxmox VE 7.2 Benchmark: aio native, io_uring, and iothreads

bbgeek17

Hey everyone, a common question in the forum and to us is which settings are best for storage performance. We took a comprehensive look at performance on PVE 7.2 (kernel=5.15.53-1-pve) with aio=native, aio=io_uring, and iothreads over several weeks of benchmarking on an AMD EPYC system with 100G networking running in a datacenter environment with moderate to heavy load.

Here's an overview of the findings:
  • iothreads significantly improve performance for most workloads.
  • aio=native and aio=io_uring offer similar performance.
  • aio=native has a slight latency advantage for QD1 workloads.
  • aio=io_uring performance degrades in extreme load conditions.

Here's a link to the full analysis with lots of graphs and data: https://kb.blockbridge.com/technote/proxmox-aio-vs-iouring/

tldr: The test data shows a clear and significant performance improvement that supports the use of IOThreads. Performance differences between aio=native and aio=io_uring were less significant. Except for unusual behavior reported in our results for QD=2, aio=native offers slightly better performance (when deployed with an IOThread) and gets our vote for the top pick.

attention: Our recommendation for aio=native applies to unbuffered, O_DIRECT, raw block storage only; the disk cache policy must be set to none. Raw block storage types include iSCSI, NVMe, and CEPH/RBD. For thin-LVM, anything stacked on top of software RAID, and file-based solutions (including NFS and ZFS), aio=io_uring (plus an IOThread) is preferred because aio=native can block in these configurations.
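For reference, here's roughly how that maps to the command line. This is only a minimal sketch, assuming a hypothetical VMID of 100, a storage named "yourstore", and an existing disk volume; check "qm config 100" for your actual volume names before changing anything:

Code:
# raw block storage (iSCSI, NVMe, Ceph/RBD): aio=native + IOThread, cache=none
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 yourstore:vm-100-disk-0,aio=native,iothread=1,cache=none

# thin-LVM, software RAID, or file-based storage (NFS, ZFS): aio=io_uring + IOThread
qm set 100 --scsi0 yourstore:vm-100-disk-0,aio=io_uring,iothread=1,cache=none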

If you find this helpful, please let me know. I’ve got a bit more that I can share in the performance and tuning space. Questions, comments, and corrections are welcome.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
My only feedback of significance here is that I now routinely switch away from the default io_uring; it was found to be the cause of data corruption in my Windows guests.

by iothreads I assume you mean aio=threads?
 
Hi,
My only feedback of significance here is that I now routinely switch away from the default io_uring; it was found to be the cause of data corruption in my Windows guests.
what kind of data corruption? Could you provide a few more details about the configuration, i.e. what storage was used, what disk controllers, disk settings? Best to open a separate thread and mention me there with @fiona

by iothreads I assume you mean aio=threads?
Those are different settings. You can select iothread for a VirtIO Block disk, or for a SCSI disk when using the VirtIO SCSI single controller, to have QEMU handle the I/O for that disk in a separate thread (documentation). The aio setting rather decides what low-level API is used to issue the requests.
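To make the distinction concrete, here's an illustrative excerpt of how the two settings appear together in a VM config (VMID and storage name are hypothetical):

Code:
# /etc/pve/qemu-server/100.conf (excerpt)
scsihw: virtio-scsi-single
# iothread=1 -> QEMU services this disk's I/O in a dedicated thread
# aio=...    -> the low-level submission API: io_uring (default), native, or threads
scsi0: yourstore:vm-100-disk-0,iothread=1,aio=native,cache=none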
 
hi @chrcoluk

aio=threads uses a thread pool to execute synchronous system calls to perform I/O operations. As such, the number of threads grows proportionally with queue depth, and observed latency increases significantly due to context-switching overhead. aio=threads is a legacy, low-performance "aio" mechanism in QEMU. Frankly, it is not included in the performance comparison because it does not offer competitive performance compared to aio=native (i.e., Linux native AIO) or aio=io_uring (i.e., Linux io_uring) in a direct-access storage configuration.
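If you want to sanity-check the QD1 latency behavior on your own hardware, a minimal fio run inside the guest is one way to compare; this sketch assumes a scratch test disk that shows up as /dev/sdb, and the same run repeated after switching the disk between aio=native and aio=io_uring:

Code:
# 4k random read at queue depth 1 with O_DIRECT; compare the "clat" percentiles between runs
fio --name=qd1-randread --filename=/dev/sdb --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=1 --numjobs=1 \
    --time_based --runtime=60 --group_reporting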

If you are experiencing data corruption with io_uring, we encourage you to report it to the Proxmox and/or Linux folks. Our guidance towards aio=native is based on a history of anecdotal reports like yours. That said, we've never seen data corruption related to io_uring in the field.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Appreciate the explanation; sadly, aio=native cannot be used with all cache modes.
 
@bbgeek17

I stumbled on the benchmark via Google, thanks a lot for all your efforts. :)

attention: Our recommendation for aio=native applies to unbuffered, O_DIRECT, raw block storage only; the disk cache policy must be set to none. Raw block storage types include iSCSI, NVMe, and CEPH/RBD. For thin-LVM, anything stacked on top of software RAID, and file-based solutions (including NFS and ZFS), aio=io_uring (plus an IOThread) is preferred because aio=native can block in these configurations.
The quoted section makes me wonder....

"Raw block storage" - until now I thought that with thick LVM, cache=none, and "raw" as the QEMU image format, one would be using raw block storage access. Yes, LVM sits in between, but still...

Or am I wrong? I've never had problems with aio=native and the iothread=1 option...

In the case of thin LVM, or when using the QCOW2 image format, I could understand that allocation on demand (missing preallocation) could hammer the underlying storage pretty easily, but as far as I know that's not the case with thick LVM and raw?

Could you elaborate maybe?

Thanks in advance!
 
Thanks for your question!

First, it is essential to understand that the issues with blocking have less to do with large numbers of I/Os and more to do with logical dependencies that can only be resolved by performing I/O. However, it is worth stating that overwhelming a storage device's queue depth can also result in blocking.

The core takeaway should be this:

If *anything* in the I/O path can block inside the Linux kernel on submission, then aio=native can block on submission. For example, suppose metadata is needed to resolve a virtual-to-physical translation for lvm_thin (i.e., the metadata identifies where to direct a read or write request on disk). In that case, the aio=native submission syscall will block for the duration of the metadata I/O. With aio=io_uring, the I/O submission will not block.

With lvm_thick, no persistent metadata is needed for virtual to physical translations. On disk, addresses are "directly mapped." Therefore, blocking is not expected in the general case. Blocking should only be expected in exceptional cases like lock contention or memory allocation in the I/O submission path. For full disclosure, I'm not an expert on the specific implementation of lvm_thick. But, for general use, I expect it not to block.
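If you're unsure whether a given LV is thin or thick, a quick check on the host is (standard lvm2 columns):

Code:
# thin volumes show 'V' as the first lv_attr flag and reference a pool;
# classic (thick) LVs show '-' in that position and no pool
lvs -o vg_name,lv_name,lv_attr,pool_lv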


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
@bbgeek17 Thx for the explanation, makes sense :)

I think there are only a few chosen ones who truly understand the topic in depth, but your explanation lines up with my own "explanation".
Though my idea of the I/O itself being the problem was going in the wrong direction.
 
@bbgeek17 Thanks for giving us that report.

However, I have a case here that contradicts your findings, and I guess I'm going to have to start experimenting.

I'm using Proxmox 8 (no-subscription repos and up to date as of Oct 19, 2023) and I see the "BUG: soft lockup - CPU#3 stuck for 22s" issue in a VM that is on NFS storage (10Gb connection to a TrueNAS Scale NFS share), and it happens when I delete a snapshot.

The server I'm running is:
Dell R620, 24 x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz (2 Sockets) with 256GB RAM.

CPU load is low and RAM usage is only 30GB.

I have 6 hypervisors using this NFS datastore.
- 3 VMs are on this NFS datastore, each with 3 disks totaling approx. 500GB.
- 2 of those VMs are using QCOW2 and 1 VM is using raw as the disk format.

The QCOW2 disks are causing the problem (screenshot attached).

I wanted it on NFS because of quick migration times when I do an update on one of the nodes in the cluster.

Further details relevant to this thread (no pun intended):
- Async IO: Default (io_uring)
- Cache: Default (No cache)

You mentioned that this would be the recommended setting for NFS, but it doesn't look like that's the case.

Do you believe turning off Discard, changing the cache mode, or turning off SSD emulation would help?

I found in a few other threads:
- https://forum.proxmox.com/threads/vm-cpu-issues-watchdog-bug-soft-lockup-cpu-7-stuck-for-22s.107379/
- https://forum.proxmox.com/threads/c...og-bug-soft-lockup-cpu-0-stock-for-24s.84212/

That 'aio=threads' helped the performance.

I'm going to give this a shot and see if it improves things, and I guess I'll keep iterating through different permutations.

That being said, I welcome your feedback and recommendations.
 
Hi,
I'm using Proxmox 8 (no-subscription repos and up to date as of Oct 19, 2023) and I see the "BUG: soft lockup - CPU#3 stuck for 22s" issue in a VM that is on NFS storage (10Gb connection to a TrueNAS Scale NFS share), and it happens when I delete a snapshot.
for qcow2, removing snapshots live is unfortunately a synchronous operation during which the vCPUs are paused. And with NFS as the backing storage, that can take a while. If you experience issues with it, it's recommended to remove snapshots while the VM is shut down instead.
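For example, with a hypothetical VMID of 100 and a snapshot named "before-update":

Code:
qm shutdown 100
qm delsnapshot 100 before-update
qm start 100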

That 'aio=threads' helped the performance.
How did you benchmark this? Or do you mean for removing snapshots?
 
Hello Zubin,

The technote describes network-attached block storage, with testing performed on a synthetically busy host (i.e., 32 concurrently operating VMs). The goal was to target more of what we see in the real world rather than lab conditions. We can re-affirm our guidance regarding NFS storage:
"Warning: For thin-LVM, anything stacked on top of software RAID, and file-based solutions (including NFS), aio=io_uring (plus an IOThread) is preferred because aio=native can block in these configurations."

Indeed, aio=threads will block on NFS if the queue depth is high enough and the storage is slow enough. Blocking is more likely with multi-host NFS environments since NFS lacks inbuilt queue depth controls, leading to high response times under load.

Regarding aio=threads: On an unloaded system, you may see improved performance (usually bandwidth) using aio=threads. NFS carries more kernel overhead than block-based storage, resulting in higher inline processing latency. Using aio=threads can provide additional CPU concurrency, which can help hide some of this latency. However, in non-lab conditions, using aio=threads results in increased guest latency, increased host CPU utilization, and decreased responsiveness of neighboring VMs. The resulting lag is primarily the result of scheduling resource contention.

Related to aio=threads, you may find some of our upcoming research on high-bandwidth VMs with NVMe interesting. We can confirm that peak bandwidth is achieved with aio=threads due to its ability to take advantage of native multi-queue devices, such as NVMe. We're seeing >20GB/s in Windows VMs on 200G ethernet networks. However, the additional bandwidth gained from aio=threads comes at a significant CPU cost due to the inefficiencies of the aio=threads model. As the old saying goes: "Just because you can doesn't mean you should."

Regarding QCOW on NFS, Fiona is 100% correct. If you use QCOW, expect your VMs to block during storage management operations that affect the QCOW image. Specific to snapshot removal, expect delays that correlate with the size of the disk and the number of snapshots created. If you run an enterprise environment, avoiding NFS may be best. Consider block-based alternatives, including CEPH, which handles snapshot reclaim asynchronously.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi @fiona ,

Thanks for the response.
The benchmark I was doing was pretty primitive:
1. Start with a Snapshot
2. Transfer a 4 GB ISO to the VM
3. Rollback to the Snapshot
4. Check if the VM is responsive

With aio=threads, the VM was responsive; however, I noticed that @bbgeek17 mentioned that on unloaded systems (like mine) I would see improved performance, so I'm guessing this is not an ideal test scenario.

I'd really like to find the ideal settings for Proxmox on an NFS datastore where I will have copy-on-write filesystem capabilities.
 

Hi @bbgeek17 ,

Thanks for your response.


My objective is to find the best VM settings for each storage type, and I'm very much open to running tests and configurations that you suggest.
I'll run them and then post the results in this thread.

Do you have suggested benchmarks I can run? I'd be happy to contribute my findings, as I have a pretty advanced lab.
Let me describe my setup:

SHARED STORAGE:
TrueNAS Scale

6 x 4TB - ZFS RAIDZ2
2 x 256GB NVMe ZIL
1 x 256GB NVMe L2ARC
- NFS share (ZFS dataset)
- iSCSI (ZFS dataset)
- 1 x 10Gb NIC for all the shares

CLUSTER 1:
R620-001:
- Proxmox 8
- 24 x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz (2 Sockets)
- 256 GB RAM
- 8 x 1.2TB HDD (enterprise grade) ZFS RAIDZ2
- 105GB swap - LVM RAID6 (17GB partition on each disk)
- vm.swappiness = 5

R620-002:
- Proxmox 8
- 32 x Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz (2 Sockets)
- 256 GB RAM
- 8 x 1.2TB HDD (enterprise grade) ZFS RAIDZ2
- 105GB swap - LVM RAID6 (17GB partition on each disk)
- vm.swappiness = 5

R620-003:
- Proxmox 8
- 32 x Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz (2 Sockets)
- 256 GB RAM
- 8 x 1.2TB HDD (enterprise grade) ZFS RAIDZ2
- 105GB swap - LVM RAID6 (17GB partition on each disk)
- vm.swappiness = 5

Here is the storage configuration for that cluster (screenshot attached):
- NFS (10Gb - TrueNAS Scale)
- GlusterFS (10Gb - 200GB brick on top of the ZFS, with a replica of 3, meaning the files are replicated across all 3 servers)
- iSCSI LVM (10Gb - TrueNAS Scale, same as above - basic setup; I believe I can create more optimal configurations, I'll need to learn)
- i.e. multipath setups, etc.
- Local ZFS


NETWORK (screenshot attached)



CLUSTER 2:
SuperMicro-001 + SuperMicro-002 + SuperMicro-003 - 3 IDENTICAL SERVERS
- Proxmox 8
- 80 x Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (2 Sockets)
- 512 GB RAM
- 2 x 256GB SSD (Samsung 870 EVO) - Proxmox OS on RAID
- 6 x 1TB SSD (Samsung 870 EVO) Ceph (size/min = 3/2, 128 PGs) (super fast)
- 1 x TB NVMe (WAL/DB; lvs output screenshot attached)
- 20 GB Swap - LVM RAID1 (on the 256 GB disk)
- vm.swappiness = 5

NETWORK (screenshot attached)




Cheers,

Zubin
 
Hi @Zubin Singh Parihar ,

I will reply only about your first cluster. Benchmarks are good only as a starting point; without knowing your load, any synthetic benchmark is not that useful...

Some general ideas:

- it's not a good thing to split your HDDs between two different storage uses (ZFS and swap) on each node
- some ARC statistics would be useful for each node (arc_summary)
- because all of your nodes use raidz2, any load will be IOPS-limited to roughly that of a single HDD, so with many VMs (I guess you have many), that will be the main restriction
- you also use GlusterFS, which is not so fast on top of ZFS (my guess); maybe a 3-way mirror over iSCSI would be better
- L2ARC can get at best ~10% of ZFS hits, but at the cost of a lot of RAM that could be better used for the ARC
- I would try to use 2x NVMe as a ZFS special device and 1x NVMe as SLOG (without details about your setup, I cannot be sure); see the sketch below
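A rough sketch of that last idea (the pool name "tank" and the device paths are placeholders; note that a special vdev cannot be removed again from a raidz pool, so double-check before adding it):

Code:
# mirrored special vdev for metadata (and optionally small blocks)
zpool add tank special mirror /dev/disk/by-id/nvme-AAA /dev/disk/by-id/nvme-BBB
# single SLOG device for synchronous writes
zpool add tank log /dev/disk/by-id/nvme-CCC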

Good luck / Bafta !
 
Hi @fiona ,

Thanks for the response.
The benchmark I was doing was pretty primitive:
1. Start with a Snapshot
2. Transfer a 4 GB ISO to the VM
3. Rollback to the Snapshot
4. Check if the VM is responsive
Do you mean "remove the snapshot" in step 3? Because after rollback there will be a new QEMU instance with the VM state restored (or no running VM if the RAM was not included in the snapshot). The VM should always be responsive after a rollback (except if it wasn't responsive at the time the snapshot was taken :p), that should not depend on the backing storage configuration.
 
@guletz

Thanks for your suggestions.
Question for you: I've never heard of a "3-way mirror iSCSI". How would you suggest I set one up?

Thanks
 
Do you mean "remove the snapshot" in step 3? Because after rollback there will be a new QEMU instance with the VM state restored (or no running VM if the RAM was not included in the snapshot). The VM should always be responsive after a rollback (except if it wasn't responsive at the time the snapshot was taken :p), that should not depend on the backing storage configuration.
Hi @fiona , yes I meant "remove the snapshot"
Good catch!
 
One VM, and an mdraid using 3 iSCSI block devices.
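(For reference, a minimal sketch of what a 3-way mdraid mirror over three attached block devices could look like; the device names are hypothetical and depend on how the iSCSI-backed disks are presented:)

Code:
# create a RAID1 mirror with three members, then put a filesystem on it
mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
mkfs.ext4 /dev/md0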
Hi @guletz ,
I just want to unpack your explanation a little bit more if you don't mind...

Do you mean create an iSCSI block device on each of my Proxmox servers, attach each iSCSI device to each of the Proxmox servers, and then put LVM on top of each attached iSCSI device? Then, when I create a VM, use three LVM disks of the same size, create a 3-way MDRAID, and install the operating system on top of that?


Or

Do you mean create an iSCSI block device on each of my Proxmox servers, attach each iSCSI device to each of the Proxmox servers, and then build a three-way MD RAID on the iSCSI devices and install a VM on the MDRAID?

Please advise.
 
