[TUTORIAL] Proxmox VE 7.2 Benchmark: aio native, io_uring, and iothreads

Hey everyone, a common question in the forum and to us is which settings are best for storage performance. We took a comprehensive look at performance on PVE 7.2 (kernel=5.15.53-1-pve) with aio=native, aio=io_uring, and iothreads over several weeks of benchmarking on an AMD EPYC system with 100G networking running in a datacenter environment with moderate to heavy load.

Here's an overview of the findings:
  • iothreads significantly improve performance for most workloads.
  • aio=native and aio=io_uring offer similar performance.
  • aio=native has a slight latency advantage for QD1 workloads.
  • aio=io_uring performance degrades in extreme load conditions.

Here's a link to the full analysis with lots of graphs and data: https://kb.blockbridge.com/technote/proxmox-aio-vs-iouring/

tldr: The test data shows a clear and significant performance improvement that supports the use of IOThreads. Performance differences between aio=native and aio=io_uring were less significant. Except for unusual behavior reported in our results for QD=2, aio=native offers slightly better performance (when deployed with an IOThread) and gets our vote for the top pick.
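For anyone who wants to reproduce the QD1 latency comparison inside a guest, a minimal fio job along these lines is a reasonable starting point (the device path and job values here are illustrative, not the exact job files from our tests):

```ini
; qd1-randread.fio - sketch of a QD1 random-read latency test, run inside the guest.
; /dev/sdb is a placeholder; point filename at an idle test disk.
[qd1-randread]
filename=/dev/sdb
rw=randread
bs=4k
; iodepth=1 means one outstanding I/O, so completion latency reflects per-request latency
iodepth=1
; direct=1 uses O_DIRECT and bypasses the guest page cache
direct=1
ioengine=libaio
runtime=60
time_based=1
```

Run it once with the virtual disk set to aio=native and once with aio=io_uring, and compare the clat percentiles in the fio output.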

attention: Our recommendation for aio=native applies to unbuffered, O_DIRECT, raw block storage only; the disk cache policy must be set to none. Raw block storage types include iSCSI, NVMe, and CEPH/RBD. For thin-LVM, anything stacked on top of software RAID, and file-based solutions (including NFS and ZFS), aio=io_uring (plus an IOThread) is preferred because aio=native can block in these configurations.
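To make that concrete, here is roughly how the recommendation translates into Proxmox disk options via qm (the VM ID, storage names, and volume names below are placeholders):

```shell
# Raw block storage (e.g. iSCSI, NVMe, CEPH/RBD): aio=native + IOThread, cache=none.
qm set 100 --scsihw virtio-scsi-single   # IOThreads need the single-queue SCSI controller
qm set 100 --scsi0 blockstore:vm-100-disk-0,iothread=1,aio=native,cache=none

# File-based or stacked storage (NFS, ZFS, thin-LVM, software RAID): prefer io_uring.
qm set 100 --scsi1 nfsstore:100/vm-100-disk-1.qcow2,iothread=1,aio=io_uring
```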

If you find this helpful, please let me know. I’ve got a bit more that I can share in the performance and tuning space. Questions, comments, and corrections are welcome.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Good morning, is everything fine?

I have a question, please: I have 2 NVMe, 5 HDD, and 2 SSD disks, all in passthrough.

The question is: can these passthrough disks use Async IO set to native?

And I have one more partition as local (LVM, the Proxmox installation) which is on NVMe; can I also set native there?

Thanks
 
Interesting addition here: the OP recommends raw combined with an iothread; however, I get this on raw storage.

Code:
Block format 'raw' does not support the option 'iothread'
 
>>>
attention: Our recommendation for aio=native applies to unbuffered, O_DIRECT, raw block storage only; the disk cache policy must be set to none. Raw block storage types include iSCSI, NVMe, and CEPH/RBD. For thin-LVM, anything stacked on top of software RAID, and file-based solutions (including NFS and ZFS), aio=io_uring (plus an IOThread) is preferred because aio=native can block in these configurations.
 
For the benefit of others: if you add an iothread normally to a raw device in Proxmox, it won't fail to start. I got the error because I added it manually as -args in the config file; if you do it via Proxmox, it will silently not add the iothread flag on raw devices to prevent the error.
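For context on why the manual -args approach likely fails: in QEMU, iothread is a property of the virtio device, not of the block driver, so passing it as a drive option gets interpreted as a block-format option. A sketch of the distinction (device paths and ids are placeholders):

```shell
# Fails: iothread passed as a drive option is treated as a block-format option,
# producing "Block format 'raw' does not support the option 'iothread'":
#   -drive file=/dev/sdb,format=raw,if=virtio,iothread=io0

# Works: define an iothread object and attach it to the device instead:
qemu-system-x86_64 \
  -object iothread,id=io0 \
  -drive file=/dev/sdb,format=raw,if=none,id=drive0,cache=none,aio=native \
  -device virtio-blk-pci,drive=drive0,iothread=io0
```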

Thanks @_gabriel. With the combination of his graphs and that description, it wasn't entirely clear to me that he meant to only use an iothread on file-based storage.

I have also kept aio=threads on some I/O-heavy HDD storage, as that benefits from it (I guess due to higher queue depth); I noticed it performs considerably worse with both native and io_uring, and it has plenty of spare CPU cycles for it. But I will use these recommendations on flash storage and on devices with only low I/O.
 
My only feedback of significance here is that I now routinely switch away from the default io_uring; it was found to be the cause of data corruption in my Windows guests.

By iothreads I assume you mean aio=threads?
Hi,

What sort of data corruption issue did you face? And what setting did you change in Proxmox to get out of the corruption situation?

I am facing a database corruption issue on a Windows Server 2022 guest VM which is hosting MS SQL Server 2017. You may refer to the post -

 
@bbgeek17 I'm going to add multithreading / multiple iothreads per disk for PVE 9. So if you have time to benchmark with a big NVMe array, it would be great :). Currently I'm planning to have a number of iothreads shared across multiple disks, like Red Hat is doing, and I don't know if optional iothread pinning per disk could help for some workloads.
 
Hi @aderumier great timing!

There have actually been some recent internal conversations about revisiting this topic, especially in light of what we've seen from high-performance deployments.

The impact of adding more I/O threads can be a bit nuanced. On the one hand, you should see improved IOPS (assuming the storage system can keep up); on the other hand, you might experience slightly higher latency (in non-synthetic settings) if the threads aren't pinned and are free to float.
If you're already exploring this area, it's definitely worth considering support for pinning to help users get the best possible performance. In our experience, people who are looking for extreme IOPS in the VM are likely to consider hardware and process layout to achieve their desired performance. Pinning is an essential tool.

The interaction between iothreads, io_uring, and native AIO is surprisingly complex, as the outcomes can depend on several factors, including storage, protocol, and hardware layout. That said, we're happy to help in any way we can! Are there any specific comparisons or technical details you're interested in digging into?

Also, do you know the timeline for PVE9? At least when it will hit the Test repo?


 
>>>
Hi @aderumier great timing!

There have actually been some recent internal conversations about revisiting this topic, especially in light of what we've seen from high-performance deployments.

The impact of adding more I/O threads can be a bit nuanced. On the one hand, you should see improved IOPS (assuming the storage system can keep up); on the other hand, you might experience slightly higher latency (in non-synthetic settings) if the threads aren't pinned and are free to float.
If you're already exploring this area, it's definitely worth considering support for pinning to help users get the best possible performance. In our experience, people who are looking for extreme IOPS in the VM are likely to consider hardware and process layout to achieve their desired performance. Pinning is an essential tool.
Yes, I know this can increase latency. I need to look at adding an option to pin iothreads to specific CPUs, like for the VM CPU cores (ideally, iothreads get dedicated cores and the VM CPUs get other cores, on the same NUMA node as the NVMe drive).

https://vmsplice.net/~stefan/stefanha-kvm-forum-2024.pdf (slide 13)

But even without pinning, the performance with 2~4 iothreads is already great (200~300k IOPS; I haven't benchmarked latency):
https://developers.redhat.com/artic...isk-io-iothread-virtqueue-mapping#performance
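For reference, the mapping described in that Red Hat article looks roughly like this at the QEMU level (QEMU 9.0+ JSON device syntax; node names, ids, and the device path below are placeholders). Until pinning is built in, iothreads can also be pinned from the host with taskset:

```shell
# One virtio-blk device served by two iothreads via iothread-vq-mapping:
qemu-system-x86_64 \
  -object iothread,id=io0 -object iothread,id=io1 \
  -blockdev node-name=disk0,driver=host_device,filename=/dev/nvme0n1,cache.direct=on \
  -device '{"driver":"virtio-blk-pci","drive":"disk0","iothread-vq-mapping":[{"iothread":"io0"},{"iothread":"io1"}]}'

# Manual pinning: look up thread ids with the QMP command "query-iothreads",
# then bind each one to a core, e.g.:
#   taskset -pc 4 <thread-id>
```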
>>>
The interaction between iothreads, io_uring, and native AIO is surprisingly complex, as the outcomes can depend on several factors, including storage, protocol, and hardware layout. That said, we're happy to help in any way we can! Are there any specific comparisons or technical details you're interested in digging into?

Also, do you know the timeline for PVE9? At least when it will hit the Test repo?

I really don't know, maybe this summer or September, but the big patches for qemu blockdev are already in git; multithreading is pretty simple to add. (I already have the patches, just waiting for the blockdev support to be 100% complete.)

(I'll try to finish qcow2 snapshot support for LVM shared SAN storage too.)
 
From what we're seeing, it's definitely possible to hit around 275,000 to 300,000 IOPS today using just a single I/O thread to a single device. That said, we do see workloads with demands that go beyond what a single thread can handle... whether it's more IOPS or more bandwidth. In those cases, we currently use alternative approaches, like network direct, to meet the need.

Ideally, we'd love to see performance scale in a mostly linear way, especially on a single device. Today's systems can easily reach a million IOPS and more than 10 GB/s per device on bare metal, and achieving that kind of performance inside a VM out of the box would be great.

That said, while higher IOPS are great for benchmarks, latency is king. Any additional controls you can implement in this area will help improve latency consistency... even if it's only for a single I/O thread :)

If there's anything we can do, please don't hesitate to ping me.


 
>>>
From what we're seeing, it's definitely possible to hit around 275,000 to 300,000 IOPS today using just a single I/O thread to a single device. That said, we do see workloads with demands that go beyond what a single thread can handle...
I don't know what kind of CPU Red Hat is using, but it's really scaling across multiple threads (around x2~x4 from baseline). Personally, I really want to test it on Ceph RBD, because that is currently really CPU-limited on the client side, at around 70k IOPS.

>>>
That said, while higher IOPS are great for benchmarks, latency is king. Any additional controls you can implement in this area will help improve latency consistency... even if it's only for a single I/O thread :)
I think there are still other improvements possible for latency, especially for local NVMe drives, with kernel bypass (vhost-user-blk-pci, vDPA, ...). But I never had time to work on it.