Improve virtio-blk device performance using iothread-vq-mapping

werter

Hi there!
Will this feature be added to Proxmox VE?
https://blogs.oracle.com/linux/post/virtioblk-using-iothread-vq-mapping

The virtio-blk device has supported multi-queue for quite a while. It is used to improve performance during heavy I/O by processing the queues in parallel. However, before QEMU 9.0, all the virtqueues were processed by a single IOThread or the main loop, and this single thread can become a CPU bottleneck.
Now, in QEMU 9.0, the ‘virtio-blk’ device offers real multiqueue functionality, allowing multiple IOThreads to process distinct virtqueues of a single disk and thus distribute the workload. It is now possible to specify the mapping between multiple IOThreads and virtqueues for a virtio-blk device. This can help improve scalability, in particular in situations where the guest generates enough I/O to overload the host CPU when the virtio-blk requests are processed by a single IOThread.
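For reference, the raw QEMU 9.0 command-line syntax for this looks roughly like the sketch below (the iothread ids, node names and image path are placeholders; the -device argument has to be given as JSON because the mapping is a list):

Code:
# Minimal sketch: two IOThreads, each handling two of the disk's four virtqueues
qemu-system-x86_64 -m 1G \
  -object iothread,id=iothread0 \
  -object iothread,id=iothread1 \
  -blockdev driver=file,filename=/path/to/disk.img,node-name=file0 \
  -blockdev driver=raw,file=file0,node-name=drive0 \
  -device '{"driver":"virtio-blk-pci","drive":"drive0","num-queues":4,"iothread-vq-mapping":[{"iothread":"iothread0","vqs":[0,1]},{"iothread":"iothread1","vqs":[2,3]}]}'

If the "vqs" lists are left out, the virtqueues are assigned round-robin across the listed IOThreads.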
 
Just a +1 here. We are using Xiraid and getting great host performance. However, single guest disk performance is limited to ~30K IOPS for 4k random writes. Being able to configure the number of IOThreads per virtual hard disk would be amazing!
 
Mmh, I don't think there's such a bad limit this time, as even on quite old hardware fio 4k randwrite IOPS can go quite a bit higher, as seen in the pic ;)
 

Attachments

  • vm-fio-randw.png
Sorry, I'm referring to synchronous 4k random write. The most difficult test.

Code:
job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.33
Starting 32 processes

job: (groupid=0, jobs=32): err= 0: pid=2396: Sat Aug 10 15:43:11 2024
  write: IOPS=29.7k, BW=116MiB/s (122MB/s)(408GiB/3600001msec); 0 zone resets
    clat (usec): min=24, max=8218, avg=1074.77, stdev=673.68
     lat (usec): min=24, max=8218, avg=1075.05, stdev=673.68
    clat percentiles (usec):
     |  1.00th=[  227],  5.00th=[  355], 10.00th=[  412], 20.00th=[  515],
     | 30.00th=[  619], 40.00th=[  734], 50.00th=[  873], 60.00th=[ 1074],
     | 70.00th=[ 1287], 80.00th=[ 1582], 90.00th=[ 2089], 95.00th=[ 2442],
     | 99.00th=[ 3130], 99.50th=[ 3359], 99.90th=[ 3884], 99.95th=[ 4146],
     | 99.99th=[ 4883]
   bw (  KiB/s): min=81208, max=269752, per=100.00%, avg=118913.33, stdev=494.06, samples=230367
   iops        : min=20302, max=67438, avg=29725.59, stdev=123.53, samples=230367
  lat (usec)   : 50=0.13%, 100=0.42%, 250=0.90%, 500=17.22%, 750=22.81%
  lat (usec)   : 1000=15.07%
  lat (msec)   : 2=32.11%, 4=11.27%, 10=0.07%
  cpu          : usr=0.49%, sys=1.16%, ctx=106989750, majf=0, minf=1172
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,106978320,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=408GiB (438GB), run=3600001-3600001msec

Disk stats (read/write):
  vdb: ios=56/131279695, merge=0/0, ticks=23/137896655, in_queue=137896678, util=100.00%

Code:
[global]
blocksize=${BS}K
time_based=1
ioengine=psync
sync=1
direct=1
runtime=${RUNTIME}
ramp_time=${RAMPTIME}
random_generator=tausworthe64
readwrite=randwrite
filename_format=${OUTPUT}
numjobs=${NUMJOBS}
group_reporting=1
[job]
 
They are all visible in the fio output above (except that the reported runtime is ramp_time + runtime), but here they are explicitly:

Code:
BS=4 NUMJOBS=32 RUNTIME=300 RAMPTIME=60 OUTPUT=/dev/vdb
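For completeness, fio expands the ${...} placeholders from the environment, so assuming the job file above is saved as randwrite.fio (the file name is just an example), the run is started like this:

Code:
BS=4 NUMJOBS=32 RUNTIME=300 RAMPTIME=60 OUTPUT=/dev/vdb fio randwrite.fio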
 
That VM fio run kills the T430 host, but not as badly ...
 

Attachments

  • vm-fio-randw-32j.png
Yeah, that's exactly what I'm describing. You seem to be limited to ~20k IOPS.

Synchronous writes are seriously thread-limited. I can get ~30k IOPS on an INTEL(R) XEON(R) GOLD 6530.

The changes in QEMU 9 are extremely helpful in this regard.
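As a quick check (a hedged example, the binary path may differ on your setup), you can see whether the QEMU build shipped with Proxmox VE already exposes the new property by listing the virtio-blk-pci device properties:

Code:
# Look for the iothread-vq-mapping property (present since QEMU 9.0)
/usr/bin/qemu-system-x86_64 -device virtio-blk-pci,help | grep iothread-vq-mapping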
 
Tomorrow I'll make a new wishlist for a couple of Dell R770 CSP-Edition servers, as I now have a reason for that :)
 
Yes, but the hope is that the Hard Disk interface in Proxmox VE gets extended to support mapping multiple IOThreads to the virtqueues, so it can take advantage of this.
 
Now with the 2nd disk attached as virtio-blk instead of SCSI ...
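In case it helps anyone reproduce this: one way to reattach an existing volume on the virtio-blk bus is via qm set (VM ID, bus slots and volume ID below are just placeholders):

Code:
# Detach the disk from its SCSI slot; the volume stays around as an "unused" disk
qm set 100 --delete scsi1
# Reattach the same volume as a virtio-blk disk
qm set 100 --virtio1 local-zfs:vm-100-disk-1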
 

Attachments

  • vm-fio-randw-32j-virtioblk.png
any mail client should do ;)
 
