Improve virtio-blk device performance using iothread-vq-mapping

werter

Hi there!
Will this feature be added to Proxmox VE?
https://blogs.oracle.com/linux/post/virtioblk-using-iothread-vq-mapping

The virtio-blk device has supported multi-queue for quite a while. It improves performance under heavy I/O by processing the queues in parallel. However, before QEMU 9.0 all virtqueues were processed by a single IOThread or by the main loop, and that single thread can become a CPU bottleneck.
As of QEMU 9.0, the virtio-blk device offers real multi-queue functionality: the mapping between multiple IOThreads and the virtqueues of a single disk can now be specified explicitly, so distinct queues are serviced by distinct IOThreads and the workload is distributed across host CPUs. This helps scalability in particular when the guest issues enough I/O to overload the host CPU while a single IOThread processes all virtio-blk requests.
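For illustration, this is roughly what the mapping looks like on the raw QEMU command line (a sketch based on the QEMU 9.0 schema; all IDs and the image path here are made up, and the list-valued iothread-vq-mapping property requires the JSON form of -device; without an explicit "vqs" list, virtqueues are assigned round-robin across the listed IOThreads):

Code:
# Sketch: two IOThreads servicing the virtqueues of one virtio-blk disk.
# IDs (iothread0, drive0, ...) and the image path are illustrative only.
qemu-system-x86_64 ... \
  -object iothread,id=iothread0 \
  -object iothread,id=iothread1 \
  -blockdev driver=file,filename=/path/to/disk.img,node-name=file0 \
  -blockdev driver=raw,file=file0,node-name=drive0 \
  -device '{"driver":"virtio-blk-pci","drive":"drive0","num-queues":4,"iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}]}'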
 
Just a +1 here. We are using Xiraid and getting great host performance. However, we're limited to about 30k IOPS per guest disk for 4k random writes. Being able to configure the number of IOThreads per virtual hard disk would be amazing!
 
Mmh, I don't think there's such a bad limit: even on quite old hardware, fio 4k randwrite IOPS can go quite a bit higher, as seen in the attached picture ;)
 

Attachments

  • vm-fio-randw.png
Sorry, I'm referring to synchronous 4k random writes, the most difficult test.

Code:
job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.33
Starting 32 processes

job: (groupid=0, jobs=32): err= 0: pid=2396: Sat Aug 10 15:43:11 2024
  write: IOPS=29.7k, BW=116MiB/s (122MB/s)(408GiB/3600001msec); 0 zone resets
    clat (usec): min=24, max=8218, avg=1074.77, stdev=673.68
     lat (usec): min=24, max=8218, avg=1075.05, stdev=673.68
    clat percentiles (usec):
     |  1.00th=[  227],  5.00th=[  355], 10.00th=[  412], 20.00th=[  515],
     | 30.00th=[  619], 40.00th=[  734], 50.00th=[  873], 60.00th=[ 1074],
     | 70.00th=[ 1287], 80.00th=[ 1582], 90.00th=[ 2089], 95.00th=[ 2442],
     | 99.00th=[ 3130], 99.50th=[ 3359], 99.90th=[ 3884], 99.95th=[ 4146],
     | 99.99th=[ 4883]
   bw (  KiB/s): min=81208, max=269752, per=100.00%, avg=118913.33, stdev=494.06, samples=230367
   iops        : min=20302, max=67438, avg=29725.59, stdev=123.53, samples=230367
  lat (usec)   : 50=0.13%, 100=0.42%, 250=0.90%, 500=17.22%, 750=22.81%
  lat (usec)   : 1000=15.07%
  lat (msec)   : 2=32.11%, 4=11.27%, 10=0.07%
  cpu          : usr=0.49%, sys=1.16%, ctx=106989750, majf=0, minf=1172
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,106978320,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=408GiB (438GB), run=3600001-3600001msec

Disk stats (read/write):
  vdb: ios=56/131279695, merge=0/0, ticks=23/137896655, in_queue=137896678, util=100.00%

Code:
[global]
# block size in KiB, taken from the environment
blocksize=${BS}K
time_based=1
# synchronous read/write syscalls, effective queue depth 1
ioengine=psync
# O_SYNC + O_DIRECT: every write hits stable storage, bypassing the page cache
sync=1
direct=1
runtime=${RUNTIME}
# warm-up period excluded from the statistics
ramp_time=${RAMPTIME}
random_generator=tausworthe64
readwrite=randwrite
filename_format=${OUTPUT}
numjobs=${NUMJOBS}
# aggregate the statistics across all jobs
group_reporting=1
[job]
 
The values can all be read from the fio output above (except that the reported runtime there is ramp_time + runtime combined), but here they are explicitly:

Code:
BS=4 NUMJOBS=32 RUNTIME=300 RAMPTIME=60 OUTPUT=/dev/vdb
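
For completeness: fio expands the ${VAR} placeholders in the job file from the environment, so the whole run is a one-liner (the job-file name randwrite.fio is just an assumption):

Code:
# Hypothetical file name; fio substitutes ${BS}, ${NUMJOBS}, etc. from the environment
BS=4 NUMJOBS=32 RUNTIME=300 RAMPTIME=60 OUTPUT=/dev/vdb fio randwrite.fio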
 
That VM fio run kills the T430 host, but not as badly ...
 

Attachments

  • vm-fio-randw-32j.png
Yeah, that's exactly what I'm describing. You seem to be limited to about 20k IOPS.

Synchronous writes are seriously thread-limited. I can get ~30k IOPS on an Intel Xeon Gold 6530.

The changes in QEMU 9.0 are extremely helpful in this regard.
 
Tomorrow I'll make a new wishlist for a couple of Dell R770 CSP-Edition servers, as I now have a reason for that :)
 
Yes, but the hope is that the Hard Disk interface in Proxmox VE gets extended to support mapping multiple IOThreads to the queues, so VMs can take advantage of it.
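For context, the per-disk setting PVE exposes today is a single dedicated IOThread; a sketch of the current VM config syntax (storage and disk names are illustrative):

Code:
# /etc/pve/qemu-server/<vmid>.conf: one IOThread per disk is all that can
# be requested today; the QEMU 9.0 per-virtqueue mapping is not exposed yet.
virtio1: local-lvm:vm-100-disk-1,iothread=1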
 
Same run, but with the 2nd disk attached as virtio-blk instead of SCSI now ...
 

Attachments

  • vm-fio-randw-32j-virtioblk.png