VM storage latency

bung69

What is the best expected latency of storage within a VM?

From reading https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/ it looks like their un-optimised guest is getting just over 50us, yet no matter what I try I am getting around 200-300us with a raw image, virtio scsi single and an iothread, or slightly better with virtio-blk.
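
For reference, the disk setup I'm describing looks roughly like this in the VM config (the storage name, VM ID and disk name are just illustrative, not my exact layout):

Code:
# /etc/pve/qemu-server/103.conf (illustrative IDs and storage name)
scsihw: virtio-scsi-single
scsi0: local:103/vm-103-disk-0.raw,iothread=1,cache=none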

I have tried using both NVMe and Optane pmem for storage on the host; the latter gets just over 10us when tested on the host with ioping.

The host is a DL380 Gen10 with 2x 6230s.

Am I missing something, or is low latency VM storage just not possible on my hardware?

Thanks
 
Note to future self and anyone that's interested: 1 million IOPS in a VM at higher queue depths and thread counts is possible using iothread-vq-mapping and multiple IO threads.

guide here - https://blogs.oracle.com/linux/post/virtioblk-using-iothread-vq-mapping


Code:
args: -object iothread,id=iothread0 -object iothread,id=iothread1 -object iothread,id=iothread2 -object iothread,id=iothread3 -object iothread,id=iothread4 -object iothread,id=iothread5 -object iothread,id=iothread6 -object iothread,id=iothread7 -object iothread,id=iothread8 -object iothread,id=iothread9 -object iothread,id=iothread10 -object iothread,id=iothread11 -object iothread,id=iothread12 -object iothread,id=iothread13 -object iothread,id=iothread14 -object iothread,id=iothread15

-drive file=/mnt/pmem0fs/images/103/vm-103-disk-1.raw,if=none,id=drive-virtio1,aio=io_uring,format=raw,cache=none

--device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"},{"iothread":"iothread2"},{"iothread":"iothread3"},{"iothread":"iothread4"},{"iothread":"iothread5"},{"iothread":"iothread6"},{"iothread":"iothread7"},{"iothread":"iothread8"},{"iothread":"iothread9"},{"iothread":"iothread10"},{"iothread":"iothread11"},{"iothread":"iothread12"},{"iothread":"iothread13"},{"iothread":"iothread14"},{"iothread":"iothread15"}],"drive":"drive-virtio1","queue-size":1024,"config-wce":false}'

Code:
~150K iops, 600MB/s with 1 io thread, fedora VM
~1000K iops, 4200MB/s with 16 io threads and iothread-vq-mapping, fedora VM
~3300K iops, 12,800MB/s on proxmox host

fio bs=4k, iodepth=128, numjobs=30, 40 vCPUs
dl380 gen10 with 2x 6230 and 2x 256GB Optane DIMMs
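
For reference, a fio invocation along these lines matches the parameters above (the target device and read pattern are placeholders, adjust for your own test disk):

Code:
# 4k, QD128, 30 jobs against the virtio disk inside the VM (example device path)
fio --name=device --filename=/dev/vdb --rw=randread --direct=1 --ioengine=libaio \
    --bs=4k --iodepth=128 --numjobs=30 --group_reporting --time_based --runtime=60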
 
Hi @bung69

Please note that the system referenced in the low-latency article outperforms yours in several key areas, including CPU frequency, memory bandwidth, and memory latency.

Questions:

1. What does the virtual machine CPU/memory configuration look like?

2. Is the system idle?

3. Have you done any tuning to the base hardware (i.e., Performance vs. Power)?

Cheers,


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi @bbgeek17

1. For most of my testing, my VM had 10 vCPUs, CPU type host, and I tested both with and without affinity and NUMA.
When testing iothread-vq-mapping at high queue depths and job counts, I assigned 40 cores with affinity and 16 iothreads.

2. The system is otherwise idle.

3. All obvious BIOS settings are set for performance: C-states disabled, static high performance. However, I just noticed that Sub-NUMA Clustering is disabled.

Do you know if the latency numbers from the BB low-latency article are from ioping or fio? I have also noticed a large discrepancy between the latencies measured by the two programs on both my host and VMs, so I assume they must be measuring different things.

thanks
 
Hi @bung69,
We exclusively use fio for benchmarking, even on Windows. We have many years of experience with it, so we're confident in its accuracy.

Here’s a step-by-step guide to help with your setup:
  1. Ensure your CPUs are not using idle states:
    • Install the linux-cpupower package.
    • Run cpupower idle-info to check for available idle states.
    • If idle states are present, limit them to C1 using the tool.
  2. Reduce the number of VCPUs to 2 in your test instance.
  3. Verify that you’re using optimal I/O settings:
    • Ensure you're using aio=native, virtio-scsi-single, and an io_thread.
  4. Identify the NUMA node for your NVMe device:
    • Run: cat /sys/block/nvmeXn1/device/numa_node
  5. Assign CPU affinity via the GUI to pin the VM to the first 4 cores of the NUMA node where the NVMe device is attached.
    • Doing this via the GUI implicitly sets affinity on the IO thread.
  6. Use fio for testing inside the VM:
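For example, something along these lines (the device path is a placeholder for your test disk):

Code:
fio --name=device --filename=/dev/vda --rw=read --direct=1 --ioengine=libaio \
    --bs=4k --iodepth=1 --numjobs=1 --time_based --runtime=60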
Give this a try and let us know how it turns out.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Assign CPU affinity via the GUI to pin the VM to the first 4 cores of the NUMA node where the NVMe device is attached.
What is the purpose of this benchmark? Unless all you want to do is show hero numbers, this doesn't simulate real-world use, especially in circumstances where the storage is abstracted from the guest anyway (e.g., RAID).
 
What is the purpose of this benchmark? Unless all you want to do is show hero numbers, this doesn't simulate real-world use, especially in circumstances where the storage is abstracted from the guest anyway (e.g., RAID).
Hi @alexskysilk, our approach is to limit variables first to establish a baseline and get to the root cause faster.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
That makes sense for validating disks, not so much for the OP's original request, although taken literally, doing the benchmark as instructed would, in fact, provide best latency. @bung69 I suppose the real answer to your original question has everything to do with the system as a whole: you will get the best results with a single device mapped to a single VM whose CPUs are pinned to the same package the NVMe is connected to, but is that the way you intend to provide storage?
 
although taken literally, doing the benchmark as instructed would, in fact, provide best latency.
Right, OP asked about the best possible latency without reference to any particular workload. However, OP seems to be going to great lengths to understand his system and the overhead of virtualization. Not everyone is using persistent memory ;)

Once he has a "best case baseline," it's easier to understand the effect of adding other variables, such as load, multiple VMs, disks, etc. Simultaneously, the experiment could point towards an entirely different issue if the numbers don't prove out.

It's also worth pointing out to future readers that the approach we recommend here is for physical raw devices. The techniques would be more complicated for ZFS, RAID, CEPH, etc, since there aren't great ways to isolate the performance impacts of the various storage layers.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Yes, @bbgeek17, you have it right. I am trying to understand what the limiting factor is for low queue depth, small block storage performance, and why my benchmarks appeared much worse than your "proxmox-tuning-low-latency-storage" results even though I am using what is likely much faster locally attached storage. Pretty much just an "idiot check" to make sure I'm not doing anything wrong.

The end goal is only to run a couple of workstation VMs with decent NVMe-like or better storage performance, and I am trying to work with what I have and/or what is cheap. All other VMs and containers are lightweight and don't require anything special as far as storage performance goes.

It turns out the latencies reported by ioping are much higher than those from fio, almost 10x. I'm sure there is a good reason for this, but it tripped me up, and a quick search doesn't tell me what ioping is measuring differently from fio.

I initially started down this rabbit hole because 4k random R/W performance was terrible within a Win11 VM, at least much worse than in a Linux VM. It turns out Win11 uses BitLocker by default; turning it off doubled the 4k performance. It's still less than half that of a Linux VM, though it now matches Win10. I assume the virtio Windows drivers are just not as optimised as the Linux ones, or do you know of any tweaks for Windows VMs?
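
For anyone else checking their own Win11 guest, the BitLocker state and decryption can be handled with the built-in tool from an elevated command prompt (shown here only as a pointer):

Code:
rem show whether the OS volume is encrypted
manage-bde -status C:
rem start decryption if it is
manage-bde -off C: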

VirtIO Block with an IO thread consistently performed slightly better than VirtIO SCSI single with an IO thread, regardless of other changes, in both Windows and Linux VMs.

Disabling C6 with "cpupower idle-set -d 3" seemed to also limit the maximum boost frequency of the pinned VM cores and IO cores to ~2800MHz; re-enabling C6 allowed the VM's cores to boost to 3800MHz, slightly improving performance. However, this will likely be irrelevant in real-world conditions when the server is not idle.
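
In case it helps anyone reproducing this, the C-state index isn't fixed across systems, so check it first; roughly:

Code:
cpupower idle-info       # list the available idle states (C6 was index 3 on my system)
cpupower idle-set -d 3   # disable C6
cpupower idle-set -e 3   # re-enable C6 to get full boost clocks back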

Code:
Host 4k                                                                             2.8us 334k iops
ubuntu VM virtio scsi single                                                        41us 24k iops
ubuntu VM virtio scsi single + io thread + affinity                                 27us 35k iops
ubuntu VM virtio block                                                              31us 31k iops
ubuntu VM virtio block + io thread + affinity                                       16us 58k iops
ubuntu VM virtio block + io thread + affinity + haltpoll                            12us 78k iops
windows10 VM virtio block                                                           37us 26k iops
windows10 VM virtio scsi single                                                     42us 23k iops
windows10 VM virtio block disabled hyperthreading                                   26us 38k iops
windows11 default VM virtio block disabled hyperthreading                           51us 19k iops
ubuntu VM virtio block + io thread + affinity + haltpoll + disabled hyperthreading  9.3us 103k iops

device: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=1309MiB/s][r=335k IOPS][eta 00m:00s]
device: (groupid=0, jobs=1): err= 0: pid=325768: Sat Jan 4 23:06:42 2025
read: IOPS=334k, BW=1304MiB/s (1367MB/s)(76.4GiB/60001msec)
slat (nsec): min=1815, max=75020, avg=2286.40, stdev=278.31
clat (nsec): min=435, max=75061, avg=471.22, stdev=109.86
lat (nsec): min=2294, max=77305, avg=2757.62, stdev=301.33

cpu : usr=20.17%, sys=79.82%, ctx=284, majf=1, minf=37

Run status group 0 (all jobs):
READ: bw=1304MiB/s (1367MB/s), 1304MiB/s-1304MiB/s (1367MB/s-1367MB/s), io=76.4GiB (82.1GB), run=60001-60001msec

Disk stats (read/write):
pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

device: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=85.8MiB/s][r=22.0k IOPS][eta 00m:00s]
device: (groupid=0, jobs=1): err= 0: pid=2592: Sat Jan 4 23:55:06 2025
read: IOPS=24.1k, BW=94.3MiB/s (98.8MB/s)(5655MiB/60001msec)
slat (usec): min=4, max=1391, avg= 6.59, stdev= 2.17
clat (nsec): min=1009, max=3482.9k, avg=34186.92, stdev=20457.36
lat (usec): min=22, max=3488, avg=40.78, stdev=20.67

cpu : usr=6.34%, sys=29.91%, ctx=1447782, majf=0, minf=36

Run status group 0 (all jobs):
READ: bw=94.3MiB/s (98.8MB/s), 94.3MiB/s-94.3MiB/s (98.8MB/s-98.8MB/s), io=5655MiB (5930MB), run=60001-60001msec

Disk stats (read/write):
sdb: ios=2178331/0, sectors=17426648/0, merge=0/0, ticks=71586/0, in_queue=71586, util=65.64%

device: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=139MiB/s][r=35.6k IOPS][eta 00m:00s]
device: (groupid=0, jobs=1): err= 0: pid=3142: Sat Jan 4 22:49:58 2025
read: IOPS=35.5k, BW=139MiB/s (145MB/s)(8324MiB/60001msec)
slat (nsec): min=4562, max=110646, avg=5841.38, stdev=802.26
clat (nsec): min=889, max=2492.6k, avg=21643.42, stdev=11927.76
lat (usec): min=17, max=2498, avg=27.48, stdev=11.93

cpu : usr=10.11%, sys=39.67%, ctx=2129541, majf=0, minf=36

Run status group 0 (all jobs):
READ: bw=139MiB/s (145MB/s), 139MiB/s-139MiB/s (145MB/s-145MB/s), io=8324MiB (8729MB), run=60001-60001msec

Disk stats (read/write):
sdb: ios=3197327/0, sectors=25578616/0, merge=0/0, ticks=63896/0, in_queue=63896, util=48.72%

device: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=120MiB/s][r=30.8k IOPS][eta 00m:00s]
device: (groupid=0, jobs=1): err= 0: pid=2568: Sat Jan 4 23:49:15 2025
read: IOPS=31.3k, BW=122MiB/s (128MB/s)(7339MiB/60001msec)
slat (usec): min=3, max=5402, avg= 4.67, stdev= 7.35
clat (nsec): min=824, max=5745.7k, avg=26612.64, stdev=32743.38
lat (usec): min=16, max=5750, avg=31.28, stdev=33.57

cpu : usr=8.58%, sys=31.94%, ctx=1878910, majf=0, minf=36

Run status group 0 (all jobs):
READ: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s), io=7339MiB (7696MB), run=60001-60001msec

Disk stats (read/write):
vda: ios=2814964/0, sectors=22519712/0, merge=0/0, ticks=65280/0, in_queue=65280, util=49.91%

device: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=232MiB/s][r=59.4k IOPS][eta 00m:00s]
device: (groupid=0, jobs=1): err= 0: pid=2627: Sat Jan 4 23:00:59 2025
read: IOPS=58.7k, BW=229MiB/s (241MB/s)(13.4GiB/60001msec)
slat (nsec): min=1744, max=3400.8k, avg=2519.23, stdev=6119.87
clat (nsec): min=734, max=5536.2k, avg=13970.67, stdev=14927.97
lat (usec): min=10, max=5539, avg=16.49, stdev=16.13

cpu : usr=15.66%, sys=43.16%, ctx=3520201, majf=0, minf=36

Run status group 0 (all jobs):
READ: bw=229MiB/s (241MB/s), 229MiB/s-229MiB/s (241MB/s-241MB/s), io=13.4GiB (14.4GB), run=60001-60001msec

Disk stats (read/write):
vda: ios=5255629/0, sectors=42045032/0, merge=0/0, ticks=50205/0, in_queue=50205, util=40.03%

device: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=304MiB/s][r=77.9k IOPS][eta 00m:00s]
device: (groupid=0, jobs=1): err= 0: pid=2973: Sun Jan 5 00:54:41 2025
read: IOPS=78.0k, BW=305MiB/s (319MB/s)(17.8GiB/60001msec)
slat (nsec): min=1765, max=392630, avg=2193.85, stdev=440.78
clat (nsec): min=693, max=2471.4k, avg=10153.30, stdev=8049.12
lat (usec): min=9, max=2473, avg=12.35, stdev= 8.06

cpu : usr=15.75%, sys=41.36%, ctx=4673156, majf=0, minf=36

Run status group 0 (all jobs):
READ: bw=305MiB/s (319MB/s), 305MiB/s-305MiB/s (319MB/s-319MB/s), io=17.8GiB (19.2GB), run=60001-60001msec

Disk stats (read/write):
vda: ios=6999846/0, sectors=55998776/0, merge=0/0, ticks=55334/0, in_queue=55334, util=46.54%

qd1-4k: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=1
fio-3.38
Starting 1 thread
qd1-4k: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [R(1)][100.0%][r=109MiB/s][r=27.8k IOPS][eta 00m:00s]
qd1-4k: (groupid=0, jobs=1): err= 0: pid=4132: Sun Jan 5 20:50:29 2025
read: IOPS=25.7k, BW=100MiB/s (105MB/s)(6017MiB/60001msec)
slat (usec): min=7, max=40042, avg=13.92, stdev=63.88
clat (nsec): min=200, max=55435k, avg=23837.39, stdev=148040.94
lat (usec): min=21, max=55446, avg=37.76, stdev=163.12

cpu : usr=8.33%, sys=31.67%, ctx=0, majf=0, minf=0

Run status group 0 (all jobs):
READ: bw=100MiB/s (105MB/s), 100MiB/s-100MiB/s (105MB/s-105MB/s), io=6017MiB (6309MB), run=60001-60001msec

qd1-4k: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=1
fio-3.38
Starting 1 thread
Jobs: 1 (f=0): [f(1)][100.0%][r=86.9MiB/s][r=22.3k IOPS][eta 00m:00s]
qd1-4k: (groupid=0, jobs=1): err= 0: pid=6452: Sun Jan 5 20:55:14 2025
read: IOPS=23.2k, BW=90.7MiB/s (95.1MB/s)(5440MiB/60001msec)
slat (usec): min=10, max=1007, avg=15.17, stdev= 4.51
clat (nsec): min=200, max=29099k, avg=26836.70, stdev=27621.72
lat (usec): min=28, max=29112, avg=42.01, stdev=27.80

cpu : usr=5.00%, sys=46.67%, ctx=0, majf=0, minf=0

Run status group 0 (all jobs):
READ: bw=90.7MiB/s (95.1MB/s), 90.7MiB/s-90.7MiB/s (95.1MB/s-95.1MB/s), io=5440MiB (5704MB), run=60001-60001msec

qd1-4k: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=1
fio-3.38
Starting 1 thread
Jobs: 1 (f=1): [R(1)][100.0%][r=147MiB/s][r=37.8k IOPS][eta 00m:00s]
qd1-4k: (groupid=0, jobs=1): err= 0: pid=992: Mon Jan 6 00:46:31 2025
read: IOPS=37.8k, BW=148MiB/s (155MB/s)(8862MiB/60001msec)
slat (usec): min=5, max=504, avg= 8.87, stdev= 4.19
clat (nsec): min=200, max=11098k, avg=16804.85, stdev=10178.54
lat (usec): min=15, max=11109, avg=25.68, stdev=10.95

cpu : usr=8.33%, sys=36.67%, ctx=0, majf=0, minf=0

Run status group 0 (all jobs):
READ: bw=148MiB/s (155MB/s), 148MiB/s-148MiB/s (155MB/s-155MB/s), io=8862MiB (9293MB), run=60001-60001msec

qd1-latency: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=1
fio-3.38
Starting 1 thread
Jobs: 1 (f=0): [f(1)][100.0%][r=60.2MiB/s][r=15.4k IOPS][eta 00m:00s]
qd1-latency: (groupid=0, jobs=1): err= 0: pid=7212: Mon Jan 6 01:05:17 2025
read: IOPS=19.1k, BW=74.6MiB/s (78.3MB/s)(4478MiB/60001msec)
slat (usec): min=7, max=30947, avg=13.52, stdev=137.19
clat (usec): min=3, max=56323, avg=37.96, stdev=248.02
lat (usec): min=23, max=56332, avg=51.49, stdev=284.07

cpu : usr=3.33%, sys=23.33%, ctx=0, majf=0, minf=0

Run status group 0 (all jobs):
READ: bw=74.6MiB/s (78.3MB/s), 74.6MiB/s-74.6MiB/s (78.3MB/s-78.3MB/s), io=4478MiB (4696MB), run=60001-60001msec

device: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=416MiB/s][r=106k IOPS][eta 00m:00s]
device: (groupid=0, jobs=1): err= 0: pid=4710: Sun Jan 12 13:37:39 2025
read: IOPS=103k, BW=404MiB/s (423MB/s)(23.6GiB/60001msec)
slat (nsec): min=1325, max=2960.4k, avg=1685.67, stdev=4025.75
clat (nsec): min=524, max=3278.0k, avg=7631.91, stdev=14462.84
lat (usec): min=7, max=3280, avg= 9.32, stdev=15.02

cpu : usr=15.70%, sys=41.77%, ctx=6194093, majf=0, minf=15

Run status group 0 (all jobs):
READ: bw=404MiB/s (423MB/s), 404MiB/s-404MiB/s (423MB/s-423MB/s), io=23.6GiB (25.4GB), run=60001-60001msec

Disk stats (read/write):
vda: ios=6188219/0, sectors=49505752/0, merge=0/0, ticks=36637/0, in_queue=36637, util=45.17%
 
Hi @bung69,

I'm glad to see that you're achieving reasonably comparable results with resource isolation. Based on your data, the challenge will always lie in managing idle states and NUMA. However, note that you can tune idle states on specific CPUs, and when combined with manual CPU affinity settings applied to the guest, this flexibility can help you achieve the desired outcome for VMs that need low latency.

For what it's worth, a handful of our customers use dynamic tuning scripts to optimize performance, so exploring this approach certainly isn't misguided. However, it does add complexity when dealing with HA.

Something else worth noting: for PCIe NVMe devices, completion events are delivered on the same queue pair that handles the request. Typically, this is the same CPU (depending on your core count and NVMe queue pair capabilities). You may find that pinning the iothread to a single CPU gives you more consistent latency (but possibly less IOPS potential). That said, I'm not familiar enough with the pmem driver architecture to advise on how it compares to NVMe. If it uses "bottom halves" for async I/O completion, you might be out of luck in terms of consistent sub-10 microsecond latency.
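
If you want to experiment with pinning the iothread, here is a rough, untested sketch (the pidfile path and thread naming are assumptions based on a typical Proxmox/QEMU setup, so verify on your system):

Code:
pid=$(cat /var/run/qemu-server/103.pid)         # QEMU PID for the VM (VM ID 103 as an example)
ps -L -o lwp,comm -p "$pid" | grep -i iothread  # locate the iothread's TID by thread name
taskset -cp 4 <tid>                             # pin that TID to a single core (core 4 here)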

Regarding ioping, I took a quick look through the source code out of curiosity. Essentially, ioping isn't designed for repeatable results on fast storage. While it does support libaio and O_DIRECT, it lacks the ability to control CPU affinity, which also makes achieving consistent results challenging. But it can be a handy tool for a sanity check.

You are correct that virtio-blk can exhibit lower latency. In our internal data, I can see a 1-2 microsecond advantage for virtio-blk on Linux. However, I seem to recall past issues with virtio-blk, and the overhead of virtio-blk in Windows tends to be higher than that on Linux. You may find these efficiency scores (measured on Windows) useful (but not necessarily applicable to Linux):

https://kb.blockbridge.com/technote...easured-values-rankings-and-efficiency-scores

Thanks for sharing your data! Good luck, and happy tweaking!


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox