IOPS issues on VMs with local storage (LVM/ZFS)

dominiaz

Hey,

Testing procedure:

fio --ioengine=libaio --direct=1 --sync=1 --rw=read --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_read --filename=/dev/nvme0n1

Host1 on NVMe:
read: IOPS=40.4k, BW=158MiB/s (165MB/s)(9461MiB/60001msec)

VM on local NVMe with LVM:
scsi0: local-nvme:vm-1111-disk-0,discard=on,iothread=1,size=128G,ssd=1
read: IOPS=5972, BW=23.3MiB/s (24.5MB/s)(1400MiB/60001msec)

--

VM on Host2 with a DELL SAN connected to the host over SCSI (RAID of 6x 15K HDDs):

read: IOPS=30.0k, BW=117MiB/s (123MB/s)(7032MiB/60001msec)

Why is local storage on Host1 with NVMe so terribly slow - only 5972 IOPS? I have tried this testing procedure on many hosts and the results are pretty much the same.

Do we always have to use SAN storage when we want good disk speeds in a VM?
 
Please provide the vendor and model of the NVMe.

Also please run benchmarks for at least 10 seconds so that any effect caches might have on the result is minimized.
 
All my benchmarks run for 60s.

The NVMe is a Samsung SM961.

I ran the same tests with an HP Enterprise SSD (model MO001920JWFWU) in RAID1:

Host:
read: IOPS=11.7k, BW=45.6MiB/s (47.8MB/s)(2735MiB/60001msec)

VM:
read: IOPS=5850, BW=22.9MiB/s (23.0MB/s)(1371MiB/60001msec)
 
Write cache results in 60s tests are pretty much the same.

Proxmox, do you have any explanation for my storage speed issues?
 
Sorry, I meant 10 minutes -> 600 seconds.
 
Hi @dominiaz ,

Let's take your information and reverse engineer the problem.

Bare-Metal NVMe:

According to your benchmark, you achieved 40,400 IOPS on the local NVMe device with a QD1 read workload. This means your measured access latency was approximately 25 microseconds. Given that NAND access latencies are ~100us, how is this possible? The answer is that either the device is initialized to zero (i.e., you are reading thin zeroes) or it is internally prefetching (which is expected behavior). If the device is prefetching, you are effectively reading the DRAM cache on the NVMe controller. As such, 25 microseconds represents the overhead of the benchmark, NVMe driver, PCI, and NVMe controller path.
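
For reference, the implied QD1 access latency is just the reciprocal of the measured IOPS; applying that to the three results above:

Code:
# At QD1 with a single job, average access latency ~= 1 / IOPS
echo "40400 5972 30000" | awk '{ for (i = 1; i <= NF; i++) printf "%6d IOPS -> %6.1f us\n", $i, 1e6 / $i }'
#  40400 IOPS ->   24.8 us   (bare-metal NVMe)
#   5972 IOPS ->  167.4 us   (VM on local NVMe with LVM)
#  30000 IOPS ->   33.3 us   (VM on the DELL SAN)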

VM on LVM/ZFS/NVMe:

According to your benchmark, you achieved 5972 IOPS, which translates to roughly 167 microseconds of latency at QD1. We know the underlying access latency is minimally 25 microseconds. Depending on ZFS block allocation, you might experience slightly higher access latencies if the workload is no longer sequential (because randomization defeats prefetch). Pushing the complexity of block allocation aside, let's assume that you see an incremental latency of roughly 142 microseconds.

In our experience using the standard virtio-scsi-single controller with an IO thread and aio=native, the latency insertion of QEMU/Virtio on the fastest machines available is 17 microseconds for NVMe/TCP connected storage. CPU type, memory architecture, and system activity are contributing factors that can increase overhead. If you have an active server with lots of VMs running, insertion latency is heavily affected by context switch latencies and VMENTER/VMEXIT. It's safe to assume that the remaining latency (i.e., ~125 microseconds) must be from ZFS, LVM, or ambient load on your system.
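
Putting those pieces together as a rough budget (approximate figures, derived from the assumptions above rather than measured independently):

Code:
# Observed QD1 latency inside the VM:       ~167 us   (1 / 5972 IOPS)
#  - device + driver + PCIe path:           ~ 25 us   (from the bare-metal run)
#  - QEMU/virtio-scsi-single + IO thread:   ~ 17 us   (best case, idle host)
#  = unaccounted for:                       ~125 us   (ZFS/LVM stack, context switches, ambient load)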

Questions:
  • Is this a dual-processor system?
  • Is the system completely idle when running your test?
  • Are you using dedupe or compression?
  • What virtual controller are you using to present storage?
Recommendation:
  • install the sysstat tools (apt-get install sysstat) and measure the latency of ZFS on the host using iostat while the test runs (see the sketch after this list)
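
For example, something along these lines on the host while fio runs inside the VM (the device name is illustrative; substitute your NVMe device or zvol):

Code:
# Install the sysstat package on the Proxmox host
apt-get install sysstat
# Extended per-device statistics, refreshed every second:
#   r_await / w_await = average read/write completion latency (ms)
#   %util             = device utilization
iostat -x 1 /dev/nvme0n1
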
VM on ??? on SAN

According to your benchmark, you see an access latency of 33 microseconds INSIDE the VM. Our experience suggests that you are hitting a host-side cache, populated by prefetch or simply resident in the page cache. Performance at this level for raw device access is on the cutting edge of what's technically possible. I would be concerned that your comparison is not apples-to-apples.

Questions:
  • How is the SAN storage presented to the guest?
  • Is this a file stored on an LVM partition attached to the host via the SAN?
  • Is fio being executed on a raw device or a file inside of the VM?

https://kb.blockbridge.com/technote/proxmox-aio-vs-iouring/

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi @bbgeek17,

I found your previous post very helpful, but I was wondering if you could go into a bit more detail about expected access latencies, as I still don't fully understand the performance of some of the newer high-end consumer NVMe SSDs.

I'm using some WD SN850X NVMe SSDs in a brand new Proxmox install.

The system is:
- EPYC 7302P (NPS4)
- 128GB (8x 16GB) Micron DDR4 3200MHz MTA18ASF2G72PZ-3G2
- 2TB WD SN850X WDS200T2X0E

I created an Ubuntu 22.04 VM using all the default Proxmox settings (VirtIO SCSI, no cache, 32GB disk on local-lvm).
I'm giving the VMs 4 CPUs with NUMA enabled and 16GB of memory.
On the node I have a 100GB local directory and a 1.84TB LVM-Thin storage pool (created by default during the Proxmox install).
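
For reference, the relevant part of the VM config looks roughly like this (VM ID and volume name are illustrative, not copied from my actual config):

Code:
# /etc/pve/qemu-server/<vmid>.conf (excerpt, illustrative)
cores: 4
memory: 16384
numa: 1
# plain VirtIO SCSI controller; newer installs may default to virtio-scsi-single
scsihw: virtio-scsi-pci
scsi0: local-lvm:vm-100-disk-0,size=32G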

Here are some of the random read results I get on the node vs. the VMs:
Code:
# fio --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][5.0%][r=694MiB/s][r=178k IOPS][eta 00m:57s]
Jobs: 4 (f=4): [r(4)][8.3%][r=692MiB/s][r=177k IOPS][eta 00m:55s]
...
Jobs: 4 (f=4): [r(4)][98.3%][r=700MiB/s][r=179k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=700MiB/s][r=179k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=364222: Sun Feb 26 13:12:46 2023
  read: IOPS=176k, BW=686MiB/s (719MB/s)(40.2GiB/60001msec)
    slat (nsec): min=1710, max=618179, avg=2141.29, stdev=585.44
    clat (nsec): min=410, max=4006.5k, avg=20081.54, stdev=22565.37
     lat (usec): min=10, max=4008, avg=22.28, stdev=22.57
    clat percentiles (usec):
     |  1.00th=[   10],  5.00th=[   10], 10.00th=[   10], 20.00th=[   10],
     | 30.00th=[   11], 40.00th=[   11], 50.00th=[   11], 60.00th=[   11],
     | 70.00th=[   12], 80.00th=[   14], 90.00th=[   50], 95.00th=[   75],
     | 99.00th=[   91], 99.50th=[  105], 99.90th=[  128], 99.95th=[  133],
     | 99.99th=[  155]
   bw (  KiB/s): min=629849, max=727752, per=100.00%, avg=702313.32, stdev=6651.89, samples=476
   iops        : min=157462, max=181938, avg=175578.33, stdev=1662.95, samples=476
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=22.73%, 20=57.78%, 50=9.74%
  lat (usec)   : 100=9.09%, 250=0.67%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=7.08%, sys=15.73%, ctx=10532380, majf=0, minf=87
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10532389,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=686MiB/s (719MB/s), 686MiB/s-686MiB/s (719MB/s-719MB/s), io=40.2GiB (43.1GB), run=60001-60001msec

Disk stats (read/write):
  nvme0n1: ios=10510995/112, merge=0/65, ticks=205585/24, in_queue=205614, util=99.83%

Code:
# sudo fio --filename=/dev/sda --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.28
Starting 4 processes
Jobs: 4 (f=4): [r(4)][5.0%][r=195MiB/s][r=49.9k IOPS][eta 00m:57s]
Jobs: 4 (f=4): [r(4)][8.3%][r=184MiB/s][r=47.0k IOPS][eta 00m:55s]
...
Jobs: 4 (f=4): [r(4)][98.3%][r=186MiB/s][r=47.5k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=195MiB/s][r=50.0k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=2985: Sun Feb 26 21:15:47 2023
  read: IOPS=45.9k, BW=179MiB/s (188MB/s)(10.5GiB/60001msec)
    slat (usec): min=2, max=327, avg= 7.40, stdev= 2.24
    clat (nsec): min=610, max=4098.3k, avg=78464.25, stdev=21237.14
     lat (usec): min=28, max=4105, avg=86.00, stdev=21.65
    clat percentiles (usec):
     |  1.00th=[   36],  5.00th=[   43], 10.00th=[   51], 20.00th=[   73],
     | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
     | 70.00th=[   86], 80.00th=[   89], 90.00th=[   94], 95.00th=[  101],
     | 99.00th=[  116], 99.50th=[  121], 99.90th=[  133], 99.95th=[  141],
     | 99.99th=[  461]
   bw (  KiB/s): min=164560, max=202856, per=99.96%, avg=183602.89, stdev=2282.18, samples=476
   iops        : min=41140, max=50714, avg=45900.72, stdev=570.54, samples=476
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=9.59%, 100=85.13%
  lat (usec)   : 250=5.26%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=5.11%, sys=12.73%, ctx=2755368, majf=0, minf=54
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2755271,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=179MiB/s (188MB/s), 179MiB/s-179MiB/s (188MB/s-188MB/s), io=10.5GiB (11.3GB), run=60001-60001msec

Disk stats (read/write):
  sda: ios=2749994/22, merge=0/5, ticks=216111/5, in_queue=216117, util=99.87%

Code:
# fio --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][5.0%][r=1861MiB/s][r=476k IOPS][eta 00m:57s]
Jobs: 4 (f=4): [r(4)][8.3%][r=1862MiB/s][r=477k IOPS][eta 00m:55s]
...
Jobs: 4 (f=4): [r(4)][98.3%][r=1862MiB/s][r=477k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=1862MiB/s][r=477k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=366473: Sun Feb 26 13:17:51 2023
  read: IOPS=476k, BW=1860MiB/s (1950MB/s)(109GiB/60003msec)
    slat (nsec): min=1540, max=1622.7k, avg=1842.87, stdev=768.78
    clat (usec): min=531, max=4445, avg=2148.47, stdev=73.05
     lat (usec): min=536, max=4448, avg=2150.37, stdev=73.01
    clat percentiles (usec):
     |  1.00th=[ 2024],  5.00th=[ 2040], 10.00th=[ 2057], 20.00th=[ 2073],
     | 30.00th=[ 2147], 40.00th=[ 2147], 50.00th=[ 2180], 60.00th=[ 2180],
     | 70.00th=[ 2180], 80.00th=[ 2180], 90.00th=[ 2180], 95.00th=[ 2212],
     | 99.00th=[ 2311], 99.50th=[ 2343], 99.90th=[ 2540], 99.95th=[ 2573],
     | 99.99th=[ 3064]
   bw (  MiB/s): min= 1841, max= 1875, per=100.00%, avg=1861.15, stdev= 0.97, samples=476
   iops        : min=471527, max=480082, avg=476453.17, stdev=249.07, samples=476
  lat (usec)   : 750=0.01%, 1000=0.05%
  lat (msec)   : 2=0.19%, 4=99.75%, 10=0.01%
  cpu          : usr=13.12%, sys=26.76%, ctx=1500205, majf=0, minf=16532
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=28564017,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=1860MiB/s (1950MB/s), 1860MiB/s-1860MiB/s (1950MB/s-1950MB/s), io=109GiB (117GB), run=60003-60003msec

Disk stats (read/write):
  nvme0n1: ios=28505480/218, merge=0/75, ticks=60332021/125, in_queue=60332153, util=99.88%

Code:
# sudo fio --filename=/dev/sda --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.28
Starting 4 processes
Jobs: 4 (f=4): [r(4)][5.0%][r=693MiB/s][r=177k IOPS][eta 00m:57s]
Jobs: 4 (f=4): [r(4)][8.3%][r=686MiB/s][r=176k IOPS][eta 00m:55s]
...
Jobs: 4 (f=4): [r(4)][98.3%][r=554MiB/s][r=142k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=530MiB/s][r=136k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=2997: Sun Feb 26 21:19:27 2023
  read: IOPS=161k, BW=628MiB/s (658MB/s)(36.8GiB/60007msec)
    slat (nsec): min=950, max=34306k, avg=16811.54, stdev=130737.72
    clat (usec): min=98, max=565968, avg=6353.42, stdev=3942.80
     lat (usec): min=105, max=565971, avg=6370.33, stdev=3946.97
    clat percentiles (usec):
     |  1.00th=[ 1057],  5.00th=[ 1991], 10.00th=[ 2900], 20.00th=[ 4178],
     | 30.00th=[ 5014], 40.00th=[ 5669], 50.00th=[ 6194], 60.00th=[ 6783],
     | 70.00th=[ 7439], 80.00th=[ 8291], 90.00th=[ 9634], 95.00th=[10814],
     | 99.00th=[13566], 99.50th=[15008], 99.90th=[23725], 99.95th=[34866],
     | 99.99th=[93848]
   bw (  KiB/s): min=449587, max=763160, per=100.00%, avg=644341.49, stdev=16232.70, samples=476
   iops        : min=112396, max=190790, avg=161085.33, stdev=4058.21, samples=476
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.03%, 750=0.24%, 1000=0.55%
  lat (msec)   : 2=4.22%, 4=13.38%, 10=73.58%, 20=7.85%, 50=0.12%
  lat (msec)   : 100=0.02%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu          : usr=2.78%, sys=9.71%, ctx=2274329, majf=0, minf=1078
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=9644325,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=628MiB/s (658MB/s), 628MiB/s-628MiB/s (658MB/s-658MB/s), io=36.8GiB (39.5GB), run=60007-60007msec

Disk stats (read/write):
  sda: ios=9624354/52, merge=17199/57, ticks=49116299/163, in_queue=49116481, util=99.93%


There aren't any workloads running on this node besides my tests, and the results are consistent across repeated runs both directly on the node and inside the VM.
These are random read tests run directly against the device (not a file).

There are a few things I'm wondering about:
- Is this just the expected overhead of running inside a VM on LVM? It feels like a lot of overhead (~4x worse performance at both low and high QD).
- Based on your comment about NAND access latencies, it seems like this level of QD1 IOPS shouldn't be possible. What am I measuring when running these QD1 tests and getting 100k+ IOPS? Is there a better way to run low-QD tests that avoids prefetching?

Edit:
For anyone else reading: I originally thought that drives like the Kioxia CD6 would get ~1M read IOPS even at low queue depths, but that's not actually the case.

There's a good review from TweakTown showing that these enterprise drives max out at ~15k random read IOPS at QD1 and only start to reach their peak IOPS at around QD128 and above.

It would be nice to know a better way to get at the QD1 performance with fio.

Thanks!
 
Hi @jordangarside,

Those are good questions and great information. Many things may be holding you back. I highly recommend that you read through the link below. It talks specifically about QD1 latency in the context of NVMe/TCP. Several lessons should apply to NVMe/PCI and will unlock some performance for you. At a minimum, it will help you understand some of the dynamics at play.

https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/

A few casual observations:
- Your host QD1/jobs=4 results suggest that you are not reading from NAND: you must not have preconditioned the entire drive before testing. These latencies make sense for a "thin read". This is OK; just keep it in mind if you are reading from the root disk of a VM that has blocks backed by NAND (a precondition sketch follows this list).
- Max IOPS on the host seems low relative to spec. Check the PCIe configuration/link status of the device. Verify the number of lanes and negotiated link speed (lspci -vvv).
- IOPS in the VM is low. With tuning, you should be able to do much better. Check out that link above for ideas.
- Remember that the 7302P has eight core complexes with two cores per complex. Therefore, your 4 VCPU machines span core complexes, using the infinity fabric for cross-core synchronization. This is non-optimal, but you have no alternative with this CPU. You may find better results with fewer VCPUs.
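
A minimal precondition sketch, assuming you want QD1 reads to actually land on NAND rather than on unmapped/thin blocks (note: this writes the entire device and destroys all data on it):

Code:
# One full sequential write pass over the raw device so every LBA is backed by NAND.
# WARNING: destroys all data on /dev/nvme0n1.
fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=1M --iodepth=32 \
    --ioengine=libaio --direct=1
# Then re-run the QD1 randread test and compare latencies.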

Lastly, you did not specify whether you are using an I/O thread. If not, you should; enabling it looks roughly like the sketch below.
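
This is only illustrative (the volume name is made up), but the relevant VM config changes are the controller type and the iothread flag on the disk:

Code:
# Use the single-queue SCSI controller so each disk gets a dedicated I/O thread,
# then enable iothread on the disk itself.
scsihw: virtio-scsi-single
scsi0: local-lvm:vm-100-disk-0,size=32G,iothread=1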


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
