Very low random read (and write) throughput on NVMe ZFS mirror

Dec 30, 2023
Hello all,

I'm using a single-node Proxmox server with ZFS in a mirror:
Code:
  pool: rpool
 state: ONLINE
config:

    NAME                                 STATE     READ WRITE CKSUM
    rpool                                ONLINE       0     0     0
      mirror-0                           ONLINE       0     0     0
        nvme-eui.002538bc31a32330-part3  ONLINE       0     0     0
        nvme-eui.002538bc31a32315-part3  ONLINE       0     0     0

With the following drives:
Code:
SAMSUNG MZVL22T0HBLB-00B00

On a host with an AMD EPYC 7272 12-Core Processor and 256 GB of RAM.

I am trying to verify disk performance before installing customer VMs on it.

I first ran fio in a Windows VM with 1M sequential reads, and the results are acceptable:
Code:
C:\Users\Administrator>fio --name=mytest --filename=\\.\PhysicalDrive0 --rw=read --bs=1M --ioengine=windowsaio --direct=1 --time_based --runtime=30 --group_reporting --iodepth=16 --thread=1
mytest: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=windowsaio, iodepth=16
fio-3.38
Starting 1 thread
Jobs: 1 (f=0): [f(1)][100.0%][r=5496MiB/s][r=5495 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=7188: Fri Dec 13 21:08:32 2024
  read: IOPS=5382, BW=5383MiB/s (5644MB/s)(158GiB/30003msec)
    slat (nsec): min=900, max=1747.7k, avg=61804.56, stdev=35938.80
    clat (usec): min=776, max=12086, avg=2874.11, stdev=418.40
     lat (usec): min=822, max=12172, avg=2935.91, stdev=416.70
    clat percentiles (usec):
     |  1.00th=[ 2114],  5.00th=[ 2311], 10.00th=[ 2409], 20.00th=[ 2573],
     | 30.00th=[ 2737], 40.00th=[ 2835], 50.00th=[ 2900], 60.00th=[ 2966],
     | 70.00th=[ 2999], 80.00th=[ 3064], 90.00th=[ 3195], 95.00th=[ 3326],
     | 99.00th=[ 3884], 99.50th=[ 5014], 99.90th=[ 7635], 99.95th=[ 8586],
     | 99.99th=[10028]
   bw (  MiB/s): min= 4959, max= 6282, per=100.00%, avg=5386.79, stdev=337.54, samples=59
   iops        : min= 4959, max= 6282, avg=5386.42, stdev=337.55, samples=59
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.52%, 4=98.62%, 10=0.84%, 20=0.01%
  cpu          : usr=0.00%, sys=36.66%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=12.0%, 16=87.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=161494,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=5383MiB/s (5644MB/s), 4096MiB/s-5383MiB/s (4295MB/s-5644MB/s), io=158GiB (169GB), run=30003-30003msec

But as soon as I test with 4K random reads, it gets much worse (same command as above, only with --rw=randread and --bs=4K):
Code:
C:\Users\Administrator>fio --name=mytest --filename=\\.\PhysicalDrive0 --rw=randread --bs=4K --ioengine=windowsaio --direct=1 --time_based --runtime=30 --group_reporting --iodepth=16 --thread=1

I then tried benchmarking the host directly (using the /dev/zd* device that ZFS created for the zvol; maybe that's not the right approach?) with the same parameters (only changing the device and the ioengine, obviously). The results are not that great either, but they are still better:
Code:
fio --name=mytest --filename=/dev/zd64 --rw=randread --bs=4K --ioengine=libaio --direct=1 --time_based --runtime=30 --group_reporting --iodepth=16 --thread=1
mytest: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
fio-3.33
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=474MiB/s][r=121k IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=1067235: Sat Dec 14 06:18:36 2024
  read: IOPS=54.2k, BW=212MiB/s (222MB/s)(6351MiB/30001msec)
    slat (usec): min=2, max=138, avg= 6.32, stdev= 1.39
    clat (usec): min=13, max=8556, avg=126.55, stdev=68.26
     lat (usec): min=17, max=8611, avg=132.86, stdev=68.33
    clat percentiles (usec):
     |  1.00th=[   50],  5.00th=[   66], 10.00th=[   75], 20.00th=[   85],
     | 30.00th=[   93], 40.00th=[  100], 50.00th=[  108], 60.00th=[  115],
     | 70.00th=[  126], 80.00th=[  147], 90.00th=[  243], 95.00th=[  269],
     | 99.00th=[  310], 99.50th=[  326], 99.90th=[  408], 99.95th=[  562],
     | 99.99th=[  947]
   bw (  KiB/s): min=59872, max=501528, per=100.00%, avg=464471.11, stdev=81291.09, samples=27
   iops        : min=14970, max=125382, avg=116117.85, stdev=20322.39, samples=27
  lat (usec)   : 20=0.01%, 50=1.06%, 100=39.72%, 250=50.51%, 500=8.65%
  lat (usec)   : 750=0.04%, 1000=0.02%
  lat (msec)   : 2=0.01%, 10=0.01%
  cpu          : usr=6.88%, sys=91.77%, ctx=35080, majf=7, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1625957,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=212MiB/s (222MB/s), 212MiB/s-212MiB/s (222MB/s-222MB/s), io=6351MiB (6660MB), run=30001-30001msec

Disk stats (read/write):
  zd64: ios=1612016/128, merge=0/0, ticks=144074/234, in_queue=144308, util=44.91%
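
As an aside, to double-check which zvol /dev/zd64 actually is, I listed the symlinks ZFS creates under /dev/zvol (rpool/data is the default Proxmox dataset path; adjust if yours differs):
Code:
# each zvol gets a symlink here pointing at its /dev/zd* node
ls -l /dev/zvol/rpool/data/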

For the record, here is the same test but with a mirror member directly (bypassing ZFS altogether):
Code:
fio --name=mytest --filename=/dev/nvme0n1 --rw=randread --bs=4K --ioengine=libaio --direct=1 --time_based --runtime=30 --group_reporting --iodepth=16 --thread=1
mytest: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
fio-3.33
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=546MiB/s][r=140k IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=1075286: Sat Dec 14 06:48:24 2024
  read: IOPS=137k, BW=536MiB/s (562MB/s)(15.7GiB/30001msec)
    slat (usec): min=3, max=195, avg= 5.37, stdev= 2.73
    clat (usec): min=12, max=8323, avg=110.16, stdev=31.05
     lat (usec): min=15, max=8465, avg=115.53, stdev=31.17
    clat percentiles (usec):
     |  1.00th=[   77],  5.00th=[   84], 10.00th=[   88], 20.00th=[   93],
     | 30.00th=[   96], 40.00th=[   98], 50.00th=[  101], 60.00th=[  104],
     | 70.00th=[  109], 80.00th=[  129], 90.00th=[  157], 95.00th=[  167],
     | 99.00th=[  182], 99.50th=[  190], 99.90th=[  210], 99.95th=[  221],
     | 99.99th=[  302]
   bw (  KiB/s): min=532136, max=561104, per=100.00%, avg=549006.78, stdev=10284.35, samples=59
   iops        : min=133034, max=140276, avg=137251.69, stdev=2571.08, samples=59
  lat (usec)   : 20=0.01%, 50=0.01%, 100=46.25%, 250=53.73%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 10=0.01%
  cpu          : usr=25.20%, sys=74.71%, ctx=394, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=4117503,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=536MiB/s (562MB/s), 536MiB/s-536MiB/s (562MB/s-562MB/s), io=15.7GiB (16.9GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=4098031/882, merge=0/0, ticks=99420/54, in_queue=99511, util=98.83%

Compression is enabled on the zpool.
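For reference, the pool-wide settings can be checked with:
Code:
# shows the compression algorithm and the achieved ratio for the pool
zfs get compression,compressratio rpool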

I am not that experienced with ZFS, so if you have any pointers for me, that would be awesome!

Thanks a lot in advance,

Best,

BTW: there is also a noticeable difference between the raw device and the ZFS device in the 1M sequential read test:

Code:
Raw device:
   READ: bw=6767MiB/s (7096MB/s), 6767MiB/s-6767MiB/s (7096MB/s-7096MB/s), io=198GiB (213GB), run=30002-30002msec
ZFS device:
   READ: bw=4394MiB/s (4608MB/s), 4394MiB/s-4394MiB/s (4608MB/s-4608MB/s), io=129GiB (138GB), run=30006-30006msec
 
Zvols are known to be slow. What volblocksize are you using? The default is 16k (or 8k on Proxmox?), so a random 4K read has to read more than 4K (depending on the ashift).
ZFS can also have high write amplification (depending on volblocksize and ashift) and requires synchronous writes for its metadata (which your enterprise drives should be able to handle).
ZFS is not the fastest filesystem for VMs, but it does come with a lot of features. Maybe test some real workloads from your VMs instead of synthetic benchmarks?
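
For example, a mixed random read/write job is usually closer to what VMs actually generate than pure 4K random reads. This is only a sketch; adjust the block size, queue depth and read/write mix to your applications:
Code:
# CAUTION: raw writes to a zvol destroy its contents, so point this at a scratch zvol
fio --name=vm-mix --filename=/dev/zd64 --rw=randrw --rwmixread=70 --bs=16k --ioengine=libaio --direct=1 --time_based --runtime=60 --group_reporting --iodepth=16 --thread=1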
 
OK, I'll test. For the record: yes, my volblocksize is 16k and my ashift is 12 (which I think is the default for Proxmox ZFS installs).
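
For reference, I checked both like this (vm-100-disk-0 stands in for my actual disk name):
Code:
zpool get ashift rpool
zfs get volblocksize rpool/data/vm-100-disk-0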

The VM does not "feel" slow, but that's no guarantee that my workload will perform well.

Thanks!
 
Hmm, something must not be right: on our old server (which has non-NVMe SSD drives) we see the following values:
Code:
mytest: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=16
fio-3.38
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=232MiB/s][r=59.4k IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=1804: Sun Dec 15 04:55:47 2024
  read: IOPS=56.8k, BW=222MiB/s (233MB/s)(6659MiB/30001msec)
    slat (usec): min=2, max=589, avg=10.93, stdev=11.07
    clat (nsec): min=400, max=12725k, avg=245092.05, stdev=154146.81
     lat (usec): min=42, max=12728, avg=256.02, stdev=154.25
    clat percentiles (usec):
     |  1.00th=[   68],  5.00th=[   85], 10.00th=[   96], 20.00th=[  117],
     | 30.00th=[  149], 40.00th=[  245], 50.00th=[  265], 60.00th=[  281],
     | 70.00th=[  297], 80.00th=[  318], 90.00th=[  351], 95.00th=[  388],
     | 99.00th=[  676], 99.50th=[  840], 99.90th=[ 1893], 99.95th=[ 3032],
     | 99.99th=[ 3884]
   bw (  KiB/s): min=158059, max=255097, per=99.59%, avg=226338.52, stdev=21876.29, samples=60
   iops        : min=39514, max=63774, avg=56584.38, stdev=5469.11, samples=60
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.09%
  lat (usec)   : 100=11.89%, 250=30.18%, 500=56.20%, 750=0.87%, 1000=0.49%
  lat (msec)   : 2=0.17%, 4=0.09%, 10=0.01%, 20=0.01%
  cpu          : usr=10.00%, sys=46.67%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.9%, 8=68.0%, 16=31.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.4%, 8=4.9%, 16=0.7%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1704656,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=222MiB/s (233MB/s), 222MiB/s-222MiB/s (233MB/s-233MB/s), io=6659MiB (6982MB), run=30001-30001msec

Benchmarking the underlying zd ZFS device directly on the old server gives the following:
Code:
root@compute:~# fio --name=mytest --filename=/dev/zd304 --rw=randread --bs=4K --ioengine=libaio --direct=1 --time_based --runtime=30 --group_reporting --iodepth=16 --thread=1
mytest: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
fio-3.33
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=389MiB/s][r=99.5k IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=1096175: Sun Dec 15 05:06:45 2024
  read: IOPS=92.2k, BW=360MiB/s (378MB/s)(10.6GiB/30001msec)
    slat (usec): min=2, max=129, avg= 5.05, stdev= 1.38
    clat (nsec): min=1150, max=11758k, avg=167916.68, stdev=130045.61
     lat (usec): min=10, max=11766, avg=172.97, stdev=130.06
    clat percentiles (usec):
     |  1.00th=[   24],  5.00th=[   31], 10.00th=[   36], 20.00th=[   44],
     | 30.00th=[   52], 40.00th=[   82], 50.00th=[  210], 60.00th=[  225],
     | 70.00th=[  237], 80.00th=[  253], 90.00th=[  281], 95.00th=[  314],
     | 99.00th=[  396], 99.50th=[  523], 99.90th=[ 1074], 99.95th=[ 1745],
     | 99.99th=[ 3392]
   bw (  KiB/s): min=258016, max=401344, per=99.95%, avg=368597.15, stdev=35015.98, samples=59
   iops        : min=64506, max=100336, avg=92149.36, stdev=8753.89, samples=59
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.24%, 50=28.04%
  lat (usec)   : 100=12.11%, 250=38.57%, 500=20.50%, 750=0.31%, 1000=0.12%
  lat (msec)   : 2=0.08%, 4=0.03%, 10=0.01%, 20=0.01%
  cpu          : usr=11.01%, sys=61.07%, ctx=685204, majf=7, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2765968,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=360MiB/s (378MB/s), 360MiB/s-360MiB/s (378MB/s-378MB/s), io=10.6GiB (11.3GB), run=30001-30001msec

Disk stats (read/write):
  zd304: ios=2751066/58, merge=0/0, ticks=417587/6, in_queue=417593, util=99.72%

We also see a performance drop between the zd device and the device in the VM, but it is much, much smaller.

I'll start investigating the disk configuration in the VM.
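
My plan is to dump the virtual disk settings of both VMs and compare them (100 is a placeholder VMID):
Code:
# prints the disk lines (bus type, cache mode, discard, iothread, ...)
qm config 100 | grep -E '^(scsi|virtio|sata|ide)'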
 
Sorry for the spam, but I just discovered that, while our VM disk configuration was the same, the volblocksize of the performant disk on the old server was 8k instead of 16k.

I'll try to change that setting and create a new VM to test that out.

EDIT: With an 8k volblocksize, I get the same performance as (even a little better than) our old server. I guess the default switched from 8k to 16k between Proxmox 7 and 8. Should I keep the default 16k and test with my workload? What was the intention behind setting the default to 16k? What is the advantage of a 16k volblocksize?
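
Note for anyone trying the same: volblocksize can only be set when a zvol is created, so existing disks have to be recreated or moved. The default for newly created disks can be changed per ZFS storage; local-zfs is just my storage name, and the GUI storage settings have a matching Block Size field:
Code:
# newly created disks on this ZFS storage will use an 8k volblocksize
pvesm set local-zfs --blocksize 8k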
 
EDIT: With an 8k volblocksize, I get the same performance as (even a little better than) our old server. I guess the default switched from 8k to 16k between Proxmox 7 and 8. Should I keep the default 16k and test with my workload? What was the intention behind setting the default to 16k?
If the blocksize of the filesystem inside the VM is 4K, you might want to use 4K to reduce write amplification. On the other hand, compression works better (or at all) with larger blocksizes.
Or maybe let the operating system inside the VM know that the optimal I/O size is 16K (and the "physical sector" size is 4K)? Maybe try different combinations and see what works best for your VM (and its I/O workload).
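
For example, on Windows the guest filesystem's cluster size can be checked from inside the VM; look for "Bytes Per Cluster" in the output:
Code:
fsutil fsinfo ntfsinfo C:
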
What is the advantage of a 16k volblocksize?
All of this is not Proxmox-specific; have a look at the OpenZFS project, for example: https://github.com/openzfs/zfs/issues/14771

EDIT: I also found this: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html#zvol-volblocksize
 