Poor ZFS SSD I/O benchmark: RAID-Z1 with 4 x SSD performs similarly to RAID10 with 12 x HDD

Only increasing zfs_dirty_data_max (4294967296 -> 10737418240 -> 21474836480 -> 42949672960) compensates for the performance penalty, but the background writeback stays just as slow: ~10k IOPS per NVMe device:
Bash:
# fio --time_based --name=benchmark --size=15G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=871MiB/s][w=223k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=13035: Thu Nov 25 18:19:06 2021
  write: IOPS=166k, BW=650MiB/s (682MB/s)(19.0GiB/30001msec); 0 zone resets


# iostat  -x 1 | awk '{print $1"\t"$8"\t"$9}'
Device    w/s    wkB/s
loop0    1.00    4.00
loop1    1.00    4.00
nvme0n1    0.00    0.00
nvme1n1    7963.00    398872.00
nvme2n1    6197.00    393752.00
nvme3n1    8052.00    403096.00
nvme4n1    7933.00    398872.00
nvme5n1    0.00    0.00
nvme6n1    0.00    0.00

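For reference, the zfs_dirty_data_max steps above are plain GiB values expressed in bytes; a small sketch to convert them. The parameter can also be changed at runtime through /sys/module/zfs/parameters, so no reboot is needed while experimenting:

```shell
# Byte values for the GiB steps tried above (zfs_dirty_data_max takes bytes).
for gib in 4 10 20 40; do
  echo "${gib} GiB = $((gib * 1024 * 1024 * 1024)) bytes"
done

# Runtime change, root required, takes effect immediately:
#   echo 42949672960 > /sys/module/zfs/parameters/zfs_dirty_data_max
```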
We can increase IOPS by increasing:
zfs_vdev_async_write_min_active
zfs_vdev_async_write_max_active
Bash:
Device    w/s    wkB/s
loop0    0.00    0.00
loop1    0.00    0.00
nvme0n1    0.00    0.00
nvme1n1    48071.00    1595316.00
nvme2n1    47334.00    1496244.00
nvme3n1    48044.00    1595120.00
nvme4n1    47676.00    1549908.00
nvme5n1    0.00    0.00
nvme6n1    0.00    0.00

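The min/max_active change can be tried at runtime before committing it to modprobe.d. A hedged sketch, assuming an OpenZFS module with these parameters exposed (the OpenZFS defaults are 2 and 10; the values here mirror the ones I ended up with in /etc/modprobe.d/zfs.conf):

```shell
# Raise the async write queue depth per vdev at runtime (root required).
echo 1024 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
echo 2048 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
```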

Meanwhile, a single NVMe device has a raw speed of ~700k IOPS:
Bash:
# fio --time_based --name=benchmark --size=15G --runtime=30 --filename=/dev/nvme6n1 --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=2761MiB/s][w=707k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=3828468: Thu Nov 25 21:30:06 2021
  write: IOPS=706k, BW=2758MiB/s (2892MB/s)(80.8GiB/30001msec); 0 zone resets
    slat (nsec): min=1300, max=258439, avg=2391.93, stdev=1159.86
    clat (usec): min=314, max=2934, avg=722.11, stdev=111.84
     lat (usec): min=316, max=2936, avg=724.57, stdev=111.82
    clat percentiles (usec):
     |  1.00th=[  502],  5.00th=[  553], 10.00th=[  586], 20.00th=[  627],
     | 30.00th=[  660], 40.00th=[  685], 50.00th=[  717], 60.00th=[  750],
     | 70.00th=[  783], 80.00th=[  816], 90.00th=[  865], 95.00th=[  906],
     | 99.00th=[  988], 99.50th=[ 1020], 99.90th=[ 1106], 99.95th=[ 1205],
     | 99.99th=[ 1942]
   bw (  MiB/s): min= 2721, max= 2801, per=100.00%, avg=2760.62, stdev= 3.03, samples=236
   iops        : min=696742, max=717228, avg=706717.86, stdev=775.87, samples=236
  lat (usec)   : 500=0.97%, 750=59.81%, 1000=38.47%
  lat (msec)   : 2=0.74%, 4=0.01%
  cpu          : usr=24.41%, sys=44.96%, ctx=5787827, majf=0, minf=70
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,21184205,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=2758MiB/s (2892MB/s), 2758MiB/s-2758MiB/s (2892MB/s-2892MB/s), io=80.8GiB (86.8GB), run=30001-30001msec

Disk stats (read/write):
  nvme6n1: ios=50/21098501, merge=0/0, ticks=4/15168112, in_queue=15168116, util=99.77%
 
Also, for comparison, XFS (direct) I/O:
Bash:
# mount | grep xfs | grep test
/dev/nvme6n1p1 on /mnt/test1 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

# fio --time_based --name=benchmark --size=15G --runtime=30 --filename=/mnt/test1/test.file --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=1061MiB/s][w=272k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=238227: Thu Nov 25 22:50:47 2021
  write: IOPS=184k, BW=718MiB/s (753MB/s)(21.0GiB/30001msec); 0 zone resets
    slat (usec): min=2, max=31969, avg=20.64, stdev=53.80
    clat (usec): min=13, max=34843, avg=2762.10, stdev=773.87
     lat (usec): min=17, max=34860, avg=2782.87, stdev=777.77
    clat percentiles (usec):
     |  1.00th=[ 1696],  5.00th=[ 1860], 10.00th=[ 1975], 20.00th=[ 2180],
     | 30.00th=[ 2409], 40.00th=[ 2540], 50.00th=[ 2671], 60.00th=[ 2769],
     | 70.00th=[ 2933], 80.00th=[ 3195], 90.00th=[ 3654], 95.00th=[ 4047],
     | 99.00th=[ 5080], 99.50th=[ 5604], 99.90th=[ 7111], 99.95th=[ 9503],
     | 99.99th=[11994]
   bw (  KiB/s): min=491203, max=1082880, per=99.23%, avg=730021.88, stdev=37469.97, samples=236
   iops        : min=122800, max=270720, avg=182505.56, stdev=9367.51, samples=236
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=11.44%, 4=83.09%, 10=5.42%, 20=0.02%, 50=0.01%
  cpu          : usr=5.69%, sys=30.71%, ctx=4469577, majf=0, minf=69
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,5517963,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=718MiB/s (753MB/s), 718MiB/s-718MiB/s (753MB/s-753MB/s), io=21.0GiB (22.6GB), run=30001-30001msec

Disk stats (read/write):
  nvme6n1: ios=0/5486013, merge=0/0, ticks=0/66265, in_queue=66266, util=99.67%

Code:
Device    w/s    wkB/s    %util
nvme6n1    82109.00    328436.00    66.40
nvme6n1    129252.00    517008.00    100.00
nvme6n1    133798.00    535192.00    100.00
nvme6n1    140417.00    561672.00    100.00
nvme6n1    142180.00    596544.00    100.00
nvme6n1    142952.00    571808.00    100.00
nvme6n1    151052.00    604208.00    100.00
nvme6n1    155050.00    620200.00    100.00
nvme6n1    159833.00    639332.00    100.00
nvme6n1    162689.00    650756.00    100.00
nvme6n1    161429.00    645716.00    100.00
nvme6n1    164999.00    659996.00    100.00
nvme6n1    167785.00    671140.00    100.00
nvme6n1    171748.00    686992.00    100.00
nvme6n1    174931.00    699724.00    100.00
nvme6n1    175983.00    703932.00    100.00
nvme6n1    180748.00    722992.00    100.00
nvme6n1    184010.00    736040.00    100.00
nvme6n1    187432.00    749728.00    100.00
nvme6n1    191138.00    764552.00    100.00
nvme6n1    194453.00    777816.00    100.00
nvme6n1    198893.00    795568.00    100.00
nvme6n1    203407.00    813632.00    100.00
nvme6n1    210010.00    840036.00    100.00
nvme6n1    218265.00    873060.00    100.00
nvme6n1    232786.00    931144.00    100.00
nvme6n1    232171.00    947824.00    98.40
nvme6n1    248974.00    995900.00    100.00
nvme6n1    257395.00    1029576.00    100.00
nvme6n1    269008.00    1076032.00    100.00
nvme6n1    94948.00    379792.00    34.00


As we can see, ZFS uses bigger blocks than XFS when writing:
Bash:
# ZFS with increased zfs_vdev_async_write_min_active and zfs_vdev_async_write_max_active:
Device    w/s    wkB/s
nvme1n1    48071.00    1595316.00
nvme2n1    47334.00    1496244.00
nvme3n1    48044.00    1595120.00
nvme4n1    47676.00    1549908.00
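The average write size per request is simply wkB/s divided by w/s; an awk sketch over sample rows taken from the iostat output above (tuned ZFS vs. the XFS run):

```shell
# Columns: device, w/s, wkB/s -> average KiB per write request.
awk '{ printf "%s\t%.1f KiB/write\n", $1, $3 / $2 }' <<'EOF'
nvme1n1 48071 1595316
nvme6n1 82109 328436
EOF
```

ZFS is issuing ~33 KiB writes (aggregated records), while XFS with direct I/O passes the 4 KiB blocks straight through.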
 
I tuned ZFS and got the following performance:

FIO param size=2G
Bash:
# fio --time_based --name=benchmark --size=2G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=788MiB/s][w=202k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=3296: Fri Nov 26 00:50:31 2021
  write: IOPS=203k, BW=792MiB/s (830MB/s)(23.2GiB/30001msec); 0 zone resets
    slat (usec): min=4, max=8091, avg=18.98, stdev= 7.63
    clat (usec): min=2, max=17115, avg=2506.81, stdev=301.22
     lat (usec): min=17, max=17131, avg=2525.86, stdev=303.45
    clat percentiles (usec):
     |  1.00th=[ 1795],  5.00th=[ 1860], 10.00th=[ 2409], 20.00th=[ 2442],
     | 30.00th=[ 2474], 40.00th=[ 2474], 50.00th=[ 2507], 60.00th=[ 2540],
     | 70.00th=[ 2573], 80.00th=[ 2606], 90.00th=[ 2671], 95.00th=[ 2737],
     | 99.00th=[ 2966], 99.50th=[ 3884], 99.90th=[ 5997], 99.95th=[ 6849],
     | 99.99th=[ 9241]
   bw (  KiB/s): min=678448, max=1125032, per=100.00%, avg=810869.19, stdev=16297.54, samples=236
   iops        : min=169612, max=281258, avg=202717.27, stdev=4074.39, samples=236
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=6.57%, 4=92.96%, 10=0.47%, 20=0.01%
  cpu          : usr=4.09%, sys=95.84%, ctx=396, majf=0, minf=803
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,6079580,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=792MiB/s (830MB/s), 792MiB/s-792MiB/s (830MB/s-830MB/s), io=23.2GiB (24.9GB), run=30001-30001msec

FIO param size=15G
Bash:
# fio --time_based --name=benchmark --size=15G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=452MiB/s][w=116k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=48426: Fri Nov 26 00:39:19 2021
  write: IOPS=100k, BW=392MiB/s (411MB/s)(11.5GiB/30001msec); 0 zone resets
    slat (usec): min=5, max=30693, avg=38.34, stdev=99.68
    clat (usec): min=4, max=47906, avg=5063.25, stdev=2041.14
     lat (usec): min=21, max=47928, avg=5101.72, stdev=2053.79
    clat percentiles (usec):
     |  1.00th=[ 2671],  5.00th=[ 3097], 10.00th=[ 3392], 20.00th=[ 3687],
     | 30.00th=[ 3949], 40.00th=[ 4228], 50.00th=[ 4555], 60.00th=[ 4948],
     | 70.00th=[ 5538], 80.00th=[ 6259], 90.00th=[ 7177], 95.00th=[ 7832],
     | 99.00th=[12780], 99.50th=[15664], 99.90th=[24511], 99.95th=[28443],
     | 99.99th=[36439]
   bw (  KiB/s): min=241038, max=611712, per=99.59%, avg=399671.85, stdev=21682.16, samples=236
   iops        : min=60259, max=152928, avg=99917.25, stdev=5420.56, samples=236
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=32.55%, 10=65.33%, 20=1.91%, 50=0.19%
  cpu          : usr=4.28%, sys=89.98%, ctx=17752, majf=0, minf=2663
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,3009925,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=392MiB/s (411MB/s), 392MiB/s-392MiB/s (411MB/s-411MB/s), io=11.5GiB (12.3GB), run=30001-30001msec

This is already usable, but still far below the expected I/O rates for RAID-Z1 on 4 x NVMe drives.
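One knob the module options below do not cover is the per-dataset record size: with the default recordsize=128k, 4k random writes to an existing file can cause read-modify-write amplification. A hedged sketch, with a placeholder dataset name (the best value depends on the real workload, not just the benchmark):

```shell
# "tank/fio-test" is a placeholder -- substitute your own pool/dataset.
zfs set recordsize=16k tank/fio-test      # closer to the 4k benchmark block size
zfs set atime=off tank/fio-test           # skip access-time updates
zfs set logbias=throughput tank/fio-test  # favor throughput over sync latency
```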

Bash:
# egrep -v '^#|^$' /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=137400000000 #128G for ARC
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max=42949672960
options zfs zfs_dirty_data_max_percent=30
options zfs zfs_vdev_queue_depth_pct=100
options zfs zfs_vdev_async_write_min_active=1024
options zfs zfs_vdev_async_write_max_active=2048
options zfs zfs_vdev_async_read_min_active=1024
options zfs zfs_vdev_async_read_max_active=2048
options zfs zfs_vdev_sync_write_min_active=1024
options zfs zfs_vdev_sync_write_max_active=2048
options zfs zfs_vdev_sync_read_min_active=1024
options zfs zfs_vdev_sync_read_max_active=2048
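After a module reload (or reboot) it is worth confirming the values were actually picked up, since typos in modprobe.d fail silently; a quick check:

```shell
# Print the live values of the tuned parameters.
grep -H . /sys/module/zfs/parameters/zfs_dirty_data_max \
          /sys/module/zfs/parameters/zfs_vdev_async_write_min_active \
          /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
```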
 
