Poor ZFS SSD IO benchmark: RAID-Z1 4 x SSD similar to RAID 10 12 x HDD

swibwob

Hi there!

I have two PVE 7.0 hosts on ZFS: one with 12 x 4TB 7.2K SAS HDDs in ZFS RAID 10, the other with 4 x 4TB SATA SSDs in RAID-Z1, and they're coming out with near-identical IO performance, which is suspicious! From benchmarking with fio on sequential reads/writes with caches and buffers disabled, it looks like the SSDs are seriously underperforming. They are enterprise SATA drives (Intel S4520). The performance kind of makes sense for the HDDs, but not for the SSDs!

On both machines:
Code:
zfs create rpool/fio
zfs set primarycache=none rpool/fio
fio --ioengine=sync --direct=1 --gtod_reduce=1 --name=test --filename=/rpool/fio/test --bs=4k --iodepth=1 --size=4G --readwrite=readwrite --rwmixread=50

SSD results:
Code:
test: (groupid=0, jobs=1): err= 0: pid=3778230: Thu Nov 18 17:14:32 2021
  read: IOPS=1469, BW=5878KiB/s (6019kB/s)(1468MiB/255757msec)
   bw (  KiB/s): min= 2328, max=66600, per=100.00%, avg=5882.09, stdev=8835.21, samples=511
   iops        : min=  582, max=16650, avg=1470.51, stdev=2208.83, samples=511
  write: IOPS=1467, BW=5868KiB/s (6009kB/s)(1466MiB/255757msec); 0 zone resets
   bw (  KiB/s): min= 2184, max=66840, per=100.00%, avg=5872.33, stdev=8840.91, samples=511
   iops        : min=  546, max=16710, avg=1468.07, stdev=2210.23, samples=511
  cpu          : usr=0.83%, sys=13.86%, ctx=270574, majf=0, minf=53
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=375824,375217,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=5878KiB/s (6019kB/s), 5878KiB/s-5878KiB/s (6019kB/s-6019kB/s), io=1468MiB (1539MB), run=255757-255757msec
  WRITE: bw=5868KiB/s (6009kB/s), 5868KiB/s-5868KiB/s (6009kB/s-6009kB/s), io=1466MiB (1537MB), run=255757-255757msec

HDD results:
Code:
test: (groupid=0, jobs=1): err= 0: pid=3762085: Thu Nov 18 17:07:35 2021
  read: IOPS=1449, BW=5797KiB/s (5936kB/s)(363MiB/64101msec)
   bw (  KiB/s): min= 4040, max= 9960, per=100.00%, avg=5800.88, stdev=1603.29, samples=128
   iops        : min= 1010, max= 2490, avg=1450.22, stdev=400.82, samples=128
  write: IOPS=483, BW=1933KiB/s (1980kB/s)(121MiB/64101msec); 0 zone resets
   bw (  KiB/s): min= 1328, max= 3248, per=100.00%, avg=1934.87, stdev=536.80, samples=128
   iops        : min=  332, max=  812, avg=483.72, stdev=134.20, samples=128
  cpu          : usr=0.63%, sys=16.79%, ctx=156895, majf=0, minf=7
  IO depths    : 1=103.1%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=92896,30979,0,3815 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=5797KiB/s (5936kB/s), 5797KiB/s-5797KiB/s (5936kB/s-5936kB/s), io=363MiB (381MB), run=64101-64101msec
  WRITE: bw=1933KiB/s (1980kB/s), 1933KiB/s-1933KiB/s (1980kB/s-1980kB/s), io=121MiB (127MB), run=64101-64101msec

I believe the important numbers are:
SSD:
read: IOPS=1469
write: IOPS=1467

HDD:
read: IOPS=1449
write: IOPS=483


E.g. I get ~10k IOPS on both read and write on my laptop with the same fio test!

Any advice appreciated.

Simon

PS I've just seen:
https://forum.proxmox.com/threads/bad-zfs-performance-with-sas3416-hba.96260/
which says to add log and cache devices... is the above really expected without separate SLOG and L2ARC devices?!

PPS I added log and cache with no change!
 
In theory you should get 6x the IOPS of a single HDD with that 12-disk striped mirror. But with a 4-disk raidz1 you should only get about 1x the performance of a single SSD.

How did you do the fio test? If you are doing small sync writes I wouldn't be surprised if both pools perform similarly. But with async writes your SSD pool should indeed be way faster.
 
Hi Dunuin, thanks for your reply! The fio command is in the OP. I'd expect the IOPS of one SSD to be on the order of 10k for sequential reads/writes, as it is on my laptop with an SSD, but here it's nowhere near that...
 
No, you are doing unparallelized 4K sync writes. In that case SSDs aren't really fast. A consumer SSD should only be around 2-3 times faster than a HDD, and an enterprise TLC SSD like yours maybe 100 times faster than a HDD. But then you are benchmarking at file level, not at block level on the SSD itself without the ZFS overhead. For every 4K write ZFS does it will also write multiple metadata blocks, so you lose a lot of IOPS there too. So I wouldn't be surprised if your SSDs only manage something like 1000-3000 IOPS.
To see really high IOPS you need to run multiple async fio jobs in parallel.
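For illustration, such a parallel async run could look roughly like this (a sketch only; the filename, job count and queue depth are placeholders to adjust):
Code:
# several async jobs in parallel against the test dataset (illustrative values)
fio --name=par-async --filename=/rpool/fio/par-async --ioengine=libaio --direct=1 --rw=randrw --rwmixread=50 --bs=4k --iodepth=32 --numjobs=8 --size=4G --runtime=60 --time_based --group_reporting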
 
OK, that makes sense, thank you Dunuin. However, I'm sure something is wrong. Compare the following:

Running 4 tests, 2 on each machine, the only difference being whether the target is the raw device or a file on the non-cached ZFS pool (/dev/sdg vs /rpool/fio/testx).

fio --ioengine=sync --filename=[X] --direct=1 --sync=1 --rw=read --bs=128K --numjobs=1 --iodepth=1 --runtime=10 --time_based --name=fio


--filename   /dev/sdg           /rpool/fio/testx
HDD          read: IOPS=1773    read: IOPS=907
SSD          read: IOPS=21.3k   read: IOPS=553

The /dev/sdg column makes sense and the HDD row makes sense, but the bottom-right result makes no sense to me... What do you think?
 
All fio read benchmarks using ZFS are basically useless unless you temporarily disable ARC caching of data (zfs set primarycache=metadata YourPool). If you don't disable caching you will only benchmark your RAM and not your drives.
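A minimal sketch of that, using the rpool/fio dataset from the first post (restore the default afterwards):
Code:
zfs set primarycache=metadata rpool/fio   # cache metadata only while benchmarking reads
# ... run the fio read test ...
zfs set primarycache=all rpool/fio        # restore the default when done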
 
OK, so a potentially 'fairer' test of the SSDs:

64 x random read/write:

Code:
fio --ioengine=libaio --filename=/rpool/fio/testx --size=4G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randrw  --bs=4K --numjobs=64

Also, I've moved to zpool iostat for IO monitoring.

SSD results:
Code:
FIO output:
read: IOPS=4022, BW=15.7MiB/s (16.5MB/s)
write: IOPS=4042, BW=15.8MiB/s (16.6MB/s)


# zpool iostat -vy rpool 5 1
                                                        capacity     operations     bandwidth
pool                                                  alloc   free   read  write   read  write
----------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                  216G  27.7T  28.1K  14.5K  1.17G   706M
  raidz1                                               195G  13.8T  13.9K  7.26K   595M   358M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730BAV3P8EGN-part3      -      -  3.60K  1.73K   159M  90.3M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730B9Q3P8EGN-part3      -      -  3.65K  1.82K   150M  89.0M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730B9G3P8EGN-part3      -      -  3.35K  1.83K   147M  90.0M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730BAT3P8EGN-part3      -      -  3.34K  1.89K   139M  88.4M
  raidz1                                              21.3G  13.9T  14.2K  7.21K   604M   348M
    sde                                                   -      -  3.39K  1.81K   149M  87.5M
    sdf                                                   -      -  3.35K  1.90K   139M  86.3M
    sdg                                                   -      -  3.71K  1.70K   163M  87.8M
    sdh                                                   -      -  3.69K  1.81K   152M  86.4M
----------------------------------------------------  -----  -----  -----  -----  -----  -----

HDD results:
Code:
FIO output:
read: IOPS=1382, BW=5531KiB/s
write: IOPS=1385, BW=5542KiB/s

$ zpool iostat -vy rpool 5 1
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                              160G  18.0T  3.07K  2.71K   393M   228M
  mirror                          32.2G  3.59T    624    589  78.0M  40.2M
    scsi-35000c500de5c67f7-part3      -      -    321    295  40.1M  20.4M
    scsi-35000c500de75a863-part3      -      -    303    293  37.9M  19.7M
  mirror                          31.9G  3.59T    625    551  78.2M  49.9M
    scsi-35000c500de2bd6bb-part3      -      -    313    274  39.1M  24.2M
    scsi-35000c500de5ae5a7-part3      -      -    312    277  39.0M  25.7M
  mirror                          32.2G  3.59T    648    548  81.1M  45.9M
    scsi-35000c500de5ae667-part3      -      -    320    279  40.1M  23.0M
    scsi-35000c500de2bd2d3-part3      -      -    328    268  41.0M  23.0M
  mirror                          31.6G  3.59T    612    536  76.5M  45.5M
    scsi-35000c500de5ef20f-part3      -      -    301    266  37.7M  22.7M
    scsi-35000c500de5edbfb-part3      -      -    310    269  38.9M  22.8M
  mirror                          32.0G  3.59T    629    555  78.7M  46.5M
    scsi-35000c500de5c6f7f-part3      -      -    318    283  39.8M  23.1M
    scsi-35000c500de5c6c5f-part3      -      -    311    272  38.9M  23.4M
--------------------------------  -----  -----  -----  -----  -----  -----

I'd have thought the SSDs should be doing about 10x more IOPS than the above - are my expectations out of whack? Especially compared to the raw device performance shown above (https://forum.proxmox.com/threads/p...imilar-to-raid-z10-12-x-hdd.99967/post-431629)

NB I get read: IOPS=21.3k from fio on a single SSD when targeting the device directly...

Thanks!
 
A few things to consider: the runtime of the test. 10 seconds is not long and the results will most likely be skewed by caches. That's why we run benchmarks for 10 minutes, --runtime 600.

Disable the ZFS ARC for the actual data, see ZFS dataset option "primarycache" in Appendix 6 of the ZFS benchmark paper.

numjobs is quite large. Try to set it to one first and see how the results differ.
Also, do you want to benchmark filesystem datasets or volume datasets? The latter are used for VM disks and they do have different performance characteristics.
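If it helps, a rough sketch of the two kinds of benchmark targets (names and sizes here are just examples, not recommendations):
Code:
# filesystem dataset, as used for container volumes
zfs create rpool/fio
# volume dataset (zvol), as used for VM disks; volblocksize is fixed at creation time
zfs create -V 10G -o volblocksize=16k rpool/fio-zvol
# the zvol then appears as a block device you can point fio at
ls -l /dev/zvol/rpool/fio-zvol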

One last thing: a pool made up of 5 mirror vdevs will have quite good IOPS, as the load can be spread across the vdevs. raidz vdevs are not really great regarding IOPS.
 
Hi @aaron,

Thanks for your reply; primarycache=none is set on the zpool in question. I will be using containers, so I guess the filesystem dataset is relevant? The system is idle, and the benchmark duration appears not to have an impact.

My concern is the per-device IOPS: it appears to make sense for HDDs but not for SSDs, especially w.r.t. 'raw' device performance. E.g. I benchmark the raw device with --filename=/dev/sda:
Code:
fio --ioengine=libaio --filename=/dev/sda --size=4G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randread  --bs=4k --numjobs=32

and get

Code:
read: IOPS=75.2k, BW=294MiB/s (308MB/s)(2936MiB/10001msec)

Compared to --filename=/rpool/fio
Code:
read: IOPS=11.3k, BW=44.3MiB/s (46.4MB/s)(4952MiB/111805msec)

See this post for the HDD/SSD raw/ZFS comparison: https://forum.proxmox.com/threads/p...imilar-to-raid-z10-12-x-hdd.99967/post-431629

I wouldn't expect IOPS to drop by nearly an order of magnitude when benchmarking through ZFS - is that a valid comparison? Am I missing something?

NB according to various sources a raidz vdev should have the performance of a single device, and adding N vdevs should scale that by N, so I should be seeing 2N x the raw performance, i.e. ~140k IOPS from ZFS...
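Spelling out that rule-of-thumb arithmetic with the numbers above (an expectation only, not a measurement):
Code:
# rule of thumb: a raidz vdev delivers roughly the IOPS of one member drive,
# and a pool scales with the number of vdevs
#   single SSD, raw 4k randread (measured above)  ~ 75k IOPS
#   2 x raidz1 vdevs                              ~ 2 x 70-75k = ~140-150k IOPS expected
#   same test through ZFS (measured above)        ~ 11.3k IOPS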

Thanks again!
 
ZFS has a lot of overhead. I measured a total write amplification from factor 3 (async 1M sequential writes) up to factor 82 (sync random 4K writes), with a real-world average of about factor 20. So performance can drop to anywhere from 1/3 to 1/82 of the raw drive performance.

So don't be surprised if you lose an order of magnitude of performance using ZFS.
 
Thanks @Dunuin - how do you measure write amplification directly?

Also, why would ZFS write amplification affect read IOPS?
 
It isn't that easy. Write a fixed amount of data using fio and monitor the SMART attributes to see how much is actually written to the individual drives.
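A rough sketch of that procedure (SMART attribute names differ per vendor and model, so the grep pattern and device path are only examples):
Code:
smartctl -A /dev/sda | egrep -i 'Host_Writes|NAND_Writes|Total_LBAs_Written'   # counters before
fio --name=wa-test --filename=/rpool/fio/wa-test --ioengine=sync --direct=1 --sync=1 --rw=randwrite --bs=4k --size=1G
smartctl -A /dev/sda | egrep -i 'Host_Writes|NAND_Writes|Total_LBAs_Written'   # counters after
# write amplification ~= sum of the per-drive counter deltas (converted to bytes) / 1 GiB written by fio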

You also get read overhead. Let's say your guest is reading 4K blocks but your ZFS pool is using a 32K blocksize. If you read 100x 4K blocks it will read 100x 32K, so you are reading 8 times more than needed. Most of it will be cached, but if you disable the ARC you will see it.
 
I don't believe that's true: the ZFS recordsize parameter is just the maximum size; check zpool iostat -r [pool] to see the distribution of various block sizes. If you mean volblocksize (is that the minimum?), mine doesn't show a value - is that weird? But I guess it's 4k
Code:
# zfs get volblocksize rpool
NAME   PROPERTY      VALUE     SOURCE
rpool  volblocksize  -         -

Besides, these benchmarks are aligned to --bs=4k, so they shouldn't be showing much read/write amplification.
 
The recordsize is only used for datasets, and it is indeed an "up to" value. But for zvols only the volblocksize is used, and that is a fixed value. A zvol will always use this volblocksize no matter how big or small your write is. So reading/writing 100x 4K to a zvol with a 32K volblocksize will read/write 3.2MB (+ metadata) instead of just 400kB.
 
Thanks @Dunuin, appreciate your attention. I'm not using a zvol (I think?)

Also, why do I not see a similar slowdown when benchmarking on the HDDs in terms of underlying IOPS?
 
Is raidz just really slow? Should I give up and use mirrors? Various places state that a raidz vdev goes at the speed of the slowest device, but the underlying per-device IOPS aren't anywhere near maxed out compared to what the SSDs can do...
 
FYI (about poor ZFS performance with 4k):

ZFS (4 x NVMe SSD in RAIDZ1 plus 1 NVMe SSD for LOG):
Bash:
# zpool get all | egrep 'ashift|trim'
zfs-p1  ashift                         13                             local
zfs-p1  autotrim                       on                             local
# zfs get all zfs-p1/subvol-700-disk-0 | egrep 'compression|record|atime'
zfs-p1/subvol-700-disk-0  recordsize            128K                       local
zfs-p1/subvol-700-disk-0  compression           off                        inherited from zfs-p1
zfs-p1/subvol-700-disk-0  atime                 off                        local
zfs-p1/subvol-700-disk-0  relatime              on                         inherited from zfs-p1
# zpool status -LP
  pool: zfs-p1
 state: ONLINE
config:

    NAME                STATE     READ WRITE CKSUM
    zfs-p1              ONLINE       0     0     0
      raidz1-0          ONLINE       0     0     0
        /dev/nvme1n1p1  ONLINE       0     0     0
        /dev/nvme2n1p1  ONLINE       0     0     0
        /dev/nvme3n1p1  ONLINE       0     0     0
        /dev/nvme4n1p1  ONLINE       0     0     0
    logs
      /dev/nvme6n1p1    ONLINE       0     0     0


# fio --time_based --name=benchmark --size=8G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=44.7MiB/s][w=11.4k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=14060: Wed Nov 24 19:50:18 2021
  write: IOPS=12.9k, BW=50.2MiB/s (52.7MB/s)(1507MiB/30001msec); 0 zone resets
    slat (usec): min=5, max=58457, avg=309.76, stdev=402.81
    clat (nsec): min=1720, max=110956k, avg=9643453.29, stdev=4344838.92
     lat (usec): min=245, max=111428, avg=9953.39, stdev=4465.79
    clat percentiles (usec):
     |  1.00th=[  938],  5.00th=[ 1270], 10.00th=[ 4555], 20.00th=[ 6915],
     | 30.00th=[ 8094], 40.00th=[ 9110], 50.00th=[ 9896], 60.00th=[10683],
     | 70.00th=[11469], 80.00th=[12387], 90.00th=[13698], 95.00th=[15139],
     | 99.00th=[18482], 99.50th=[20317], 99.90th=[52691], 99.95th=[57934],
     | 99.99th=[70779]
   bw (  KiB/s): min=25664, max=247912, per=100.00%, avg=51610.83, stdev=6881.92, samples=236
   iops        : min= 6416, max=61980, avg=12902.63, stdev=1720.54, samples=236
  lat (usec)   : 2=0.01%, 4=0.01%, 250=0.01%, 500=0.01%, 750=0.03%
  lat (usec)   : 1000=1.92%
  lat (msec)   : 2=4.35%, 4=2.63%, 10=42.31%, 20=48.21%, 50=0.43%
  lat (msec)   : 100=0.12%, 250=0.01%
  cpu          : usr=0.53%, sys=11.35%, ctx=354587, majf=0, minf=759
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,385765,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=50.2MiB/s (52.7MB/s), 50.2MiB/s-50.2MiB/s (52.7MB/s-52.7MB/s), io=1507MiB (1580MB), run=30001-30001msec


https://github.com/masonr/yet-another-bench-script/blob/1a56f578111f302ce3238b943810e119e26a7fed/yabs.sh#L296

# zfs primarycache=all
fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 246.05 MB/s  (61.5k) | 2.62 GB/s    (41.0k)
Write      | 246.70 MB/s  (61.6k) | 2.63 GB/s    (41.2k)
Total      | 492.75 MB/s (123.1k) | 5.26 GB/s    (82.2k)
           |                      |                 
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 3.60 GB/s     (7.0k) | 3.62 GB/s     (3.5k)
Write      | 3.80 GB/s     (7.4k) | 3.86 GB/s     (3.7k)
Total      | 7.41 GB/s    (14.4k) | 7.48 GB/s     (7.3k)

# zfs primarycache=metadata
fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 9.52 MB/s     (2.3k) | 106.89 MB/s   (1.6k)
Write      | 9.55 MB/s     (2.3k) | 107.45 MB/s   (1.6k)
Total      | 19.07 MB/s    (4.7k) | 214.35 MB/s   (3.3k)
           |                      |                 
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 1.13 GB/s     (2.2k) | 1.23 GB/s     (1.2k)
Write      | 1.19 GB/s     (2.3k) | 1.31 GB/s     (1.2k)
Total      | 2.32 GB/s     (4.5k) | 2.55 GB/s     (2.4k)

# zfs primarycache=none
fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 4.59 MB/s     (1.1k) | 135.02 MB/s   (2.1k)
Write      | 4.61 MB/s     (1.1k) | 135.73 MB/s   (2.1k)
Total      | 9.20 MB/s     (2.3k) | 270.75 MB/s   (4.2k)
           |                      |                 
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 909.15 MB/s   (1.7k) | 756.73 MB/s    (739)
Write      | 957.46 MB/s   (1.8k) | 807.13 MB/s    (788)
Total      | 1.86 GB/s     (3.6k) | 1.56 GB/s     (1.5k)

XFS (single NVME SSD):
Bash:
# stat /mnt/pve/xfs-pool1/images/700/vm-700-disk-0.raw
  File: /mnt/pve/xfs-pool1/images/700/vm-700-disk-0.raw
  Size: 1073741824000    Blocks: 40993344   IO Block: 4096   regular file
Device: 10310h/66320d    Inode: 133         Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-11-24 21:02:03.995569404 +0300
Modify: 2021-11-24 21:14:43.396155625 +0300
Change: 2021-11-24 21:14:43.396155625 +0300
 Birth: 2021-11-16 22:50:06.299116170 +0300

# mount | grep xfs
/dev/nvme5n1p1 on /mnt/pve/xfs-pool1 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)


# fio --time_based --name=benchmark --size=8G --runtime=30 --filename=/mnt/xfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=944MiB/s][w=242k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=13798: Wed Nov 24 19:49:42 2021
  write: IOPS=193k, BW=753MiB/s (789MB/s)(22.0GiB/30001msec); 0 zone resets
    slat (nsec): min=1960, max=622174, avg=15113.64, stdev=8351.92
    clat (nsec): min=380, max=1651.3M, avg=648625.13, stdev=15277115.14
     lat (usec): min=24, max=1651.3k, avg=663.84, stdev=15277.09
    clat percentiles (usec):
     |  1.00th=[  351],  5.00th=[  396], 10.00th=[  437], 20.00th=[  469],
     | 30.00th=[  482], 40.00th=[  494], 50.00th=[  506], 60.00th=[  515],
     | 70.00th=[  529], 80.00th=[  545], 90.00th=[  570], 95.00th=[  594],
     | 99.00th=[  685], 99.50th=[  725], 99.90th=[  873], 99.95th=[ 1074],
     | 99.99th=[ 2900]
   bw (  KiB/s): min=71776, max=1320696, per=100.00%, avg=906513.38, stdev=57278.57, samples=200
   iops        : min=17944, max=330174, avg=226628.36, stdev=14319.64, samples=200
  lat (nsec)   : 500=0.01%
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=44.92%, 750=54.74%
  lat (usec)   : 1000=0.27%
  lat (msec)   : 2=0.05%, 4=0.01%, 2000=0.01%
  cpu          : usr=3.55%, sys=45.22%, ctx=4937745, majf=0, minf=304
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5779550,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=753MiB/s (789MB/s), 753MiB/s-753MiB/s (789MB/s-789MB/s), io=22.0GiB (23.7GB), run=30001-30001msec

https://github.com/masonr/yet-another-bench-script/blob/1a56f578111f302ce3238b943810e119e26a7fed/yabs.sh#L296

fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 414.15 MB/s (103.5k) | 2.14 GB/s    (33.4k)
Write      | 415.24 MB/s (103.8k) | 2.15 GB/s    (33.6k)
Total      | 829.40 MB/s (207.3k) | 4.29 GB/s    (67.1k)
           |                      |                 
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 2.37 GB/s     (4.6k) | 1.95 GB/s     (1.9k)
Write      | 2.50 GB/s     (4.8k) | 2.08 GB/s     (2.0k)
Total      | 4.87 GB/s     (9.5k) | 4.03 GB/s     (3.9k)

Bash:
# nvme list -o json | egrep 'DevicePath|ModelNumber|SectorSize'
      "DevicePath" : "/dev/nvme0n1",
      "ModelNumber" : "KXG60ZNV256G TOSHIBA",
      "SectorSize" : 512
      "DevicePath" : "/dev/nvme1n1",
      "ModelNumber" : "INTEL SSDPE2KE076T8",
      "SectorSize" : 4096
      "DevicePath" : "/dev/nvme2n1",
      "ModelNumber" : "INTEL SSDPE2KE076T8",
      "SectorSize" : 4096
      "DevicePath" : "/dev/nvme3n1",
      "ModelNumber" : "INTEL SSDPE2KE076T8",
      "SectorSize" : 4096
      "DevicePath" : "/dev/nvme4n1",
      "ModelNumber" : "INTEL SSDPE2KE076T8",
      "SectorSize" : 4096
      "DevicePath" : "/dev/nvme5n1",
      "ModelNumber" : "INTEL SSDPE2KE076T8",
      "SectorSize" : 512
      "DevicePath" : "/dev/nvme6n1",
      "ModelNumber" : "INTEL SSDPE2KE076T8",
      "SectorSize" : 4096
 
Also, from man fio:
Code:
   I/O size
       size=int
              The total size of file I/O for each thread of this job. Fio will run until this many bytes has been transferred, unless runtime is limited by other options (such as runtime, for instance, or increased/decreased by io_size). Fio will divide this size between the available files determined by options such as nrfiles, filename, unless filesize is specified by the job. If the result of division happens to be 0, the size is set to the physical size of the given files or devices if they exist. If this option is not specified, fio will use the full size of the given files or devices. If the files do not exist, size must be given. It is also possible to give size as a percentage between 1 and 100. If `size=20%' is given, fio will use 20% of the full size of the given files or devices. Can be combined with offset to constrain the start and end range that I/O will be done within.

Bash:
# fio --time_based --name=benchmark --size=1M --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=893MiB/s][w=229k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=18292: Wed Nov 24 20:17:27 2021
  write: IOPS=233k, BW=909MiB/s (953MB/s)(26.6GiB/30002msec); 0 zone resets
    slat (usec): min=3, max=36433, avg=16.45, stdev=34.81
    clat (nsec): min=1380, max=37005k, avg=533126.42, stdev=194848.62
     lat (usec): min=12, max=37020, avg=549.64, stdev=197.97
    clat percentiles (usec):
     |  1.00th=[  465],  5.00th=[  494], 10.00th=[  502], 20.00th=[  515],
     | 30.00th=[  519], 40.00th=[  529], 50.00th=[  529], 60.00th=[  537],
     | 70.00th=[  537], 80.00th=[  545], 90.00th=[  562], 95.00th=[  570],
     | 99.00th=[  685], 99.50th=[  717], 99.90th=[  775], 99.95th=[  807],
     | 99.99th=[ 1074]
   bw (  KiB/s): min=851720, max=995720, per=100.00%, avg=931495.86, stdev=4753.39, samples=236
   iops        : min=212930, max=248930, avg=232874.00, stdev=1188.36, samples=236
  lat (usec)   : 2=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=8.18%, 750=91.63%, 1000=0.17%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=4.41%, sys=94.83%, ctx=133698, majf=0, minf=82
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,6981612,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=909MiB/s (953MB/s), 909MiB/s-909MiB/s (953MB/s-953MB/s), io=26.6GiB (28.6GB), run=30002-30002msec

ZFS performance degrades once the fio "size" parameter goes above ~1000M-1600M.

Bash:
# fio --time_based --name=benchmark --size=20G --runtime=30 --filename=/mnt/xfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
benchmark: Laying out IO file (1 file / 20480MiB)
Jobs: 4 (f=4): [w(4)][100.0%][eta 00m:00s]                     
benchmark: (groupid=0, jobs=4): err= 0: pid=18819: Wed Nov 24 20:19:35 2021
  write: IOPS=42.3k, BW=165MiB/s (173MB/s)(5043MiB/30500msec); 0 zone resets
    slat (usec): min=2, max=5022, avg=61.69, stdev=64.94
    clat (usec): min=333, max=1968.5k, avg=2960.51, stdev=39498.35
     lat (usec): min=342, max=1968.6k, avg=3022.38, stdev=39498.09
    clat percentiles (usec):
     |  1.00th=[    955],  5.00th=[   1319], 10.00th=[   1434],
     | 20.00th=[   1532], 30.00th=[   1614], 40.00th=[   1680],
     | 50.00th=[   1762], 60.00th=[   1909], 70.00th=[   2147],
     | 80.00th=[   2474], 90.00th=[   2933], 95.00th=[   3359],
     | 99.00th=[   4293], 99.50th=[   4686], 99.90th=[   6325],
     | 99.95th=[1384121], 99.99th=[1669333]
   bw (  KiB/s): min=58424, max=310552, per=100.00%, avg=229470.04, stdev=13085.52, samples=180
   iops        : min=14606, max=77638, avg=57367.56, stdev=3271.39, samples=180
  lat (usec)   : 500=0.01%, 750=0.34%, 1000=0.90%
  lat (msec)   : 2=62.96%, 4=34.06%, 10=1.67%, 20=0.01%, 2000=0.06%
  cpu          : usr=2.16%, sys=29.68%, ctx=2268685, majf=0, minf=74
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1290898,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=165MiB/s (173MB/s), 165MiB/s-165MiB/s (173MB/s-173MB/s), io=5043MiB (5288MB), run=30500-30500msec

Disk stats (read/write):
  loop1: ios=0/1334507, merge=0/0, ticks=0/1164304, in_queue=1172433, util=87.67%

XFS performance degrades once the fio "size" parameter goes above 10G.

And keep in mind: this is a ZFS pool on 5 x NVMe vs XFS on a single NVMe.
And of course note how things are mounted in the container (ext4 -> xfs -> device):
Bash:
root@fio-bench:~# mount | grep mnt
zfs-p1/subvol-700-disk-0 on /mnt/zfs type zfs (rw,noatime,xattr,posixacl)
/mnt/pve/xfs-pool1/images/700/vm-700-disk-0.raw on /mnt/xfs type ext4 (rw,relatime)
 
You created your pool "zfs-p1" with an ashift of 13, so an 8K blocksize is used. Then you have a raidz1 with 4 disks, so you want a volblocksize of at least 4 times your ashift, i.e. 32K (4 x 8K), to only lose 33% of your raw capacity instead of 50%. So each time you do a 4K read/write to a zvol it will need to read/write a full 32K, so you only get 1/8 of the performance of a single drive with XFS that is working with a 4K blocksize.
And with a raidz1 of 5 drives you might even want a volblocksize of 8 times your ashift (so 64K) to only lose 20% instead of 50% of your raw capacity, and in that case your 4K performance would drop down to 1/16th.
And then there is metadata too. If I remember right, ZFS will write 3 copies of it, so I guess for each 4K write you also write 3x 8K of metadata in addition to the 32K or 64K of data+parity+padding.
So don't be surprised if you lose a lot of IOPS when using ZFS, especially with a raidz that requires a big blocksize.

For datasets it shouldn't be that bad, because the recordsize of 128K allows files to be written in blocks from 8K (your ashift) up to 128K. But each 4K read/write would still need to read/write double the data, because nothing smaller than 8K can be read/written.
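To make the dataset vs. zvol distinction concrete, roughly (dataset names and block sizes below are examples only, not tuning advice):
Code:
# datasets: recordsize is an "up to" value and can be changed (affects newly written data only)
zfs get recordsize zfs-p1/subvol-700-disk-0
zfs set recordsize=16k zfs-p1/subvol-700-disk-0     # example value
# zvols: volblocksize is fixed and can only be chosen when the zvol is created
zfs create -V 10G -o volblocksize=32k zfs-p1/testvol
zfs get volblocksize zfs-p1/testvol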
 

With ashift=12 it's the same - once the fio "size" parameter goes above ~2G with bs=4k, ZFS performance drops:
Bash:
# fio --time_based --name=benchmark --size=1800M --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
...
  write: IOPS=139k, BW=544MiB/s (570MB/s)(15.9GiB/30003msec); 0 zone resets

# fio --time_based --name=benchmark --size=2G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
...
  write: IOPS=95.6k, BW=373MiB/s (391MB/s)(10.9GiB/30001msec); 0 zone resets

# fio --time_based --name=benchmark --size=3G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
...
  write: IOPS=20.8k, BW=81.3MiB/s (85.2MB/s)(2438MiB/30001msec); 0 zone resets

# fio --time_based --name=benchmark --size=4G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
...
  write: IOPS=14.2k, BW=55.5MiB/s (58.2MB/s)(1666MiB/30001msec); 0 zone resets

# fio --time_based --name=benchmark --size=6G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
...
  write: IOPS=11.9k, BW=46.4MiB/s (48.7MB/s)(1392MiB/30001msec); 0 zone resets

# fio --time_based --name=benchmark --size=8G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
...
  write: IOPS=11.9k, BW=46.6MiB/s (48.9MB/s)(1399MiB/30001msec); 0 zone resets
 