Proxmox VE ZFS Benchmark with NVMe

martin · Dec 15, 2020

To optimize performance in hyper-converged deployments with Proxmox VE and ZFS storage, the appropriate hardware setup is essential. This benchmark presents a possible setup and its resulting performance, with the intention of supporting Proxmox users in making better decisions.

Download PDF
https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020
__________________
Best regards,

Martin Maurer
Proxmox VE project leader

jsterr · Dec 15, 2020

Nice, very useful! What ioshift and zfs recordsize was used? I experimented a little bit with zfs recordsize but was not sure, if it should be changed from its default value of 128K.

Alwin · Dec 15, 2020

jsterr said:
ioshift and zfs recordsize was used?

ashift is autodetect and volblocksize is 8K. Both are the default values.

Dunuin · Dec 25, 2020

FAQ on page 8:

Can I use consumer or pro-sumer SSDs, as these are much cheaper than enterprise-class SSD?
No. Never. These SSDs wont provide the required performance, reliability or endurance. See the fio results from before and/or run your own fio tests.

Looks like 90% of the people of this forum don't get this.^^

elurex · Dec 30, 2020

runtime=60 is not enough to do real world testing

elurex · Dec 30, 2020

Dunuin said:
FAQ on page 8:

Looks like 90% of the people of this forum don't get this.^^

this is a yes and no....

Yes you can use Intel 900P/905P optane SSD but No You cannot use Intel SSD 670p or H2 optane

It is really down to how much testing have you done

auranext · Jan 13, 2021

Hello everyone and Happy new Year !

When it is not possible to use motherboard embendded disk controller which HBA card is recommanded for ZFS replication setup ?
is there other possible technology like NVMe card ?
does anyone have a good card controller experience to share ?

thank you

RedneckBob · Jan 15, 2021

auranext said:
Hello everyone and Happy new Year !

When it is not possible to use motherboard embendded disk controller which HBA card is recommanded for ZFS replication setup ?
is there other possible technology like NVMe card ?
does anyone have a good card controller experience to share ?

thank you

No simple answer because it all depends on your objectives and hardware constraints. Do you have full PCIe 3.0 x16 slots open? How many? Are they restricted when using onboard M.2 slots or SATA ports? Are they bifurcated or non-bifurcated? You want to use M.2 NVMe drives or 2.5" NVMe drives? How many drives? So many questions when going down the do-it-yourself road.

For one of my test systems I had a PCIe 3.0 x8 slot open, so decided to roll the dice and copy a setup I saw on Amazon. It got me 4 x 1T NVMe drives with 1 HBA in a smallish deskside system:

1 x PCIe 3.0 Adapter 4 "multiplexed" x4 over x8 - link
4 1T - Intel DC P4510 2.5" NVMe drives - link
4 cables - SFF-8643 to SFF-8639 - link

It probably takes more of a beating than most of our production systems, so far no issues. I do question multiplexing 16x (4 4x) PCIe 3.0 slots through a single 8x PCIe 3.0 slot. Speedy, but never benchmarked it, using datacener drives makes a big difference too. Also a no-name card from China, but appears to be solid as well. There is a 16x PCIe 3.0 version, rather than 8x PCIe 3.0 and there is a price difference.

Not sure what your objectives are, but if building a homelab system and want a robust do-it-yourself disk subsystem, the above might work for you.

aaron · Jan 15, 2021

Since the Micron 9300 that we use in the benchmark paper support different block sizes that can be configured for the namespaces, we did some testing to see how they affect performance.

We tested the setup as in the benchmark paper:

IOPS tests

1 Mirror pool on the NVMEs with the default 512b block"size and a zvol with default 8k volblocksize:

Code:

# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=181MiB/s][w=46.3k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=95469: Fri Jan 15 12:54:59 2021
  write: IOPS=66.2k, BW=258MiB/s (271MB/s)(151GiB/600002msec); 0 zone resets
    clat (usec): min=59, max=143209, avg=482.78, stdev=1112.45
     lat (usec): min=59, max=143210, avg=482.95, stdev=1112.45
    clat percentiles (usec):
     |  1.00th=[  215],  5.00th=[  245], 10.00th=[  265], 20.00th=[  293],
     | 30.00th=[  322], 40.00th=[  359], 50.00th=[  404], 60.00th=[  445],
     | 70.00th=[  494], 80.00th=[  578], 90.00th=[  742], 95.00th=[  996],
     | 99.00th=[ 1401], 99.50th=[ 1614], 99.90th=[ 5145], 99.95th=[ 7701],
     | 99.99th=[12649]
   bw (  KiB/s): min= 4856, max=11424, per=3.13%, avg=8270.95, stdev=1912.80, samples=38369
   iops        : min= 1214, max= 2856, avg=2067.72, stdev=478.20, samples=38369
  lat (usec)   : 100=0.03%, 250=5.97%, 500=64.97%, 750=19.27%, 1000=4.77%
  lat (msec)   : 2=4.76%, 4=0.11%, 10=0.09%, 20=0.01%, 100=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.45%, sys=24.12%, ctx=281344506, majf=0, minf=349
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,39696863,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=258MiB/s (271MB/s), 258MiB/s-258MiB/s (271MB/s-271MB/s), io=151GiB (163GB), run=600002-600002msec

The result of 46k IOPS is in the ballpark of the result of the benchmark paper. So far no surprise.

Recreating the test on the same kind of NVMEs but with the block size set to 4k:

Code:

# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=297MiB/s][w=75.9k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=121126: Fri Jan 15 13:20:53 2021
  write: IOPS=85.9k, BW=335MiB/s (352MB/s)(197GiB/600002msec); 0 zone resets
    clat (usec): min=58, max=144238, avg=371.76, stdev=1097.88
     lat (usec): min=58, max=144238, avg=371.95, stdev=1097.88
    clat percentiles (usec):
     |  1.00th=[  196],  5.00th=[  225], 10.00th=[  239], 20.00th=[  258],
     | 30.00th=[  273], 40.00th=[  289], 50.00th=[  302], 60.00th=[  322],
     | 70.00th=[  347], 80.00th=[  392], 90.00th=[  498], 95.00th=[  676],
     | 99.00th=[ 1287], 99.50th=[ 1532], 99.90th=[ 5932], 99.95th=[ 8029],
     | 99.99th=[12387]
   bw (  KiB/s): min= 7456, max=11680, per=3.12%, avg=10730.64, stdev=866.24, samples=38374
   iops        : min= 1864, max= 2920, avg=2682.64, stdev=216.56, samples=38374
  lat (usec)   : 100=0.09%, 250=15.87%, 500=74.23%, 750=5.73%, 1000=1.90%
  lat (msec)   : 2=1.98%, 4=0.06%, 10=0.11%, 20=0.02%, 50=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.56%, sys=37.53%, ctx=339985859, majf=0, minf=404
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,51513479,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=335MiB/s (352MB/s), 335MiB/s-335MiB/s (352MB/s-352MB/s), io=197GiB (211GB), run=600002-600002msec

As you can see, using the larger 4k block size for the NVME namespace, we get ~76k IOPS which is close to double the IOPS performance.

Bandwidth tests:

NVME 512b blocksize pool:

Code:

# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1810MiB/s][w=452 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=81902: Fri Jan 15 12:44:31 2021
  write: IOPS=431, BW=1727MiB/s (1811MB/s)(1012GiB/600016msec); 0 zone resets
    clat (msec): min=2, max=403, avg=73.90, stdev=22.84
     lat (msec): min=3, max=403, avg=74.10, stdev=22.86
    clat percentiles (msec):
     |  1.00th=[   43],  5.00th=[   49], 10.00th=[   55], 20.00th=[   59],
     | 30.00th=[   64], 40.00th=[   68], 50.00th=[   72], 60.00th=[   77],
     | 70.00th=[   82], 80.00th=[   86], 90.00th=[   89], 95.00th=[   93],
     | 99.00th=[  197], 99.50th=[  222], 99.90th=[  271], 99.95th=[  284],
     | 99.99th=[  313]
   bw (  KiB/s): min= 8192, max=98304, per=3.12%, avg=55256.56, stdev=12647.86, samples=38399
   iops        : min=    2, max=   24, avg=13.43, stdev= 3.10, samples=38399
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.06%, 50=5.71%, 100=90.53%
  lat (msec)   : 250=3.46%, 500=0.21%
  cpu          : usr=0.27%, sys=4.86%, ctx=3678933, majf=0, minf=359
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,259085,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1727MiB/s (1811MB/s), 1727MiB/s-1727MiB/s (1811MB/s-1811MB/s), io=1012GiB (1087GB), run=600016-600016msec

We get about 1700MB/s Bandwidth.

NVME 4k blocksize pool

Code:

# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1280MiB/s][w=320 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=201124: Fri Jan 15 12:29:19 2021
  write: IOPS=454, BW=1818MiB/s (1907MB/s)(1066GiB/600049msec); 0 zone resets
    clat (msec): min=3, max=411, avg=70.14, stdev=25.82
     lat (msec): min=3, max=411, avg=70.39, stdev=25.83
    clat percentiles (msec):
     |  1.00th=[   46],  5.00th=[   52], 10.00th=[   54], 20.00th=[   58],
     | 30.00th=[   61], 40.00th=[   64], 50.00th=[   67], 60.00th=[   70],
     | 70.00th=[   75], 80.00th=[   79], 90.00th=[   83], 95.00th=[   89],
     | 99.00th=[  230], 99.50th=[  271], 99.90th=[  338], 99.95th=[  359],
     | 99.99th=[  388]
   bw (  KiB/s): min=16384, max=98304, per=3.12%, avg=58180.81, stdev=12074.67, samples=38400
   iops        : min=    4, max=   24, avg=14.17, stdev= 2.95, samples=38400
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.03%, 50=3.77%, 100=94.07%
  lat (msec)   : 250=1.37%, 500=0.75%
  cpu          : usr=0.34%, sys=6.25%, ctx=3648814, majf=0, minf=346
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,272779,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1818MiB/s (1907MB/s), 1818MiB/s-1818MiB/s (1907MB/s-1907MB/s), io=1066GiB (1144GB), run=600049-600049msec

With the 4k blocksize namespaces there is not a significant higher bandwidth (1800MB/s).

The output of nvme list to show the NVMEs configured with different block sizes:

Code:

# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     194525xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme1n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme2n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0
/dev/nvme3n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0

TL;DR:

Changing the block size of the NVME namespace can improve the performance. Tested with 512b and 4k NVME block sizes and a zfs mirror with a zvol (8k volblocksize).

512b NVME block size: ~46k IOPS, ~1700MB/s bandwidth
4k NVME block size: ~75k IOPS, ~1800MB/s bandwidth

RedneckBob · Jan 17, 2021

aaron said:
512b NVME block size: ~46k IOPS, ~1700MB/s bandwidth
4k NVME block size: ~75k IOPS, ~1800MB/s bandwidth

udo · Jan 18, 2021

Hi,
is it possible to select the 4k NVME block sizes during pve install on an nvme-only system?
Or must I set the blocksize before (I guess)?
Udo

aaron · Jan 18, 2021

udo said:
Hi,
is it possible to select the 4k NVME block sizes during pve install on an nvme-only system?
Or must I set the blocksize before (I guess)?
Udo

This needs to be done before installing and probably depends on the used NVME. For the Micron NVME we have, we used microns msecli tool to delete the (default) namespace and create a new namespace with the larger block size.

Alwin Antreich · Jan 19, 2021

udo said:
is it possible to select the 4k NVME block sizes during pve install on an nvme-only system?
Or must I set the blocksize before (I guess)?

Maybe to better clarify, @aaron changed the format of the NVMe SSD itself. The volblocksize of the zvol was still 8K.

mfreund · Jan 29, 2021

aaron said:

Since the Micron 9300 that we use in the benchmark paper support different block sizes that can be configured for the namespaces, we did some testing to see how they affect performance.

We tested the setup as in the benchmark paper:

IOPS tests

1 Mirror pool on the NVMEs with the default 512b block"size and a zvol with default 8k volblocksize:

Code:

# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=181MiB/s][w=46.3k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=95469: Fri Jan 15 12:54:59 2021
  write: IOPS=66.2k, BW=258MiB/s (271MB/s)(151GiB/600002msec); 0 zone resets
    clat (usec): min=59, max=143209, avg=482.78, stdev=1112.45
     lat (usec): min=59, max=143210, avg=482.95, stdev=1112.45
    clat percentiles (usec):
     |  1.00th=[  215],  5.00th=[  245], 10.00th=[  265], 20.00th=[  293],
     | 30.00th=[  322], 40.00th=[  359], 50.00th=[  404], 60.00th=[  445],
     | 70.00th=[  494], 80.00th=[  578], 90.00th=[  742], 95.00th=[  996],
     | 99.00th=[ 1401], 99.50th=[ 1614], 99.90th=[ 5145], 99.95th=[ 7701],
     | 99.99th=[12649]
   bw (  KiB/s): min= 4856, max=11424, per=3.13%, avg=8270.95, stdev=1912.80, samples=38369
   iops        : min= 1214, max= 2856, avg=2067.72, stdev=478.20, samples=38369
  lat (usec)   : 100=0.03%, 250=5.97%, 500=64.97%, 750=19.27%, 1000=4.77%
  lat (msec)   : 2=4.76%, 4=0.11%, 10=0.09%, 20=0.01%, 100=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.45%, sys=24.12%, ctx=281344506, majf=0, minf=349
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,39696863,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=258MiB/s (271MB/s), 258MiB/s-258MiB/s (271MB/s-271MB/s), io=151GiB (163GB), run=600002-600002msec

The result of 46k IOPS is in the ballpark of the result of the benchmark paper. So far no surprise.

Recreating the test on the same kind of NVMEs but with the block size set to 4k:

Code:

# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=297MiB/s][w=75.9k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=121126: Fri Jan 15 13:20:53 2021
  write: IOPS=85.9k, BW=335MiB/s (352MB/s)(197GiB/600002msec); 0 zone resets
    clat (usec): min=58, max=144238, avg=371.76, stdev=1097.88
     lat (usec): min=58, max=144238, avg=371.95, stdev=1097.88
    clat percentiles (usec):
     |  1.00th=[  196],  5.00th=[  225], 10.00th=[  239], 20.00th=[  258],
     | 30.00th=[  273], 40.00th=[  289], 50.00th=[  302], 60.00th=[  322],
     | 70.00th=[  347], 80.00th=[  392], 90.00th=[  498], 95.00th=[  676],
     | 99.00th=[ 1287], 99.50th=[ 1532], 99.90th=[ 5932], 99.95th=[ 8029],
     | 99.99th=[12387]
   bw (  KiB/s): min= 7456, max=11680, per=3.12%, avg=10730.64, stdev=866.24, samples=38374
   iops        : min= 1864, max= 2920, avg=2682.64, stdev=216.56, samples=38374
  lat (usec)   : 100=0.09%, 250=15.87%, 500=74.23%, 750=5.73%, 1000=1.90%
  lat (msec)   : 2=1.98%, 4=0.06%, 10=0.11%, 20=0.02%, 50=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.56%, sys=37.53%, ctx=339985859, majf=0, minf=404
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,51513479,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=335MiB/s (352MB/s), 335MiB/s-335MiB/s (352MB/s-352MB/s), io=197GiB (211GB), run=600002-600002msec

As you can see, using the larger 4k block size for the NVME namespace, we get ~76k IOPS which is close to double the IOPS performance.

Bandwidth tests:

NVME 512b blocksize pool:

Code:

# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1810MiB/s][w=452 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=81902: Fri Jan 15 12:44:31 2021
  write: IOPS=431, BW=1727MiB/s (1811MB/s)(1012GiB/600016msec); 0 zone resets
    clat (msec): min=2, max=403, avg=73.90, stdev=22.84
     lat (msec): min=3, max=403, avg=74.10, stdev=22.86
    clat percentiles (msec):
     |  1.00th=[   43],  5.00th=[   49], 10.00th=[   55], 20.00th=[   59],
     | 30.00th=[   64], 40.00th=[   68], 50.00th=[   72], 60.00th=[   77],
     | 70.00th=[   82], 80.00th=[   86], 90.00th=[   89], 95.00th=[   93],
     | 99.00th=[  197], 99.50th=[  222], 99.90th=[  271], 99.95th=[  284],
     | 99.99th=[  313]
   bw (  KiB/s): min= 8192, max=98304, per=3.12%, avg=55256.56, stdev=12647.86, samples=38399
   iops        : min=    2, max=   24, avg=13.43, stdev= 3.10, samples=38399
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.06%, 50=5.71%, 100=90.53%
  lat (msec)   : 250=3.46%, 500=0.21%
  cpu          : usr=0.27%, sys=4.86%, ctx=3678933, majf=0, minf=359
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,259085,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1727MiB/s (1811MB/s), 1727MiB/s-1727MiB/s (1811MB/s-1811MB/s), io=1012GiB (1087GB), run=600016-600016msec

We get about 1700MB/s Bandwidth.

NVME 4k blocksize pool

Code:

# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1280MiB/s][w=320 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=201124: Fri Jan 15 12:29:19 2021
  write: IOPS=454, BW=1818MiB/s (1907MB/s)(1066GiB/600049msec); 0 zone resets
    clat (msec): min=3, max=411, avg=70.14, stdev=25.82
     lat (msec): min=3, max=411, avg=70.39, stdev=25.83
    clat percentiles (msec):
     |  1.00th=[   46],  5.00th=[   52], 10.00th=[   54], 20.00th=[   58],
     | 30.00th=[   61], 40.00th=[   64], 50.00th=[   67], 60.00th=[   70],
     | 70.00th=[   75], 80.00th=[   79], 90.00th=[   83], 95.00th=[   89],
     | 99.00th=[  230], 99.50th=[  271], 99.90th=[  338], 99.95th=[  359],
     | 99.99th=[  388]
   bw (  KiB/s): min=16384, max=98304, per=3.12%, avg=58180.81, stdev=12074.67, samples=38400
   iops        : min=    4, max=   24, avg=14.17, stdev= 2.95, samples=38400
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.03%, 50=3.77%, 100=94.07%
  lat (msec)   : 250=1.37%, 500=0.75%
  cpu          : usr=0.34%, sys=6.25%, ctx=3648814, majf=0, minf=346
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,272779,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1818MiB/s (1907MB/s), 1818MiB/s-1818MiB/s (1907MB/s-1907MB/s), io=1066GiB (1144GB), run=600049-600049msec

With the 4k blocksize namespaces there is not a significant higher bandwidth (1800MB/s).

The output of nvme list to show the NVMEs configured with different block sizes:

Code:

# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     194525xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme1n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme2n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0
/dev/nvme3n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0

TL;DR:

Changing the block size of the NVME namespace can improve the performance. Tested with 512b and 4k NVME block sizes and a zfs mirror with a zvol (8k volblocksize).

512b NVME block size: ~46k IOPS, ~1700MB/s bandwidth
4k NVME block size: ~75k IOPS, ~1800MB/s bandwidth

Hi aaron,

i tried this because i've got exact the same disks. The fio test on the ZVOL with option --direct=1 stop with this error:

fio-3.12
Starting 32 processes
fio: Laying out IO file (1 file / 9216MiB)
fio: looks like your file system does not support direct=1/buffered=0
fio: looks like your file system does not support direct=1/buffered=0
fio: looks like your file system does not support direct=1/buffered=0
fio: looks like your file system does not support direct=1/buffered=0
fio: destination does not support O_DIRECT
fio: destination does not support O_DIRECT
fio: destination does not support O_DIRECT
fio: destination does not support O_DIRECT

root@pve01:/dev/zvol/tank0# zfs version
zfs-0.8.5-pve1
zfs-kmod-0.8.5-pve1

root@pve01:/dev/zvol/tank0# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)

The pool was created on the gui with standard settings.

root@pve01:/dev/zvol/tank0# zpool status tank0
pool: tank0
state: ONLINE
scan: none requested
config:

NAME STATE READ WRITE CKSUM
tank0 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme-Micron_9300_MTFDHAL3T2TDR_193223A89845 ONLINE 0 0 0
nvme-Micron_9300_MTFDHAL3T2TDR_2029295253ED ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
nvme-Micron_9300_MTFDHAL3T2TDR_20302982FB07 ONLINE 0 0 0
nvme-Micron_9300_MTFDHAL3T2TDR_20302982F435 ONLINE 0 0 0

errors: No known data errors

Am I missing something here?

Many thanks!

Michael

dcsapak · Jan 29, 2021

mfreund said:
Am I missing something here?

afaics, the tests were all done on a zvol in the zpool not on a file

mfreund · Jan 29, 2021

dcsapak said:
afaics, the tests were all done on a zvol in the zpool not on a file

Many thanks! It's running fine now

guletz · Jan 29, 2021

aaron said:
We get about 1700MB/s Bandwidth.

Hi @Aron

--bs=4m ..... Do you know any application who will write in normal condition(by default) with 4M record size?

Good luck / Bafta !

Alwin Antreich · Feb 1, 2021

guletz said:
--bs=4m ..... Do you know any application who will write in normal condition(by default) with 4M record size?

This is for bandwidth benchmarks, it could have been 1, 2, 3 MB or bigger as well. It was just determined by the testing schema to be 4MB, since the Ceph benchmark paper uses 4MB as well.

And applications normally don't care too much about blocksizes, the filesystem below has to though. There are obviously exceptions.

guletz · Feb 1, 2021

Alwin Antreich said:
just determined by the testing schema to be 4MB, since the Ceph benchmark paper uses 4MB as well.

Hi,

Ceph benchmark use 4 MB because they have take in account how Ceph is designed(Ceph's default block size is 4MB). But what is OK for ceph do not mean that could be OK for any other storge system.

Good luck / Bafta!

Alwin Antreich · Feb 1, 2021

guletz said:
Ceph benchmark use 4 MB because they have take in account how Ceph is designed(Ceph's default block size is 4MB). But what is OK for ceph do not mean that could be OK for any other storge system.

As said, its about bandwidth.

Proxmox VE ZFS Benchmark with NVMe

Proxmox Staff Member

Famous Member

Proxmox Retired Staff

Distinguished Member

Active Member

Active Member

Well-Known Member

Renowned Member

Proxmox Staff Member

IOPS tests​

Bandwidth tests:​

TL;DR:​

Renowned Member

Distinguished Member

Proxmox Staff Member

Well-Known Member

Member

IOPS tests​

Bandwidth tests:​

TL;DR:​

Proxmox Staff Member

Member

Distinguished Member

Well-Known Member

Distinguished Member

Well-Known Member

We value your privacy

IOPS tests

Bandwidth tests:

TL;DR:

IOPS tests

Bandwidth tests:

TL;DR: