Proxmox VE ZFS Benchmark with NVMe

martin

Proxmox Staff Member
To get the best performance in hyper-converged deployments with Proxmox VE and ZFS storage, the right hardware setup is essential. This benchmark presents a possible setup and its resulting performance, with the intention of helping Proxmox users make better decisions.

Download PDF
https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020
__________________
Best regards,

Martin Maurer
Proxmox VE project leader
 
Nice, very useful! What ashift and ZFS recordsize were used? I experimented a little with the ZFS recordsize but was not sure whether it should be changed from its default value of 128K.
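
For reference, both values are easy to check on an existing pool; a quick sketch (the pool/dataset name "tank" is an assumption). ashift is a pool/vdev property fixed at pool creation time, while recordsize is a per-dataset property (default 128K) that only affects newly written data and can be changed at any time:

Code:
# zpool get ashift tank
# zfs get recordsize tank
# zfs set recordsize=128K tank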
 
FAQ on page 8:
Can I use consumer or pro-sumer SSDs, as these are much cheaper than enterprise-class SSDs?
No. Never. These SSDs won't provide the required performance, reliability or endurance. See the fio results from before and/or run your own fio tests.
Looks like 90% of the people on this forum don't get this. ^^
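
If in doubt, a quick way to check a drive yourself is a 4k sync write test like the one used in the benchmark paper; a minimal sketch (the device path is an assumption, and writing directly to the device destroys its contents):

Code:
# fio --ioengine=psync --filename=/dev/nvmeXn1 --rw=write --bs=4k --direct=1 --sync=1 --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting --name=ssd-sync-test

Enterprise SSDs with power-loss protection typically sustain high sync write IOPS, while consumer drives often drop by an order of magnitude or more in this test.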
 
Hello everyone, and Happy New Year!

When it is not possible to use the motherboard's embedded disk controller, which HBA card is recommended for a ZFS replication setup?
Is there another possible technology, like an NVMe card?
Does anyone have a good controller card experience to share?

Thank you
 

There is no simple answer, because it all depends on your objectives and hardware constraints. Do you have full PCIe 3.0 x16 slots open? How many? Are they restricted when using the onboard M.2 slots or SATA ports? Are they bifurcated or not? Do you want to use M.2 NVMe drives or 2.5" NVMe drives? How many drives? So many questions when going down the do-it-yourself road.

For one of my test systems I had a PCIe 3.0 x8 slot open, so I decided to roll the dice and copy a setup I saw on Amazon. It got me 4 x 1TB NVMe drives behind 1 HBA in a smallish deskside system:
  • 1 x PCIe 3.0 adapter, 4 "multiplexed" x4 over x8 - link
  • 4 x 1TB Intel DC P4510 2.5" NVMe drives - link
  • 4 cables - SFF-8643 to SFF-8639 - link
It probably takes more of a beating than most of our production systems; so far, no issues. I do question multiplexing 16 lanes (4 x4 drives) of PCIe 3.0 through a single x8 PCIe 3.0 slot. Speedy, but I never benchmarked it; using datacenter drives makes a big difference too. It is also a no-name card from China, but it appears to be solid as well. There is a PCIe 3.0 x16 version rather than x8, and there is a price difference.

Not sure what your objectives are, but if you are building a homelab system and want a robust do-it-yourself disk subsystem, the above might work for you.
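
If you ever want to verify what link each drive actually negotiated behind such an adapter, lspci can show it; a quick sketch (the PCI address is an assumption, run as root to see the link fields):

Code:
# lspci | grep -i 'non-volatile'
# lspci -s 01:00.0 -vv | grep -E 'LnkCap|LnkSta'

LnkCap shows what the device is capable of, LnkSta what was actually negotiated (e.g. x4 at 8GT/s for a PCIe 3.0 drive).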
 
Since the Micron 9300 NVMe drives that we use in the benchmark paper support different block sizes that can be configured for the namespaces, we did some testing to see how they affect performance.

We tested the setup as in the benchmark paper:
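
For orientation, the pool and zvol used as the fio target below can be recreated along these lines (device paths and the zvol size are assumptions; the exact setup is described in the benchmark paper):

Code:
# zpool create tank mirror /dev/nvme0n1 /dev/nvme1n1
# zfs create -V 20G tank/test

The zvol then shows up as /dev/zvol/tank/test with the default 8k volblocksize.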

IOPS tests

One mirror pool on the NVMes with the default 512B block size and a zvol with the default 8k volblocksize:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=181MiB/s][w=46.3k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=95469: Fri Jan 15 12:54:59 2021
  write: IOPS=66.2k, BW=258MiB/s (271MB/s)(151GiB/600002msec); 0 zone resets
    clat (usec): min=59, max=143209, avg=482.78, stdev=1112.45
     lat (usec): min=59, max=143210, avg=482.95, stdev=1112.45
    clat percentiles (usec):
     |  1.00th=[  215],  5.00th=[  245], 10.00th=[  265], 20.00th=[  293],
     | 30.00th=[  322], 40.00th=[  359], 50.00th=[  404], 60.00th=[  445],
     | 70.00th=[  494], 80.00th=[  578], 90.00th=[  742], 95.00th=[  996],
     | 99.00th=[ 1401], 99.50th=[ 1614], 99.90th=[ 5145], 99.95th=[ 7701],
     | 99.99th=[12649]
   bw (  KiB/s): min= 4856, max=11424, per=3.13%, avg=8270.95, stdev=1912.80, samples=38369
   iops        : min= 1214, max= 2856, avg=2067.72, stdev=478.20, samples=38369
  lat (usec)   : 100=0.03%, 250=5.97%, 500=64.97%, 750=19.27%, 1000=4.77%
  lat (msec)   : 2=4.76%, 4=0.11%, 10=0.09%, 20=0.01%, 100=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.45%, sys=24.12%, ctx=281344506, majf=0, minf=349
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,39696863,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=258MiB/s (271MB/s), 258MiB/s-258MiB/s (271MB/s-271MB/s), io=151GiB (163GB), run=600002-600002msec

The result of 46k IOPS is in the ballpark of the result of the benchmark paper. So far no surprise.

Recreating the test on the same kind of NVMEs but with the block size set to 4k:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=297MiB/s][w=75.9k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=121126: Fri Jan 15 13:20:53 2021
  write: IOPS=85.9k, BW=335MiB/s (352MB/s)(197GiB/600002msec); 0 zone resets
    clat (usec): min=58, max=144238, avg=371.76, stdev=1097.88
     lat (usec): min=58, max=144238, avg=371.95, stdev=1097.88
    clat percentiles (usec):
     |  1.00th=[  196],  5.00th=[  225], 10.00th=[  239], 20.00th=[  258],
     | 30.00th=[  273], 40.00th=[  289], 50.00th=[  302], 60.00th=[  322],
     | 70.00th=[  347], 80.00th=[  392], 90.00th=[  498], 95.00th=[  676],
     | 99.00th=[ 1287], 99.50th=[ 1532], 99.90th=[ 5932], 99.95th=[ 8029],
     | 99.99th=[12387]
   bw (  KiB/s): min= 7456, max=11680, per=3.12%, avg=10730.64, stdev=866.24, samples=38374
   iops        : min= 1864, max= 2920, avg=2682.64, stdev=216.56, samples=38374
  lat (usec)   : 100=0.09%, 250=15.87%, 500=74.23%, 750=5.73%, 1000=1.90%
  lat (msec)   : 2=1.98%, 4=0.06%, 10=0.11%, 20=0.02%, 50=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.56%, sys=37.53%, ctx=339985859, majf=0, minf=404
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,51513479,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=335MiB/s (352MB/s), 335MiB/s-335MiB/s (352MB/s-352MB/s), io=197GiB (211GB), run=600002-600002msec

As you can see, using the larger 4k block size for the NVMe namespace we get ~76k IOPS, which is close to double the IOPS performance.

Bandwidth tests:

NVME 512b blocksize pool:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1810MiB/s][w=452 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=81902: Fri Jan 15 12:44:31 2021
  write: IOPS=431, BW=1727MiB/s (1811MB/s)(1012GiB/600016msec); 0 zone resets
    clat (msec): min=2, max=403, avg=73.90, stdev=22.84
     lat (msec): min=3, max=403, avg=74.10, stdev=22.86
    clat percentiles (msec):
     |  1.00th=[   43],  5.00th=[   49], 10.00th=[   55], 20.00th=[   59],
     | 30.00th=[   64], 40.00th=[   68], 50.00th=[   72], 60.00th=[   77],
     | 70.00th=[   82], 80.00th=[   86], 90.00th=[   89], 95.00th=[   93],
     | 99.00th=[  197], 99.50th=[  222], 99.90th=[  271], 99.95th=[  284],
     | 99.99th=[  313]
   bw (  KiB/s): min= 8192, max=98304, per=3.12%, avg=55256.56, stdev=12647.86, samples=38399
   iops        : min=    2, max=   24, avg=13.43, stdev= 3.10, samples=38399
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.06%, 50=5.71%, 100=90.53%
  lat (msec)   : 250=3.46%, 500=0.21%
  cpu          : usr=0.27%, sys=4.86%, ctx=3678933, majf=0, minf=359
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,259085,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1727MiB/s (1811MB/s), 1727MiB/s-1727MiB/s (1811MB/s-1811MB/s), io=1012GiB (1087GB), run=600016-600016msec

We get about 1700MB/s bandwidth.

NVME 4k blocksize pool:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1280MiB/s][w=320 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=201124: Fri Jan 15 12:29:19 2021
  write: IOPS=454, BW=1818MiB/s (1907MB/s)(1066GiB/600049msec); 0 zone resets
    clat (msec): min=3, max=411, avg=70.14, stdev=25.82
     lat (msec): min=3, max=411, avg=70.39, stdev=25.83
    clat percentiles (msec):
     |  1.00th=[   46],  5.00th=[   52], 10.00th=[   54], 20.00th=[   58],
     | 30.00th=[   61], 40.00th=[   64], 50.00th=[   67], 60.00th=[   70],
     | 70.00th=[   75], 80.00th=[   79], 90.00th=[   83], 95.00th=[   89],
     | 99.00th=[  230], 99.50th=[  271], 99.90th=[  338], 99.95th=[  359],
     | 99.99th=[  388]
   bw (  KiB/s): min=16384, max=98304, per=3.12%, avg=58180.81, stdev=12074.67, samples=38400
   iops        : min=    4, max=   24, avg=14.17, stdev= 2.95, samples=38400
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.03%, 50=3.77%, 100=94.07%
  lat (msec)   : 250=1.37%, 500=0.75%
  cpu          : usr=0.34%, sys=6.25%, ctx=3648814, majf=0, minf=346
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,272779,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1818MiB/s (1907MB/s), 1818MiB/s-1818MiB/s (1907MB/s-1907MB/s), io=1066GiB (1144GB), run=600049-600049msec

With the 4k blocksize namespaces, the bandwidth is not significantly higher (~1800MB/s).

The output of nvme list shows the NVMes configured with the different block sizes:

Code:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     194525xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme1n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme2n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0
/dev/nvme3n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0
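
To see which LBA formats a given drive supports, and which one is currently in use, nvme-cli can list them; a quick sketch (the device path is an assumption):

Code:
# nvme id-ns /dev/nvme0n1 -H | grep 'LBA Format'

The format marked "(in use)" is the currently active one; drives like the Micron 9300 list both a 512 byte and a 4 KiB data size.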

TL;DR:

Changing the block size of the NVMe namespace can improve performance. We tested 512B and 4k NVMe block sizes with a ZFS mirror and a zvol (8k volblocksize).

512b NVME block size: ~46k IOPS, ~1700MB/s bandwidth
4k NVME block size: ~75k IOPS, ~1800MB/s bandwidth
 
Hi,
is it possible to select the 4k NVMe block size during the PVE install on an NVMe-only system?
Or must I set the block size beforehand (I guess)?
Udo

This needs to be done before installing, and how to do it probably depends on the NVMe drives used. For the Micron NVMes we have, we used Micron's msecli tool to delete the (default) namespace and create a new namespace with the larger block size.
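
On drives that expose the larger LBA format directly, the generic nvme-cli tooling can be an alternative to the vendor tool; a hedged sketch (the device path and LBA format index are assumptions, the right index is the 4 KiB one reported by nvme id-ns -H, and reformatting destroys all data on the namespace):

Code:
# nvme format /dev/nvme0n1 --lbaf=1

Whether a plain format is enough or the namespace has to be deleted and recreated (as with msecli here) depends on the drive's firmware.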
 
Hi aaron,

I tried this because I've got exactly the same disks. The fio test on the zvol with the option --direct=1 stops with this error:


fio-3.12
Starting 32 processes
fio: Laying out IO file (1 file / 9216MiB)
fio: looks like your file system does not support direct=1/buffered=0
fio: looks like your file system does not support direct=1/buffered=0
fio: looks like your file system does not support direct=1/buffered=0
fio: looks like your file system does not support direct=1/buffered=0
fio: destination does not support O_DIRECT
fio: destination does not support O_DIRECT
fio: destination does not support O_DIRECT
fio: destination does not support O_DIRECT

root@pve01:/dev/zvol/tank0# zfs version
zfs-0.8.5-pve1
zfs-kmod-0.8.5-pve1

root@pve01:/dev/zvol/tank0# pveversion
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)

The pool was created on the gui with standard settings.

root@pve01:/dev/zvol/tank0# zpool status tank0
  pool: tank0
 state: ONLINE
  scan: none requested
config:

        NAME                                             STATE     READ WRITE CKSUM
        tank0                                            ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            nvme-Micron_9300_MTFDHAL3T2TDR_193223A89845  ONLINE       0     0     0
            nvme-Micron_9300_MTFDHAL3T2TDR_2029295253ED  ONLINE       0     0     0
          mirror-1                                       ONLINE       0     0     0
            nvme-Micron_9300_MTFDHAL3T2TDR_20302982FB07  ONLINE       0     0     0
            nvme-Micron_9300_MTFDHAL3T2TDR_20302982F435  ONLINE       0     0     0

errors: No known data errors


Am I missing something here?

Many thanks!

Michael
 
--bs=4m ..... Do you know of any application that will write, under normal conditions (by default), with a 4M record size?
This is for the bandwidth benchmarks; it could have been 1, 2, 3 MB or bigger as well. It was just determined by the testing schema to be 4MB, since the Ceph benchmark paper uses 4MB as well.

And applications normally don't care too much about block sizes; the filesystem below them has to, though. There are obviously exceptions. ;)
 
just determined by the testing schema to be 4MB, since the Ceph benchmark paper uses 4MB as well.

Hi,

The Ceph benchmark uses 4 MB because it takes into account how Ceph is designed (Ceph's default block size is 4MB). But what is OK for Ceph does not mean it will be OK for any other storage system.

Good luck / Bafta!
 
