Proxmox VE ZFS Benchmark with NVMe

martin

Proxmox Staff Member
To optimize performance in hyper-converged deployments with Proxmox VE and ZFS storage, the appropriate hardware setup is essential. This benchmark presents a possible setup and its resulting performance, with the intention of supporting Proxmox users in making better decisions.

Download PDF
https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020
__________________
Best regards,

Martin Maurer
Proxmox VE project leader
 

jsterr

New Member
Nice, very useful! What ashift and ZFS recordsize were used? I experimented a little with the ZFS recordsize but was not sure whether it should be changed from its default value of 128K.
 

Dunuin

Active Member
FAQ on page 8:
"Can I use consumer or prosumer SSDs, as these are much cheaper than enterprise-class SSDs?
No. Never. These SSDs won't provide the required performance, reliability, or endurance. See the fio results from before and/or run your own fio tests."
Looks like 90% of the people on this forum don't get this. ^^
 

auranext

Member
Hello everyone, and happy New Year!

When it is not possible to use the motherboard's embedded disk controller, which HBA card is recommended for a ZFS replication setup?
Is there another possible technology, like an NVMe card?
Does anyone have a good controller card experience to share?

Thank you
 
auranext said:
"When it is not possible to use the motherboard's embedded disk controller, which HBA card is recommended for a ZFS replication setup? Is there another possible technology, like an NVMe card? Does anyone have a good controller card experience to share?"

No simple answer, because it all depends on your objectives and hardware constraints. Do you have full PCIe 3.0 x16 slots open? How many? Are they restricted when using the onboard M.2 slots or SATA ports? Are they bifurcated or non-bifurcated? Do you want to use M.2 NVMe drives or 2.5" NVMe drives? How many drives? So many questions when going down the do-it-yourself road.

For one of my test systems I had a PCIe 3.0 x8 slot open, so I decided to roll the dice and copy a setup I saw on Amazon. It got me 4 x 1TB NVMe drives with 1 HBA in a smallish deskside system:
  • 1 x PCIe 3.0 adapter, 4 "multiplexed" x4 over x8 - link
  • 4 x 1TB Intel DC P4510 2.5" NVMe drives - link
  • 4 x cables, SFF-8643 to SFF-8639 - link
It probably takes more of a beating than most of our production systems; so far, no issues. I do question multiplexing 16 lanes (4 x4) of PCIe 3.0 through a single x8 slot. Speedy, but I never benchmarked it; using datacenter drives makes a big difference too. It is also a no-name card from China, but it appears to be solid as well. There is a PCIe 3.0 x16 version rather than x8, and there is a price difference.

Not sure what your objectives are, but if building a homelab system and want a robust do-it-yourself disk subsystem, the above might work for you.
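One thing worth checking with such a multiplexed adapter is what the shared uplink can theoretically carry, and whether the card actually negotiated the full link width. A rough sketch (the PCI address below is a placeholder, and PCIe 3.0 delivers roughly 985 MB/s of usable bandwidth per lane):
Code:
```shell
# Theoretical ceiling of the shared PCIe 3.0 x8 uplink
# (~985 MB/s per lane usable after 128b/130b encoding overhead):
echo "x8 uplink ceiling: $(( 8 * 985 )) MB/s"

# Verify the negotiated link speed/width of the adapter
# (01:00.0 is a placeholder address; find yours in plain lspci output):
# lspci -s 01:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
```
If LnkSta reports a lower width than LnkCap, the card ended up in a slower slot or the board is sharing those lanes elsewhere.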
 

aaron

Proxmox Staff Member
Since the Micron 9300 NVMe drives that we use in the benchmark paper support different block sizes for their namespaces, we did some testing to see how this setting affects performance.

We tested the setup as in the benchmark paper:

IOPS tests

A mirror pool on the NVMes with the default 512b block size, and a zvol with the default 8k volblocksize:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=181MiB/s][w=46.3k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=95469: Fri Jan 15 12:54:59 2021
  write: IOPS=66.2k, BW=258MiB/s (271MB/s)(151GiB/600002msec); 0 zone resets
    clat (usec): min=59, max=143209, avg=482.78, stdev=1112.45
     lat (usec): min=59, max=143210, avg=482.95, stdev=1112.45
    clat percentiles (usec):
     |  1.00th=[  215],  5.00th=[  245], 10.00th=[  265], 20.00th=[  293],
     | 30.00th=[  322], 40.00th=[  359], 50.00th=[  404], 60.00th=[  445],
     | 70.00th=[  494], 80.00th=[  578], 90.00th=[  742], 95.00th=[  996],
     | 99.00th=[ 1401], 99.50th=[ 1614], 99.90th=[ 5145], 99.95th=[ 7701],
     | 99.99th=[12649]
   bw (  KiB/s): min= 4856, max=11424, per=3.13%, avg=8270.95, stdev=1912.80, samples=38369
   iops        : min= 1214, max= 2856, avg=2067.72, stdev=478.20, samples=38369
  lat (usec)   : 100=0.03%, 250=5.97%, 500=64.97%, 750=19.27%, 1000=4.77%
  lat (msec)   : 2=4.76%, 4=0.11%, 10=0.09%, 20=0.01%, 100=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.45%, sys=24.12%, ctx=281344506, majf=0, minf=349
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,39696863,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=258MiB/s (271MB/s), 258MiB/s-258MiB/s (271MB/s-271MB/s), io=151GiB (163GB), run=600002-600002msec

The result of ~46k IOPS is in the ballpark of the benchmark paper's result. So far, no surprise.

Recreating the test on the same kind of NVMes, but with the namespace block size set to 4k:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=297MiB/s][w=75.9k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=121126: Fri Jan 15 13:20:53 2021
  write: IOPS=85.9k, BW=335MiB/s (352MB/s)(197GiB/600002msec); 0 zone resets
    clat (usec): min=58, max=144238, avg=371.76, stdev=1097.88
     lat (usec): min=58, max=144238, avg=371.95, stdev=1097.88
    clat percentiles (usec):
     |  1.00th=[  196],  5.00th=[  225], 10.00th=[  239], 20.00th=[  258],
     | 30.00th=[  273], 40.00th=[  289], 50.00th=[  302], 60.00th=[  322],
     | 70.00th=[  347], 80.00th=[  392], 90.00th=[  498], 95.00th=[  676],
     | 99.00th=[ 1287], 99.50th=[ 1532], 99.90th=[ 5932], 99.95th=[ 8029],
     | 99.99th=[12387]
   bw (  KiB/s): min= 7456, max=11680, per=3.12%, avg=10730.64, stdev=866.24, samples=38374
   iops        : min= 1864, max= 2920, avg=2682.64, stdev=216.56, samples=38374
  lat (usec)   : 100=0.09%, 250=15.87%, 500=74.23%, 750=5.73%, 1000=1.90%
  lat (msec)   : 2=1.98%, 4=0.06%, 10=0.11%, 20=0.02%, 50=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.56%, sys=37.53%, ctx=339985859, majf=0, minf=404
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,51513479,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=335MiB/s (352MB/s), 335MiB/s-335MiB/s (352MB/s-352MB/s), io=197GiB (211GB), run=600002-600002msec

As you can see, with the larger 4k block size for the NVMe namespace we get ~76k IOPS, which is close to double the IOPS performance.
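As a quick plausibility check, the instantaneous bandwidth shown in the fio status lines follows directly from IOPS x block size (shell arithmetic, numbers taken from the two runs above; integer division rounds down):
Code:
```shell
# bandwidth in MiB/s ~= IOPS * block size (bs=4k, i.e. 4 KiB)
# 512b-namespace run, 46.3k IOPS:
echo "$(( 46300 * 4 / 1024 )) MiB/s"   # prints "180 MiB/s" (status line: w=181MiB/s)
# 4k-namespace run, 75.9k IOPS:
echo "$(( 75900 * 4 / 1024 )) MiB/s"   # prints "296 MiB/s" (status line: w=297MiB/s)
```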

Bandwidth tests

NVME 512b blocksize pool:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1810MiB/s][w=452 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=81902: Fri Jan 15 12:44:31 2021
  write: IOPS=431, BW=1727MiB/s (1811MB/s)(1012GiB/600016msec); 0 zone resets
    clat (msec): min=2, max=403, avg=73.90, stdev=22.84
     lat (msec): min=3, max=403, avg=74.10, stdev=22.86
    clat percentiles (msec):
     |  1.00th=[   43],  5.00th=[   49], 10.00th=[   55], 20.00th=[   59],
     | 30.00th=[   64], 40.00th=[   68], 50.00th=[   72], 60.00th=[   77],
     | 70.00th=[   82], 80.00th=[   86], 90.00th=[   89], 95.00th=[   93],
     | 99.00th=[  197], 99.50th=[  222], 99.90th=[  271], 99.95th=[  284],
     | 99.99th=[  313]
   bw (  KiB/s): min= 8192, max=98304, per=3.12%, avg=55256.56, stdev=12647.86, samples=38399
   iops        : min=    2, max=   24, avg=13.43, stdev= 3.10, samples=38399
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.06%, 50=5.71%, 100=90.53%
  lat (msec)   : 250=3.46%, 500=0.21%
  cpu          : usr=0.27%, sys=4.86%, ctx=3678933, majf=0, minf=359
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,259085,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1727MiB/s (1811MB/s), 1727MiB/s-1727MiB/s (1811MB/s-1811MB/s), io=1012GiB (1087GB), run=600016-600016msec

We get about 1700 MB/s bandwidth.

NVME 4k blocksize pool
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1280MiB/s][w=320 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=201124: Fri Jan 15 12:29:19 2021
  write: IOPS=454, BW=1818MiB/s (1907MB/s)(1066GiB/600049msec); 0 zone resets
    clat (msec): min=3, max=411, avg=70.14, stdev=25.82
     lat (msec): min=3, max=411, avg=70.39, stdev=25.83
    clat percentiles (msec):
     |  1.00th=[   46],  5.00th=[   52], 10.00th=[   54], 20.00th=[   58],
     | 30.00th=[   61], 40.00th=[   64], 50.00th=[   67], 60.00th=[   70],
     | 70.00th=[   75], 80.00th=[   79], 90.00th=[   83], 95.00th=[   89],
     | 99.00th=[  230], 99.50th=[  271], 99.90th=[  338], 99.95th=[  359],
     | 99.99th=[  388]
   bw (  KiB/s): min=16384, max=98304, per=3.12%, avg=58180.81, stdev=12074.67, samples=38400
   iops        : min=    4, max=   24, avg=14.17, stdev= 2.95, samples=38400
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.03%, 50=3.77%, 100=94.07%
  lat (msec)   : 250=1.37%, 500=0.75%
  cpu          : usr=0.34%, sys=6.25%, ctx=3648814, majf=0, minf=346
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,272779,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1818MiB/s (1907MB/s), 1818MiB/s-1818MiB/s (1907MB/s-1907MB/s), io=1066GiB (1144GB), run=600049-600049msec

With the 4k blocksize namespaces, bandwidth is not significantly higher (~1800 MB/s).

The output of nvme list shows the NVMes configured with the different block sizes:

Code:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     194525xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme1n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme2n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0
/dev/nvme3n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0

TL;DR:

Changing the block size of the NVMe namespace can improve performance. Tested with 512b and 4k NVMe block sizes and a ZFS mirror with a zvol (8k volblocksize):

512b NVMe block size: ~46k IOPS, ~1700 MB/s bandwidth
4k NVMe block size: ~75k IOPS, ~1800 MB/s bandwidth
 

aaron

udo said:
"Hi,
is it possible to select the 4k NVMe block size during pve install on an NVMe-only system?
Or must I set the block size beforehand (I guess)?
Udo"

This needs to be done before installing, and probably depends on the NVMe used. For the Micron NVMes we have, we used Micron's msecli tool to delete the (default) namespace and create a new namespace with the larger block size.
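For drives without a dedicated vendor tool, the generic nvme-cli package can usually do the same. A sketch, assuming the device path is /dev/nvme0n1 and the 4 KiB LBA format has index 1 (both are assumptions, check your drive's output first; formatting wipes the namespace):
Code:
```shell
# 1) Show the LBA formats the drive supports and find the 4 KiB one:
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"

# 2) Reformat the namespace to that index (assumed to be 1 here).
#    WARNING: this destroys all data on the namespace.
# nvme format /dev/nvme0n1 --lbaf=1

# 3) Confirm the new format:
nvme list
```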
 
udo said:
"is it possible to select the 4k NVMe block size during pve install on an NVMe-only system?
Or must I set the block size beforehand (I guess)?"
To clarify: @aaron changed the format of the NVMe SSD itself; the volblocksize of the zvol was still 8K.
 
