Proxmox VE ZFS Benchmark with NVMe

ZFS isn't that fast because it is doing much more and is more concerned about data integrity, and that costs overhead. With mdadm you get no bit rot protection, so your data can silently corrupt over time, no block-level compression, no deduplication, no replication for fast backups or HA, no snapshots, ...
You could add some of those features by putting a filesystem like btrfs on top of your mdadm array or by using qcow images, but then again your mdadm setup will get slower because btrfs creates additional overhead, which you don't get with ZFS because it is all integrated already.
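As a rough sketch of what those integrated features look like in practice (the pool and dataset names here are made up for the example, not from the benchmark setup):

Code:
# zfs set compression=lz4 tank/vmdata                 # transparent block-level compression
# zfs snapshot tank/vmdata@before-upgrade             # instant, space-efficient snapshot
# zfs send tank/vmdata@before-upgrade | ssh backuphost zfs recv backup/vmdata   # replication for backups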
 
As a comparison, I'm running a 15TB mdadm ext4 array on the following hardware:

EPYC 7443P 24-core CPU
256GB 8-channel RAM
8x Samsung PM9A3 Gen4 NVME U.2 1.92TB drives, connected via 2x AOC-SLG4-4E4T-O bifurcation re-timer cards

Seeing ~40GB/s read and ~20GB/s write via a sequential FIO test.
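For reference, a sequential throughput test of that kind usually looks roughly like the following; the mount point and sizes are placeholders, not the exact command used for the numbers above:

Code:
# fio --ioengine=libaio --direct=1 --name=seqread --filename=/mnt/md0/fio.test --size=20G --rw=read --bs=1M --iodepth=32 --numjobs=8 --group_reporting --time_based --runtime=60
# fio --ioengine=libaio --direct=1 --name=seqwrite --filename=/mnt/md0/fio.test --size=20G --rw=write --bs=1M --iodepth=32 --numjobs=8 --group_reporting --time_based --runtime=60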
 
As a comparison, I'm running a 15TB mdadm ext4 array on the following hardware:

EPYC 7443P 24-core CPU
256GB 8-channel RAM
8x Samsung PM9A3 Gen4 NVME U.2 1.92TB drives, connected via 2x AOC-SLG4-4E4T-O bifurcation re-timer cards

Seeing ~40GB/s read and ~20GB/s write via a sequential FIO test.
Which mdadm raid config?
 
With mdadm you don't get bit rot protection so your data can silently corrupt over time
... even worse. If one of your HDD RAID10 members has a non-fatal read error (any HDD has some read errors over time), you will end up with a corrupt block. Even more, these non-fatal read errors (the block can be read, but the data returned is not the data that was written in the past) increase if your HDD temperature rises above normal... one night without your cooling system and you can lose all of your data.
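For what it's worth, on ZFS this is exactly the kind of corruption a periodic scrub surfaces and (on redundant vdevs) repairs; the pool name below is just an example:

Code:
# zpool scrub tank
# zpool status -v tank    # READ/WRITE/CKSUM counters show any corruption found and repaired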

Good luck / Bafta!
 
Since the Micron 9300 that we use in the benchmark paper supports different block sizes that can be configured for the namespaces, we did some additional tests.

IOPS tests


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1

The result of 46k IOPS is in the ballpark of the benchmark paper. So far, no surprise.


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1

As you can see, with the larger 4k block size for the NVME namespace we get ~76k IOPS, which is close to double the IOPS performance.

Bandwidth tests


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1

We get about 1700 MB/s bandwidth.


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1

With the 4k block size namespaces the bandwidth is not significantly higher (1800 MB/s).

512b NVME block size: ~46k IOPS, ~1700MB/s bandwidth
4k NVME block size: ~75k IOPS, ~1800MB/s bandwidth
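For anyone who wants to reproduce the 512b vs. 4k comparison: nvme-cli can list the LBA formats a namespace supports and reformat it. Note that reformatting wipes the namespace; the device name and format index below are examples and differ per drive:

Code:
# nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"   # list supported LBA formats, marks the one in use
# nvme format /dev/nvme0n1 --lbaf=1                # destructive: reformat the namespace to the 4k LBA format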
Thank you for your tests.

For me there is a drawback to the 4k approach, which is the wear level of the NVMe: depending on your data, you write more than you actually need to.

About the bandwidth: to me it looks like your disks are limited by the connection, and only half of the bandwidth the disks are capable of can be utilized. E.g. with PCIe x4 you should have 32 GBit/s of bandwidth, resulting in a theoretical 4 GiB/s. Typically only 3.6 to 3.8 GiB/s can be reached. Halving that 3.6 GiB/s gives roughly the bandwidth you reported. So it looks like you are connected either with an outdated PCIe version (half the bandwidth) or with only x2 lanes to each of your disks, which of course skews the test since the full speed could not be used.

Can you please also clarify the hardware used?
CPU, board, RAM
especially HBA/RAID controller
and the connection to your disks (SAS or NVMe)?
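As a quick check for the lane/generation question, the negotiated PCIe link of each NVMe can be read from lspci; the device address below is only an example:

Code:
# lspci -vv -s 41:00.0 | grep -E "LnkCap|LnkSta"   # compare capable vs. negotiated link, e.g. 8GT/s x4 vs. 8GT/s x2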
 
As a comparison, I'm running a 15TB mdadm ext4 array on the following hardware:

EPYC 7443P 24-core CPU
256GB 8-channel RAM
8x Samsung PM9A3 Gen4 NVME U.2 1.92TB drives, connected via 2x AOC-SLG4-4E4T-O bifurcation re-timer cards

Seeing ~40GB/s read and ~20GB/s write via a sequential FIO test.
On servers, especially with that much firepower, I have never needed such fast sequential bandwidth. Servers with that much firepower typically have tons of workload to calculate, and thus write a lot of data in a non-sequential manner or read a lot of random data with only short sequential streams.

Sequential reading or writing only helps if you have long-running streams of minutes or hours,
e.g. for backups, either reading with little impact on the live environment or writing (accepting backups).

IMHO, besides the sequential bandwidth it would be more helpful to have tests with e.g. 10-20 parallel threads each writing, say, a realistic 500MB of data, if you work on stream-based data (e.g. videos or similar).
On the database side it is far more important to have impressive IOPS with 20 or more parallel streams ...

Do you have such tests too?
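Something along these lines would cover the two workloads described above; paths and sizes are placeholders, not tests that were actually run:

Code:
# fio --ioengine=libaio --direct=1 --name=streams --filename=/tank/fio/stream.test --rw=write --bs=1M --size=500M --numjobs=16 --group_reporting
# fio --ioengine=libaio --direct=1 --name=dblike --filename=/tank/fio/db.test --rw=randrw --bs=8k --size=4G --iodepth=16 --numjobs=20 --group_reporting --time_based --runtime=60

The first run gives 16 parallel ~500MB sequential streams, the second 20 parallel random 8k workers as a database-like load.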
 
Thank you for your tests.

For me there is a drawback to the 4k approach, which is the wear level of the NVMe: depending on your data, you write more than you actually need to.

About the bandwidth: to me it looks like your disks are limited by the connection, and only half of the bandwidth the disks are capable of can be utilized. E.g. with PCIe x4 you should have 32 GBit/s of bandwidth, resulting in a theoretical 4 GiB/s. Typically only 3.6 to 3.8 GiB/s can be reached. Halving that 3.6 GiB/s gives roughly the bandwidth you reported. So it looks like you are connected either with an outdated PCIe version (half the bandwidth) or with only x2 lanes to each of your disks, which of course skews the test since the full speed could not be used.

Can you please also clarify the hardware used?
CPU, board, RAM
especially HBA/RAID controller
and the connection to your disks (SAS or NVMe)?
Back then, I did not test the change in configured block size on the raw disks directly. Don't forget that ZFS and volume datasets do come with a performance penalty. To see what the performance directly on the disk is, I would need to run new benchmarks.

The hardware was exactly the same as the one the benchmark paper was done on.
The board is a Gigabyte MZ32-AR0. The NVMEs are connected directly via the Slimline connectors.
 
I checked your board.
Your board is an X399 chipset, right?
How many PCIe slots have you populated?
Is your slot 7 (nearest to the CPU) free or in use?
 
Your board is an X399 chipset, right?
This is an AMD EPYC board; there is no chipset, as everything is connected to the CPU directly. See the manual, page 8.

The NVMEs are connected to the Slimline connectors in the lower left (16-19 according to the overview on page 6 of the manual). Those ports offer PCIe Gen3, which is okay, as the NVMEs are also only capable of PCIe Gen3.
 
I forgot, EPYCs have the southbridge integrated into the CPU.

Anyway is your PCIe slot 7 free or in use?
 
Anyway is your PCIe slot 7 free or in use?
It is not in use. The slimline connectors right next to it are also free.
 
Hi all, I wonder if I could hijack this thread with a related SSD performance benchmarking question - are my results within expectations? I have 2 identical PVE 7.0-11 hosts, the only difference being the HDD/SSD arrangement. The SSDs are enterprise SATA3 Intel S4520, the HDDs are 7.2K SAS. Full post here: https://forum.proxmox.com/threads/p...1-4-x-ssd-similar-to-raid-z10-12-x-hdd.99967/

Prep:
Code:
zfs create rpool/fio
zfs set primarycache=none rpool/fio

Code:
fio --ioengine=libaio --filename=/rpool/fio/testx --size=4G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randrw  --bs=4K --numjobs=64

SSD results:
Code:
FIO output:
read: IOPS=4022, BW=15.7MiB/s (16.5MB/s)
write: IOPS=4042, BW=15.8MiB/s (16.6MB/s)


# zpool iostat -vy rpool 5 1
                                                        capacity     operations     bandwidth
pool                                                  alloc   free   read  write   read  write
----------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                  216G  27.7T  28.1K  14.5K  1.17G   706M
  raidz1                                               195G  13.8T  13.9K  7.26K   595M   358M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730BAV3P8EGN-part3      -      -  3.60K  1.73K   159M  90.3M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730B9Q3P8EGN-part3      -      -  3.65K  1.82K   150M  89.0M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730B9G3P8EGN-part3      -      -  3.35K  1.83K   147M  90.0M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730BAT3P8EGN-part3      -      -  3.34K  1.89K   139M  88.4M
  raidz1                                              21.3G  13.9T  14.2K  7.21K   604M   348M
    sde                                                   -      -  3.39K  1.81K   149M  87.5M
    sdf                                                   -      -  3.35K  1.90K   139M  86.3M
    sdg                                                   -      -  3.71K  1.70K   163M  87.8M
    sdh                                                   -      -  3.69K  1.81K   152M  86.4M
----------------------------------------------------  -----  -----  -----  -----  -----  -----

HDD results:
Code:
FIO output:
read: IOPS=1382, BW=5531KiB/s
write: IOPS=1385, BW=5542KiB/s

$ zpool iostat -vy rpool 5 1
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                              160G  18.0T  3.07K  2.71K   393M   228M
  mirror                          32.2G  3.59T    624    589  78.0M  40.2M
    scsi-35000c500de5c67f7-part3      -      -    321    295  40.1M  20.4M
    scsi-35000c500de75a863-part3      -      -    303    293  37.9M  19.7M
  mirror                          31.9G  3.59T    625    551  78.2M  49.9M
    scsi-35000c500de2bd6bb-part3      -      -    313    274  39.1M  24.2M
    scsi-35000c500de5ae5a7-part3      -      -    312    277  39.0M  25.7M
  mirror                          32.2G  3.59T    648    548  81.1M  45.9M
    scsi-35000c500de5ae667-part3      -      -    320    279  40.1M  23.0M
    scsi-35000c500de2bd2d3-part3      -      -    328    268  41.0M  23.0M
  mirror                          31.6G  3.59T    612    536  76.5M  45.5M
    scsi-35000c500de5ef20f-part3      -      -    301    266  37.7M  22.7M
    scsi-35000c500de5edbfb-part3      -      -    310    269  38.9M  22.8M
  mirror                          32.0G  3.59T    629    555  78.7M  46.5M
    scsi-35000c500de5c6f7f-part3      -      -    318    283  39.8M  23.1M
    scsi-35000c500de5c6c5f-part3      -      -    311    272  38.9M  23.4M
--------------------------------  -----  -----  -----  -----  -----  -----

I'd have thought the SSDs should deliver about 10x the IOPS shown above - are my expectations out of whack? Any insights appreciated! Thanks!

Hi,

You must design your pool keeping in mind how ZFS works!
You are testing with SYNC writes (direct I/O). In this case, for ANY block that needs to be written, ZFS will:
- first, write it to a "special zone", the ZIL (ZFS intent log)
- second, write the same block again during the normal flush (default 5 sec) from the ZFS cache to the pool

If you need high IOPS, you will get better results using a dedicated SLOG device with high IOPS.

As a side note, in the real world you will see/need SYNC writes when you use a database. Most of them write SYNC with 8k (Oracle, PostgreSQL) or 16k (MySQL/Percona) blocks. In such a case, you will set up your dataset for this block size.
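A minimal sketch of both suggestions; the device path and dataset names are only examples, not a recommendation for specific hardware:

Code:
# zpool add rpool log /dev/nvme0n1p1          # dedicated SLOG device that absorbs the sync writes
# zfs create -o recordsize=16k rpool/mysql    # match the dataset block size to the database (16k for MySQL/InnoDB)
# zfs create -o recordsize=8k rpool/pgsql     # 8k for PostgreSQL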

Good luck / Bafta!
 
Could we get, as a feature, formatting the NVMe to 4096-byte blocks if needed, following:

https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html#nvme-low-level-formatting

It seems that automatically detecting the optimal block size and reformatting before installing is doable. Would be a nice feature; it's always nice for defaults to work optimally.

edit: well, today I learned why - I blew away a new Proxmox install to set the NVMe blocks to 4k (instead of 512) and Proxmox wouldn't install - apparently booting ZFS off that does not work.
 
Could we get, as a feature, formatting the NVMe to 4096-byte blocks if needed, following:

https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html#nvme-low-level-formatting

It seems that automatically detecting the optimal block size and reformatting before installing is doable. Would be a nice feature; it's always nice for defaults to work optimally.

edit: well, today I learned why - I blew away a new Proxmox install to set the NVMe blocks to 4k (instead of 512) and Proxmox wouldn't install - apparently booting ZFS off that does not work.

Please open an enhancement / feature request at our bugtracker so we can keep track of it and discuss technicalities there. I am not sure if that would work with consumer NVMEs as there are other NVME features which they usually don't support. But I don't have an empty one at hand to verify that quickly myself.
 
