Proxmox VE ZFS Benchmark with NVMe

ZFS isn't that fast because it is doing much more and is more concerned about data integrity, and that costs overhead. With mdadm you get no bit rot protection, so your data can silently corrupt over time, no block-level compression, no deduplication, no replication for fast backups or HA, no snapshots, ...
You could add some of those features by putting a filesystem like btrfs on top of your mdadm array or by using qcow images, but then again your mdadm setup will get slower because btrfs creates additional overhead, which you don't get with ZFS because it is all integrated already.
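As a rough sketch of what those integrated features look like in practice (the pool and dataset names here are made up for the example, not from the benchmark setup):

Code:
# zfs set compression=lz4 tank/vmdata                 # transparent block-level compression
# zfs snapshot tank/vmdata@before-upgrade             # instant, space-efficient snapshot
# zfs send tank/vmdata@before-upgrade | ssh backuphost zfs recv backup/vmdata   # replication for backups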
 
As a comparison, I'm running a 15TB mdadm ext4 array on the following hardware:

EPYC 7443P 24-core CPU
256GB 8-channel RAM
8x Samsung PM9A3 Gen4 NVME U.2 1.92TB drives, connected via 2x AOC-SLG4-4E4T-O bifurcation re-timer cards

Seeing ~40GB/s read and ~20GB/s write via a sequential FIO test.
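For reference, a sequential throughput test of that kind usually looks roughly like the following; the mount point and sizes are placeholders, not the exact command used for the numbers above:

Code:
# fio --ioengine=libaio --direct=1 --name=seqread --filename=/mnt/md0/fio.test --size=20G --rw=read --bs=1M --iodepth=32 --numjobs=8 --group_reporting --time_based --runtime=60
# fio --ioengine=libaio --direct=1 --name=seqwrite --filename=/mnt/md0/fio.test --size=20G --rw=write --bs=1M --iodepth=32 --numjobs=8 --group_reporting --time_based --runtime=60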
 
As a comparison, I'm running a 15TB mdadm ext4 array on the following hardware:

EPYC 7443P 24-core CPU
256GB 8-channel RAM
8x Samsung PM9A3 Gen4 NVME U.2 1.92TB drives, connected via 2x AOC-SLG4-4E4T-O bifurcation re-timer cards

Seeing ~40GB/s read and ~20GB/s write via a sequential FIO test.
Which mdadm raid config?
 
With mdadm you don't get bit rot protection so your data can silently corrupt over time
... even worse. If one of your HDD RAID10 members has a non-fatal read error (any HDD has some read errors over time), you will end up with a corrupt block. Even more, these non-fatal read errors (the block can be read, but the data returned is not the data that was written in the past) increase if your HDD temperature rises above normal... one night without your cooling system and you can lose all of your data.
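For what it's worth, on ZFS this is exactly the kind of corruption a periodic scrub surfaces and (on redundant vdevs) repairs; the pool name below is just an example:

Code:
# zpool scrub tank
# zpool status -v tank    # READ/WRITE/CKSUM counters show any corruption found and repaired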

Good luck / Bafta!
 
Since the Micron 9300 that we use in the benchmark paper supports different block sizes that can be configured for the namespaces, we did some additional tests.

IOPS tests


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1

The result of 46k IOPS is in the ballpark of the benchmark paper. So far, no surprise.


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1

As you can see, with the larger 4k block size for the NVME namespace we get ~76k IOPS, which is close to double the IOPS performance.

Bandwidth tests


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1

We get about 1700 MB/s bandwidth.


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1

With the 4k block size namespaces the bandwidth is not significantly higher (1800 MB/s).

512b NVME block size: ~46k IOPS, ~1700MB/s bandwidth
4k NVME block size: ~75k IOPS, ~1800MB/s bandwidth
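For anyone who wants to reproduce the 512b vs. 4k comparison: nvme-cli can list the LBA formats a namespace supports and reformat it. Note that reformatting wipes the namespace; the device name and format index below are examples and differ per drive:

Code:
# nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"   # list supported LBA formats, marks the one in use
# nvme format /dev/nvme0n1 --lbaf=1                # destructive: reformat the namespace to the 4k LBA format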
Thank you for your tests.

For me there is a drawback to the 4k approach, which is the wear level of the NVMe: depending on your data, you write more than you actually need to.

About the bandwidth: to me it looks like your disks are limited by the connection, and only half of the bandwidth the disks are capable of can be utilized. E.g. with PCIe x4 you should have 32 GBit/s of bandwidth, resulting in a theoretical 4 GiB/s. Typically only 3.6 to 3.8 GiB/s can be reached. Halving that 3.6 GiB/s gives roughly the bandwidth you reported. So it looks like you are connected either with an outdated PCIe version (half the bandwidth) or with only x2 lanes to each of your disks, which of course skews the test since the full speed could not be used.

Can you please also clarify the hardware used?
CPU, board, RAM
especially HBA/RAID controller
and the connection to your disks (SAS or NVMe)?
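As a quick check for the lane/generation question, the negotiated PCIe link of each NVMe can be read from lspci; the device address below is only an example:

Code:
# lspci -vv -s 41:00.0 | grep -E "LnkCap|LnkSta"   # compare capable vs. negotiated link, e.g. 8GT/s x4 vs. 8GT/s x2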
 
As a comparison, I'm running a 15TB mdadm ext4 array on the following hardware:

EPYC 7443P 24-core CPU
256GB 8-channel RAM
8x Samsung PM9A3 Gen4 NVME U.2 1.92TB drives, connected via 2x AOC-SLG4-4E4T-O bifurcation re-timer cards

Seeing ~40GB/s read and ~20GB/s write via a sequential FIO test.
On servers, especially with that much firepower, I have never needed such fast sequential bandwidth. Servers with that much firepower typically have tons of workload to calculate, and thus write a lot of data in a non-sequential manner or read a lot of random data with only short sequential streams.

Sequential reading or writing only helps if you have long-running streams of minutes or hours,
e.g. for backups, either reading with little impact on the live environment or writing (accepting backups).

IMHO, besides the sequential bandwidth it would be more helpful to have tests with e.g. 10-20 parallel threads each writing, say, a realistic 500MB of data, if you work on stream-based data (e.g. videos or similar).
On the database side it is far more important to have impressive IOPS with 20 or more parallel streams ...

Do you have such tests too?
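Something along these lines would cover the two workloads described above; paths and sizes are placeholders, not tests that were actually run:

Code:
# fio --ioengine=libaio --direct=1 --name=streams --filename=/tank/fio/stream.test --rw=write --bs=1M --size=500M --numjobs=16 --group_reporting
# fio --ioengine=libaio --direct=1 --name=dblike --filename=/tank/fio/db.test --rw=randrw --bs=8k --size=4G --iodepth=16 --numjobs=20 --group_reporting --time_based --runtime=60

The first run gives 16 parallel ~500MB sequential streams, the second 20 parallel random 8k workers as a database-like load.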
 
Thank you for your tests.

For me there is a drawback to the 4k approach, which is the wear level of the NVMe: depending on your data, you write more than you actually need to.

About the bandwidth: to me it looks like your disks are limited by the connection, and only half of the bandwidth the disks are capable of can be utilized. E.g. with PCIe x4 you should have 32 GBit/s of bandwidth, resulting in a theoretical 4 GiB/s. Typically only 3.6 to 3.8 GiB/s can be reached. Halving that 3.6 GiB/s gives roughly the bandwidth you reported. So it looks like you are connected either with an outdated PCIe version (half the bandwidth) or with only x2 lanes to each of your disks, which of course skews the test since the full speed could not be used.

Can you please also clarify the hardware used?
CPU, board, RAM
especially HBA/RAID controller
and the connection to your disks (SAS or NVMe)?
Back then, I did not test the change in configured block size on the raw disks directly. Don't forget that ZFS and volume datasets do come with a performance penalty. To see what the performance directly on the disk is, I would need to run new benchmarks.

The hardware was exactly the same as the one the benchmark paper was done on.
The board is a Gigabyte MZ32-AR0. The NVMEs are connected directly via the Slimline connectors.
 
I checked your board.
Your board is an X399 chipset, right?
How many PCIe slots have you populated?
Is your slot 7 (nearest to the CPU) free or in use?
 
Your board is an X399 chipset, right?
This is an AMD EPYC board; there is no chipset, as everything is connected to the CPU directly. See the manual, page 8.

The NVMEs are connected to the Slimline connectors in the lower left (16-19 according to the overview on page 6 of the manual). Those ports offer PCIe Gen3, which is okay, as the NVMEs are also only capable of PCIe Gen3.
 
I forgot, EPYCs have the southbridge integrated into the CPU.

Anyway is your PCIe slot 7 free or in use?
 
Anyway is your PCIe slot 7 free or in use?
It is not in use. The slimline connectors right next to it are also free.
 
Hi all, I wonder if I could hijack this thread with a related SSD performance benchmarking question - are my results within expectations? I have 2 identical PVE 7.0-11 hosts, the only difference being the HDD/SSD arrangement. The SSDs are enterprise SATA3 Intel S4520, the HDDs are 7.2K SAS. Full post here: https://forum.proxmox.com/threads/p...1-4-x-ssd-similar-to-raid-z10-12-x-hdd.99967/

Prep:
Code:
zfs create rpool/fio
zfs set primarycache=none rpool/fio

Code:
fio --ioengine=libaio --filename=/rpool/fio/testx --size=4G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randrw  --bs=4K --numjobs=64

SSD results:
Code:
FIO output:
read: IOPS=4022, BW=15.7MiB/s (16.5MB/s)
write: IOPS=4042, BW=15.8MiB/s (16.6MB/s)


# zpool iostat -vy rpool 5 1
                                                        capacity     operations     bandwidth
pool                                                  alloc   free   read  write   read  write
----------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                  216G  27.7T  28.1K  14.5K  1.17G   706M
  raidz1                                               195G  13.8T  13.9K  7.26K   595M   358M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730BAV3P8EGN-part3      -      -  3.60K  1.73K   159M  90.3M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730B9Q3P8EGN-part3      -      -  3.65K  1.82K   150M  89.0M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730B9G3P8EGN-part3      -      -  3.35K  1.83K   147M  90.0M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730BAT3P8EGN-part3      -      -  3.34K  1.89K   139M  88.4M
  raidz1                                              21.3G  13.9T  14.2K  7.21K   604M   348M
    sde                                                   -      -  3.39K  1.81K   149M  87.5M
    sdf                                                   -      -  3.35K  1.90K   139M  86.3M
    sdg                                                   -      -  3.71K  1.70K   163M  87.8M
    sdh                                                   -      -  3.69K  1.81K   152M  86.4M
----------------------------------------------------  -----  -----  -----  -----  -----  -----

HDD results:
Code:
FIO output:
read: IOPS=1382, BW=5531KiB/s
write: IOPS=1385, BW=5542KiB/s

$ zpool iostat -vy rpool 5 1
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                              160G  18.0T  3.07K  2.71K   393M   228M
  mirror                          32.2G  3.59T    624    589  78.0M  40.2M
    scsi-35000c500de5c67f7-part3      -      -    321    295  40.1M  20.4M
    scsi-35000c500de75a863-part3      -      -    303    293  37.9M  19.7M
  mirror                          31.9G  3.59T    625    551  78.2M  49.9M
    scsi-35000c500de2bd6bb-part3      -      -    313    274  39.1M  24.2M
    scsi-35000c500de5ae5a7-part3      -      -    312    277  39.0M  25.7M
  mirror                          32.2G  3.59T    648    548  81.1M  45.9M
    scsi-35000c500de5ae667-part3      -      -    320    279  40.1M  23.0M
    scsi-35000c500de2bd2d3-part3      -      -    328    268  41.0M  23.0M
  mirror                          31.6G  3.59T    612    536  76.5M  45.5M
    scsi-35000c500de5ef20f-part3      -      -    301    266  37.7M  22.7M
    scsi-35000c500de5edbfb-part3      -      -    310    269  38.9M  22.8M
  mirror                          32.0G  3.59T    629    555  78.7M  46.5M
    scsi-35000c500de5c6f7f-part3      -      -    318    283  39.8M  23.1M
    scsi-35000c500de5c6c5f-part3      -      -    311    272  38.9M  23.4M
--------------------------------  -----  -----  -----  -----  -----  -----

I'd have thought the SSDs should deliver about 10x the IOPS shown above - are my expectations out of whack? Any insights appreciated! Thanks!

Hi,

You must design your pool keeping in mind how ZFS works!
You are testing with SYNC writes (direct I/O). In this case, for ANY block that needs to be written, ZFS will:
- first, write it to a "special zone", the ZIL (ZFS intent log)
- second, write the same block again during the normal flush (default 5 sec) from the ZFS cache to the pool

If you need high IOPS, you will get better results using a dedicated SLOG device with high IOPS.

As a side note, in the real world you will see/need SYNC writes when you use a database. Most of them write SYNC with 8k (Oracle, PostgreSQL) or 16k (MySQL/Percona) blocks. In such a case, you will set up your dataset for this block size.
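A minimal sketch of both suggestions; the device path and dataset names are only examples, not a recommendation for specific hardware:

Code:
# zpool add rpool log /dev/nvme0n1p1          # dedicated SLOG device that absorbs the sync writes
# zfs create -o recordsize=16k rpool/mysql    # match the dataset block size to the database (16k for MySQL/InnoDB)
# zfs create -o recordsize=8k rpool/pgsql     # 8k for PostgreSQL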

Good luck / Bafta!
 
Could we get, as a feature, formatting the NVMe to 4096-byte blocks if needed, following:

https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html#nvme-low-level-formatting

It seems that automatically detecting the optimal block size and reformatting before installing is doable. Would be a nice feature; it's always nice for defaults to work optimally.

edit: well, today I learned why - I blew away a new Proxmox install to set the NVMe blocks to 4k (instead of 512) and Proxmox wouldn't install - apparently booting ZFS off that does not work.
 
Could we get, as a feature, formatting the NVMe to 4096-byte blocks if needed, following:

https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html#nvme-low-level-formatting

It seems that automatically detecting the optimal block size and reformatting before installing is doable. Would be a nice feature; it's always nice for defaults to work optimally.

edit: well, today I learned why - I blew away a new Proxmox install to set the NVMe blocks to 4k (instead of 512) and Proxmox wouldn't install - apparently booting ZFS off that does not work.

Please open an enhancement / feature request at our bugtracker so we can keep track of it and discuss technicalities there. I am not sure if that would work with consumer NVMEs as there are other NVME features which they usually don't support. But I don't have an empty one at hand to verify that quickly myself.
 
