Proxmox VE ZFS Benchmark with NVMe

Dunuin

Famous Member
Jun 30, 2020
8,092
2,018
149
Germany
ZFS isn't that fast because it is doing much more and is more concerned about data integrity, which adds overhead. With mdadm you don't get bit rot protection, so your data can silently corrupt over time, and you get no block-level compression, no deduplication, no replication for fast backups or HA, no snapshots, ...
You could add some of those features by using a filesystem like btrfs on top of your mdadm array or by using qcow2 images, but then again your mdadm setup will get slower because btrfs creates additional overhead, which you don't get with ZFS because there it is all already integrated.
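To give a rough idea of what those integrated features look like in practice (pool and dataset names are only examples):

Code:
# zpool scrub tank                                                                # verify every block checksum; repairs silent corruption if redundancy exists
# zpool status -v tank                                                            # shows any checksum errors the scrub found
# zfs set compression=lz4 tank/vmdata                                             # transparent block-level compression
# zfs snapshot tank/vmdata@before-upgrade                                         # instant snapshot
# zfs send tank/vmdata@before-upgrade | ssh backuphost zfs receive backup/vmdata  # simple replication to another host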
 

ectoplasmosis

New Member
Feb 2, 2022
7
0
1
42
As a comparison, I'm running a 15TB mdadm ext4 array on the following hardware:

EPYC 7443P 24-core CPU
256GB 8-channel RAM
8x Samsung PM9A3 Gen4 NVME U.2 1.92TB drives, connected via 2x AOC-SLG4-4E4T-O bifurcation re-timer cards

Seeing ~40GB/s read and ~20GB/s write via a sequential FIO test.
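Roughly along these lines, though the mount point, block size, file size and job count here are only illustrative and not the exact command:

Code:
# fio --ioengine=libaio --filename=/mnt/raid/fio-test --size=64G --direct=1 --iodepth=32 --rw=read --bs=1M --numjobs=8 --group_reporting --name=seqread
# fio --ioengine=libaio --filename=/mnt/raid/fio-test --size=64G --direct=1 --iodepth=32 --rw=write --bs=1M --numjobs=8 --group_reporting --name=seqwrite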
 

Emilien

Member
Jan 23, 2022
121
7
18
Italy
As a comparison, I'm running a 15TB mdadm ext4 array on the following hardware:

EPYC 7443P 24-core CPU
256GB 8-channel RAM
8x Samsung PM9A3 Gen4 NVME U.2 1.92TB drives, connected via 2x AOC-SLG4-4E4T-O bifurcation re-timer cards

Seeing ~40GB/s read and ~20GB/s write via a sequential FIO test.
Which mdadm raid config?
 

guletz

Famous Member
Apr 19, 2017
1,584
260
103
Brasov, Romania
With mdadm you don't get bit rot protection, so your data can silently corrupt over time
... even worse. If one of your HDD RAID10 members has a non-fatal read error (any HDD has some read errors over time), then you will have a corrupt block/data. Even more, these non-fatal read errors (I can read a block, but the data returned differs from what was written in the past) will increase if your HDD temperature rises above normal... one night without your cooling system and you can lose all of your data.
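For what it's worth, mdadm can at least detect such mismatches during a periodic check, but unlike ZFS it cannot tell which copy holds the correct data (md0 and sda are only example device names):

Code:
# echo check > /sys/block/md0/md/sync_action   # start a consistency check of the array
# cat /sys/block/md0/md/mismatch_cnt           # non-zero means the mirror copies disagree
# smartctl -a /dev/sda | grep -i error         # check the drive's own error counters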

Good luck / Bafta !
 

hkais

Member
Feb 3, 2020
7
1
8
42
Since the Micron 9300 that we use in the benchmark paper support different block sizes that can be configured for the namespaces, we did some

IOPS tests:


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1

The result of 46k IOPS is in the ballpark of the result of the benchmark paper. So far no surprise.


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1

As you can see, using the larger 4k block size for the NVME namespace, we get ~76k IOPS which is close to double the IOPS performance.

Bandwidth tests:


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1

We get about 1700MB/s bandwidth.


Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1

With the 4k block size namespaces there is no significantly higher bandwidth (1800MB/s).

512b NVME block size: ~46k IOPS, ~1700MB/s bandwidth
4k NVME block size: ~75k IOPS, ~1800MB/s bandwidth
Thank you for your tests.

There is a drawback to the 4k approach for me, which is the wear level of the NVMe: depending on your data, you write more than you actually need to.

About the bandwidth: to me it looks like your disks are limited by the connection, so only half of the bandwidth the disks are capable of can be utilized. E.g. with PCIe 3.0 x4 you should have 32 GBit/s of link bandwidth, i.e. a theoretical 4 GiB/s; typically only 3.6 to 3.8 GiB/s are reached in practice. Halving that 3.6 GiB/s lands roughly at the bandwidth you reported. So it looks like you are connected either with an older PCIe generation (half the bandwidth) or with only x2 lanes per disk, which of course makes the test somewhat misleading, since the full speed could not be used.
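To rule this out, the negotiated PCIe link of each NVMe can be checked; something like the following (the PCI address is only a placeholder):

Code:
# lspci | grep -i 'non-volatile'                    # find the PCI addresses of the NVMe controllers
# lspci -s 41:00.0 -vv | grep -E 'LnkCap|LnkSta'    # compare the capable vs. the negotiated link speed and width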

Can you please also clarify the hardware used?
CPU, board, RAM
especially the HBA/RAID controller
and how the disks are attached (SAS or NVMe)?
 

hkais

Member
Feb 3, 2020
7
1
8
42
As a comparison, I'm running a 15TB mdadm ext4 array on the following hardware:

EPYC 7443P 24-core CPU
256GB 8-channel RAM
8x Samsung PM9A3 Gen4 NVME U.2 1.92TB drives, connected via 2x AOC-SLG4-4E4T-O bifurcation re-timer cards

Seeing ~40GB/s read and ~20GB/s write via a sequential FIO test.
On servers, especially with that much firepower, I have never needed such fast sequential bandwidth. Servers with that firepower typically have tons of workload to compute and thus write a lot of data in a non-sequential manner, or read a lot of random data with only short sequential streams.

Sequential reading or writing is only helpful if you have long-running streams of minutes or hours.
E.g. for backups it is helpful, either when reading (keeping the impact on the live environment short) or when writing (receiving backups).

IMHO, besides the sequential bandwidth, it would be more helpful to have tests with e.g. 10-20 parallel threads each writing, let's say, a realistic 500MB of data, if you work on stream-based data (e.g. videos or similar).
On the database side it is far more important to have impressive IOPS with 20 or more parallel streams ...
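Something in this direction is what I mean; only a sketch, with the zvol path, runtime and job count as placeholders:

Code:
# fio --ioengine=libaio --filename=/dev/zvol/tank/test --size=9G --time_based --runtime=300 --direct=1 --iodepth=32 --rw=randwrite --bs=4k --numjobs=20 --group_reporting --name=randwrite-20jobs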

Do you have such tests too?
 

aaron

Proxmox Staff Member
Staff member
Jun 3, 2019
3,071
512
118
Thank you for your tests.

There is a drawback to the 4k approach for me, which is the wear level of the NVMe: depending on your data, you write more than you actually need to.

About the bandwidth: to me it looks like your disks are limited by the connection, so only half of the bandwidth the disks are capable of can be utilized. E.g. with PCIe 3.0 x4 you should have 32 GBit/s of link bandwidth, i.e. a theoretical 4 GiB/s; typically only 3.6 to 3.8 GiB/s are reached in practice. Halving that 3.6 GiB/s lands roughly at the bandwidth you reported. So it looks like you are connected either with an older PCIe generation (half the bandwidth) or with only x2 lanes per disk, which of course makes the test somewhat misleading, since the full speed could not be used.

Can you please also clarify the hardware used?
CPU, board, RAM
especially the HBA/RAID controller
and how the disks are attached (SAS or NVMe)?
Back then, I did not test the change in configured block size against the raw disks directly. Don't forget that ZFS and its volume datasets (zvols) come with a performance penalty. To see what the performance directly to the disk looks like, I would need to run new benchmarks.

The hardware was exactly the same as the hardware the benchmark paper was done on.
The board is a Gigabyte MZ32-AR0. The NVMEs are connected directly via the Slimline connectors.
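For anyone who wants to reproduce the namespace block size change, nvme-cli can show and switch the LBA format; roughly like this (device path and LBA format index are only examples, and a format erases the namespace):

Code:
# nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"   # list the supported LBA formats and which one is in use
# nvme format /dev/nvme0n1 --lbaf=1                # switch to the 4k LBA format (index differs per drive, destroys all data!)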
 
