NVMe PCIe 5 benchmarking

jggundin
Aug 7, 2019
Hello,

I'm benchmarking different storage setups on a new server and I've found an issue, or at least some strange behaviour.
The server has 4x KIOXIA KCMYXRUG3T84 on AMD EPYC, linking as expected at PCIe 5.
Running a single Windows Server VM on a single disk reports reads of about ~15 GB/s, which is as expected for that drive.
Cloning that machine onto 4 disks and running the same test simultaneously reports 15 GB/s for every disk; nmon reports about 55 GB/s aggregated at the hypervisor.

But if I combine the 4 disks into any kind of RAID (mdadm, tested levels 0, 1, 10, 5), ZFS RAIDZ, or LVM, or combinations of them, even in directory mode, the read speed is always equal to or below 14 GB/s. It seems to be a kernel issue or a setup limitation, but so far I haven't found the cause of the bottleneck.
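One way to narrow down where the limit sits might be to read from the array device directly on the host, bypassing the VM and filesystem layers entirely. A sketch with fio (the device name is a placeholder for your actual array; --readonly keeps it safe against accidental writes):

```shell
# Sequential read directly from the md array (bypasses VM + filesystem).
# /dev/md0 is a placeholder for the actual array device.
fio --name=seqread --filename=/dev/md0 --rw=read --bs=1M \
    --ioengine=io_uring --iodepth=32 --direct=1 \
    --runtime=30 --time_based --readonly
```

If the host-side number also tops out near 15 GB/s, the limit is in the md/block layer; if it scales toward ~55 GB/s, the bottleneck is somewhere in the virtualization path.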

For example, an mdadm 4-drive RAID 0 (tested even with a 1k chunk) mounted as a directory over ext4, with the test VM on it, gives 15 GB/s, exactly as if I had partitioned a single disk.
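For context, an mdadm layout like the one described might be created along these lines (a sketch; device names and mount point are placeholders, and the chunk size, given in KiB, is just an example value):

```shell
# Build a 4-drive RAID 0 and put ext4 on it -- device names are placeholders.
mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=1024 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt/raid0
```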
The disk is set as IOThread, No cache, Async IO: Native
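For reference, a hypothetical Proxmox CLI line setting those same options on a VM disk (the VM ID, storage name, and disk name are placeholders for your setup):

```shell
# Placeholders: VM 100, storage "local-lvm", disk "vm-100-disk-0".
qm set 100 --scsi0 local-lvm:vm-100-disk-0,iothread=1,aio=native,cache=none
```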

Any ideas on how to achieve optimal RAID or LVM performance?

Current results:
CrystalDiskMark test            Read (MB/s)   Write (MB/s)
RAID 10 + LVM                   13884         10182
RAID 10 + LVM Thin               7910          3049
RAID 1 + LVM                    13131          6515
RAID 5 + Directory               1129          1106
ZFS mirror                       7996          7765
ZFS RAID 10                      7648         11445
ZFS RAIDZ                        7915         11714
ZFS RAIDZ2                       7918          6191
ZFS dRAID (2 DATA 1 SPARE)       7623          9542
ZFS on single disk               7699          8682
ZFS on mdadm RAID0               7872         10338
ext4 on RAID0 Directory         16063         13764
ext4 on RAID1 Directory         15481          6120
ext4 on single disk Directory   14962          7580


Thanks!
 

I've had similar results to yours using RAID 10 on my Kioxia CM7 PCIe 5 drives (4 drives). Overall, I can't complain about the performance I get from the system in real-world usage. However, I would like to achieve better performance—maybe when ZFS 2.3.x arrives on Proxmox with direct I/O support.
 

Looks like direct I/O will not be supported for zvols :(
 
I don’t complain much. My environment and use case strike a good balance between features and performance.

Of course, I’m not a fan of the significant performance penalty, especially considering that each of my NVMe drives can reach 16,000 MB/s, and I’m using four of them in RAID 10 on each node, yet all I get is around the performance of a single drive.

On the other hand, I don’t have shared storage; instead, I rely on ZFS replication over a dedicated 2x 25 Gbps link (50 Gbps, as it is a bond), since I can tolerate a few minutes of data loss.

All in all, I’ll stick with ZFS; perhaps its performance in the NVMe world will improve in the future.
 
Thanks for the detailed benchmark!

As for your question, it seems to me that there are only two sane ways to utilize the storage:
ZFS striped mirror (or two separate mirrors)
mdadm striped mirror (or two separate mirrors)
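The two layouts above could be created roughly like this (a sketch; pool name and device names are placeholders for your hardware):

```shell
# ZFS striped mirror (RAID 10 equivalent): two mirror vdevs striped together.
zpool create -o ashift=12 tank \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1

# mdadm equivalent: a 4-drive RAID 10 array.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
```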

mdadm+LVM will be faster; that's a known fact.
ZFS has more features (inline compression, file-level checksums, snapshots, zfs send, etc.); also a known fact.

On balance, the question to ask is not "what's optimal" with MB/s as the sole metric, which is arguably not even that important. The question is "what does the storage facility need to provide?"