[SOLVED] Bifurcated PCIe NVMe adapter with RAID0 and low performance

P4in1

New Member
Sep 23, 2024
Hi all,

I'm fairly new to Proxmox and Linux in general, and I am trying to upgrade my server with some fancy hardware.
I installed the ASUS Hyper card with four NVMe drives on my Super Micro X12 mainboard. I am using 4x4x4x4 bifurcation for this setup, created a RAID0-style pool, and added it as ZFS storage. All NVMes were found and working as expected, and then I thought I would give it a try and see how fast it gets.

My assumption: four of these NVMes in a RAID0 setup should get me somewhere over 20 GB/s for normal read and write operations (four Gen 4 drives at up to ~7 GB/s each, and a PCIe 4.0 x16 slot tops out at roughly 31.5 GB/s, so 20+ GB/s seemed plausible on paper).

My "Problem" : I only get 4.5 - 5 GB/s Read and Write speed which i think is underwhelming considering that a single NVMe of my setup alone should score higher than the whole RAID.

Is there some kind of setting I am missing, a switch not toggled, or something else? It works, but not how I thought it should, and I see videos of people getting this setup to run the way I wanted, but I don't know how to "fix" this.

After reading the compatibility chart of the ASUS Hyper and seeing that it might not be compatible, I bought this one from Delock, but to no avail.

My System:
Mainboard: Super Micro X12DPL-i6
CPU: Intel Xeon Silver 4314
RAM: 256 GB Micron Reg ECC
GPU: 3090 Ti Founders Edition with GPU passthrough
ZFS Pool 1: 2x 1TB NVMe mirror directly on board for the system
ZFS Pool 2: 5x 20TB HDD RAIDZ with 1 for parity
ZFS Pool 3: 1x 20TB HDD for network storage
ZFS Pool 4: 4x NVMe via the bifurcated PCIe adapter

Additional Information:

Proxmox Version: 7.3-4
The GPU is inserted into CPU1 Slot 2 and blocks Slot 3 completely (it's way bigger than I expected it to be)
The NVMe adapter is inserted into CPU Slot 4
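
In case it helps with the diagnosis, I can also check the negotiated link speed and width of each drive. A rough sketch of what I would run (the PCI address is a placeholder for each drive's address from lspci):

Code:
# List the NVMe devices with model and capacity (needs the nvme-cli package).
nvme list
# Show the negotiated PCIe link state for one drive; each drive should report
# "Speed 16GT/s, Width x4" for a full Gen 4 x4 link behind the bifurcated slot.
# 0000:xx:00.0 is a placeholder for the drive's PCI address.
lspci -vv -s 0000:xx:00.0 | grep -i lnksta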

I would appreciate every bit of help to get this running and to help me find my footing in the Linux space. I am still learning and trying to understand how this works in detail.
If any information is required to find the solution, I will dig it up (but I might need some pointers on "how to", because I consider myself a Linux noob).

Thanks in advance.

Have a good night or day :)
 
There have been many performance and functional improvements in ZFS, so you should simply update your PVE (why not to today's 8.2.5), which brings an updated ZFS (2.2.6) with it.
After that you may also need to run "zpool upgrade <pool>" on your pools, and performance should then be much better than before.
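
Roughly, and assuming the correct PVE repositories are already configured, the steps look like this (the pool name is a placeholder):

Code:
# Pull the latest PVE packages, which include the newer ZFS release.
apt update && apt full-upgrade
# Confirm which ZFS version is now running.
zfs version
# Enable the new feature flags on the pool (note: this step is irreversible).
zpool upgrade <pool>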
 
Thanks for your help. I upgraded my Proxmox node to 8.2.7 and ran the "zpool upgrade", but nothing changed. My system works as expected, but the ZFS pool still only reaches ~5 GB/s read/write speed.
 
Why write down the exact model of the GPU when the most important thing is, of course, the model of the disks themselves?
How did you make the RAID0?

As a weekly reminder: consumer flash is slow for ZFS and will wear out quickly.
Remember that ZFS also trades performance for its features.
 
These are the NVMe drives: Samsung 990 Pro M.2 2TB.

I used these two adapters: ASUS Hyper M.2 x16 Gen 4 Card and the Delock PCI Express x16 Card to 4 x internal NVMe M.2 Key M - Bifurcation

I wrote down the GPU to give a simple overview of my system specifications; the NVMes are linked in the second sentence, but I forgot to add the link in the table.

I simply thought that the NVMe adapter would split the 16 PCIe 4.0 lanes into 4x4 and that I could profit from the speed of four NVMe drives by combining them into one pool using the zpool toolkit.

I created the pool like this:
"zpool create -o ashift=12 nvme-pool /path_to_discs"
"zfs set compression=lz4 nvme-pool-1"
"zfs set atime=off nvme-pool-1"

My goal was to create a fast storage I can then use for my special VMs for gaming and AI-related work, including coding, video, and imaging applications. I accepted that some performance would be lost along the way, but currently I can only use ~20% of the full potential, which is why I created this thread in case I was simply missing something.
 
Consumer SSDs and ZFS again... Nope, no further words about that.
Try a nice BTRFS RAID0. Should work way better.
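
Purely as a sketch with placeholder device names, a BTRFS RAID0 across the four drives would look something like this (data striped, metadata kept redundant):

Code:
# Stripe data across the four drives; keep metadata as raid1 for some safety.
# Device names are placeholders, double-check them before running mkfs.
mkfs.btrfs -f -d raid0 -m raid1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
mkdir -p /mnt/nvme-btrfs
mount /dev/nvme2n1 /mnt/nvme-btrfs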
 
My system works as expected, but the ZFS pool still only reaches ~5 GB/s read/write speed.
How do you test it? How many times? Sequentially, with 4 threads & 32 queue depth?
Inside a VM? A Windows VM?

The real sustained write speed of the 2 TB 990 Pro is more like 1.5 GB/s ( https://www.techpowerup.com/review/samsung-990-pro-2-tb/6.html )

Also, ZFS isn't a filesystem oriented toward maximum performance.
Raw disks over an LVM RAID0 will, imo, be the fastest, but you lose snapshot support and it's very unsafe if the system crashes or there is a power outage.
Better is LVM-thin on each drive, then spread the VM's vdisks over them: fast, snapshot support, supported, and no write amplification.
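
As a rough sketch of that layout, with placeholder device, volume group, size, and storage ID (repeat per drive, then spread the VM's vdisks across the resulting storages):

Code:
# Turn one NVMe drive into an LVM-thin pool and register it as Proxmox storage.
# /dev/nvme2n1, vg_nvme2, the 1800G size and the storage ID are all placeholders.
pvcreate /dev/nvme2n1
vgcreate vg_nvme2 /dev/nvme2n1
lvcreate -L 1800G --thinpool data vg_nvme2
pvesm add lvmthin nvme2-thin --vgname vg_nvme2 --thinpool data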
 
How do you test it? How many times? Sequentially, with 4 threads & 32 queue depth?

This! Apparently I was just testing them the wrong way, or with a wrong assumption.

My initial test was done this way, since I did not know how to test this in the first place:

Code:
fio --name=read_test --ioengine=libaio --rw=read --bs=1M --direct=1 --size=4G --numjobs=1 --time_based --runtime=10 --group_reporting --filename=/nvme-pool-1/testfile-read

Which outputs something like this:

Code:
Jobs: 1 (f=1): [R(1)][100.0%][r=4281MiB/s][r=4281 IOPS][eta 00m:00s]
read_test: (groupid=0, jobs=1): err= 0: pid=262481: Wed Sep 25 10:42:32 2024
  read: IOPS=4275, BW=4276MiB/s (4484MB/s)(41.8GiB/10001msec)
    slat (usec): min=103, max=546, avg=232.54, stdev=10.50
    clat (nsec): min=696, max=6206, avg=839.89, stdev=143.84
     lat (usec): min=104, max=550, avg=233.38, stdev=10.52
    clat percentiles (nsec):
     |  1.00th=[  732],  5.00th=[  748], 10.00th=[  764], 20.00th=[  780],
     | 30.00th=[  796], 40.00th=[  804], 50.00th=[  820], 60.00th=[  836],
     | 70.00th=[  852], 80.00th=[  868], 90.00th=[  908], 95.00th=[  948],
     | 99.00th=[ 1304], 99.50th=[ 1624], 99.90th=[ 2672], 99.95th=[ 3728],
     | 99.99th=[ 4320]
   bw (  MiB/s): min= 4234, max= 4426, per=100.00%, avg=4277.05, stdev=40.55, samples=19
   iops        : min= 4234, max= 4426, avg=4277.05, stdev=40.55, samples=19
  lat (nsec)   : 750=5.18%, 1000=91.25%
  lat (usec)   : 2=3.32%, 4=0.22%, 10=0.03%
  cpu          : usr=0.53%, sys=99.45%, ctx=19, majf=0, minf=264
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=42764,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
   READ: bw=4276MiB/s (4484MB/s), 4276MiB/s-4276MiB/s (4484MB/s-4484MB/s), io=41.8GiB (44.8GB), run=10001-10001msec

I read up on how to do the thread and queue depth tests and used this as a new test:

Code:
fio --name=read_test --ioengine=libaio --rw=read --bs=128k --direct=1 --size=1G --numjobs=4 --iodepth=32 --time_based --runtime=10 --group_reporting --filename=/nvme-pool-1/read-testfile

and the new output looks like this:

Code:
Jobs: 4 (f=4): [R(4)][100.0%][r=23.2GiB/s][r=190k IOPS][eta 00m:00s]
read_test: (groupid=0, jobs=4): err= 0: pid=262912: Wed Sep 25 10:43:42 2024
  read: IOPS=188k, BW=22.9GiB/s (24.6GB/s)(229GiB/10001msec)
    slat (usec): min=10, max=119, avg=20.53, stdev= 2.84
    clat (nsec): min=1142, max=2214.2k, avg=660082.61, stdev=67104.28
     lat (usec): min=17, max=2293, avg=680.61, stdev=69.23
    clat percentiles (usec):
     |  1.00th=[  619],  5.00th=[  627], 10.00th=[  635], 20.00th=[  644],
     | 30.00th=[  644], 40.00th=[  652], 50.00th=[  652], 60.00th=[  652],
     | 70.00th=[  660], 80.00th=[  660], 90.00th=[  668], 95.00th=[  676],
     | 99.00th=[ 1139], 99.50th=[ 1156], 99.90th=[ 1188], 99.95th=[ 1188],
     | 99.99th=[ 1205]
   bw (  MiB/s): min=17273, max=24045, per=99.95%, avg=23474.24, stdev=370.29, samples=76
   iops        : min=138184, max=192360, avg=187794.11, stdev=2962.30, samples=76
  lat (usec)   : 2=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=0.01%, 750=98.25%, 1000=0.01%
  lat (msec)   : 2=1.74%, 4=0.01%
  cpu          : usr=3.35%, sys=95.64%, ctx=24660, majf=0, minf=4140
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1878990,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=22.9GiB/s (24.6GB/s), 22.9GiB/s-22.9GiB/s (24.6GB/s-24.6GB/s), io=229GiB (246GB), run=10001-10001msec


This looks more like the result I was hoping for.
Now I know that my setup is able to do the heavy lifting, just not in the way I was expecting at first.

Consumer SSDs and ZFS again... Nope, no further words about that.

Consumer-grade NVMes and ZFS seem to be working quite nicely together, or am I missing something?

Do I understand it correctly that I can only use the full potential with multiple read and write processes running simultaneously, rather than in a single copy-paste scenario?

Thanks a lot.
 
And now run your fio test with a test file sized at twice your RAM per job, to minimize cache effects; so if doing 4 jobs and you have 256 GB of RAM:

Code:
fio --name=read_test --ioengine=libaio --rw=read --bs=128k --direct=1 --size=512G --numjobs=4 --iodepth=32 --time_based --runtime=10 --group_reporting --filename=/nvme-pool-1/read-testfile
 
The result is still about 2.5 times higher than my first attempt:

Code:
Starting 4 processes
read_test: Laying out IO file (1 file / 524288MiB)
Jobs: 4 (f=4): [R(4)][100.0%][r=11.3GiB/s][r=92.6k IOPS][eta 00m:00s]
read_test: (groupid=0, jobs=4): err= 0: pid=324266: Wed Sep 25 13:01:12 2024
  read: IOPS=94.0k, BW=11.5GiB/s (12.3GB/s)(115GiB/10001msec)
    slat (usec): min=11, max=2805, avg=41.37, stdev=16.46
    clat (nsec): min=1776, max=5640.6k, avg=1317561.00, stdev=68739.64
     lat (usec): min=21, max=5671, avg=1358.93, stdev=70.51
    clat percentiles (usec):
     |  1.00th=[ 1221],  5.00th=[ 1254], 10.00th=[ 1254], 20.00th=[ 1270],
     | 30.00th=[ 1287], 40.00th=[ 1303], 50.00th=[ 1319], 60.00th=[ 1319],
     | 70.00th=[ 1336], 80.00th=[ 1352], 90.00th=[ 1385], 95.00th=[ 1401],
     | 99.00th=[ 1483], 99.50th=[ 1516], 99.90th=[ 1926], 99.95th=[ 2245],
     | 99.99th=[ 4113]
   bw (  MiB/s): min=11215, max=12267, per=100.00%, avg=11756.32, stdev=68.99, samples=76
   iops        : min=89720, max=98136, avg=94050.53, stdev=551.84, samples=76
  lat (usec)   : 2=0.01%, 4=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.90%, 4=0.08%, 10=0.01%
  cpu          : usr=2.66%, sys=77.72%, ctx=519020, majf=0, minf=4142
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=939642,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32


Run status group 0 (all jobs):
   READ: bw=11.5GiB/s (12.3GB/s), 11.5GiB/s-11.5GiB/s (12.3GB/s-12.3GB/s), io=115GiB (123GB), run=10001-10001msec

I expected roughly 50% of the maximum potential in order to call it a success.

I guess I can live happily with this result :)

Thanks to everyone. I'll mark my problem as "Solved".
 
