NVMe Passthrough Bottlenecking

Solinus

Member
Dec 30, 2021
Set up a brand-new Dell PowerEdge R6515 with a 24-core Epyc CPU and NVMe storage. If I store the virtual disk on the NVMe via local LVM in Proxmox, I get the expected speeds in a Windows VM.

[Screenshot: CrystalDiskMark results, virtual disk on local LVM]

If I try to pass through the NVMe storage, however, it severely bottlenecks the drives and I cannot figure out where the issue is. Any thoughts?

[Screenshot: CrystalDiskMark results, NVMe via PCIe passthrough]
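For reference, the passthrough was set up on the host with something like this (the VMID and PCI address here are just placeholders for whatever your system shows):

Code:
# q35 machine type plus PCIe passthrough of the NVMe controller
# (VMID 100 and address 0000:41:00.0 are placeholders)
qm set 100 --machine q35
qm set 100 --hostpci0 0000:41:00.0,pcie=1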
 
CrystalDiskMark isn't great for benchmarking here because you are mostly just benchmarking your RAM. I would guess LVM is faster because the host's Linux page cache is caching there, and maybe you also set some additional caching options like writeback for write caching, both of which are missing when you use passthrough.
If you want to see more reliable numbers you should use something like fio, do sync writes, and disable all read caching. Then both benchmarks should show the same slow performance.
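You can check which cache mode the LVM-backed disk is using, and set it back to no caching, from the Proxmox host. Something like this (the VMID and disk name below are just examples, use whatever your VM config shows):

Code:
# show the current disk line including any cache= option (VMID is an example)
qm config 100 | grep scsi0
# switch the disk back to the default "no cache" mode
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=none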

You didn't tell us which NVMe you are using, but I would say no SSD can write to the NAND at 7 GB/s.
 
How is it just "benchmarking my RAM"? Not saying you're wrong per se, but I use it on full-fat Windows hosts and there is a big difference between all my storage devices; if it were just testing RAM, then in theory it would be the same across all of them.

Also, these drives are rated for the speeds in the first picture, so that, to me, is accurate. The PCIe passthrough result in the second picture clearly is not. In theory, passthrough should be no different from a full-fat host for speeds, for the most part.
 
First your Windows is caching in RAM. Then virtio is caching in RAM. Then your host's Linux is caching in RAM. Then your SSD is caching in its internal RAM. Then your SSD is caching using its SLC cache. Only after all of that is it writing/reading to/from the NAND at normal speeds.
CrystalDiskMark only uses async writes, which will always be cached. If you used sync writes (which CDM can't do), the volatile RAM caches would be skipped and data would be written directly to the SLC cache or NAND, bypassing all the RAM caching. And all reads will always be cached in RAM if you don't manually disable caching everywhere. Try to read/write 5x 256 GB instead of 1 GB and your performance will be way lower, because that's too much data to be cacheable. First you will hit one wall when the RAM gets full and can't cache any longer. Then you will hit another wall when the SLC cache gets full. Only after that will it write directly to NAND without any caching, and your writes should be down to a few hundred MB/s.

As long as the fast RAM is used somewhere, you only read/write from/to the RAM instead of the SSD, so you are just benchmarking your RAM.
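You can see the host-side RAM caching directly on the Proxmox node, for example like this (the file path is just an example, use any big file you have lying around):

Code:
# the first read comes from the disk, the repeat is served from the Linux page cache
dd if=/var/lib/vz/template/iso/some-big.iso of=/dev/null bs=1M
dd if=/var/lib/vz/template/iso/some-big.iso of=/dev/null bs=1M
# drop the page cache and the next read is slow again
echo 3 > /proc/sys/vm/drop_caches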
 
While that still doesn't explain everything, let's accept it as is for now.

What command do you suggest I run on the Windows guests with fio to be certain I'm testing the drives as you described? I don't have experience with fio, and the example commands I've found seem to error out with syntax errors.
 
For Write IOPS you could run something like this:
Code:
fio.exe --ioengine=windowsaio --directory=c\:\temp --size=10G --direct=1 --sync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=Sync_IOPS

And for Write throughput something like this:
Code:
fio.exe --ioengine=windowsaio --directory=c\:\temp --size=10G --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=Sync_Throughput

It will run for 60 seconds and write to "C:\temp", so you need enough free space there for the test file (the --size value is just an example, adjust it to what you have free). All writes are done as sync writes (like DBs, for example, will do), so only the SLC cache will be used. If you want to limit the effect of the SLC cache, you could fill the SSD up to 90% or so, so there is less free space available for SLC caching, or you could increase the test duration even more. The more your SSD is filled up, the slower it gets, so that would also be more like a real-world scenario where you don't use a fresh and empty SSD.
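For example, something like this would write a large filler file first (the size is just a placeholder, pick whatever leaves roughly 10% free on the drive):

Code:
fio.exe --ioengine=windowsaio --directory=c\:\temp --rw=write --bs=1M --size=800G --name=Filler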

With sync writes you can benchmark your SSD's worst-case write performance. If you use async writes you see the best-case performance. Real-world performance will be somewhere in between, depending on the workload. Read benchmarking without RAM caching is difficult because there it really depends on the hardware and the storage/filesystem used. The easiest approach is to just read a massive amount of data until the caches can't handle it any longer.
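For example, something like this (the --size here is just an example; pick something well above your host and guest RAM):

Code:
fio.exe --ioengine=windowsaio --directory=c\:\temp --direct=1 --rw=read --bs=4M --size=256G --numjobs=1 --iodepth=1 --name=Read_Throughput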
 
Thanks for these.

The first command shows the same type of discrepancy between the two VMs. To be clear, both NVMes are identical Dell 960 GB NVMe drives; one is passed through directly and the other is not.

These are the results of the first command:

NVMe - File Based

[Screenshot: fio random write IOPS results]

NVMe - PCIe Passthrough

[Screenshot: fio random write IOPS results]

Curiously, the second command shows results more in line with what I would expect, with the passthrough storage pulling ahead. Ran that one for 60 seconds and for 300 seconds and the results were basically the same:

NVMe - File Based

[Screenshot: fio sequential write throughput results]

NVMe - PCIe Passthrough

[Screenshot: fio sequential write throughput results]
 
Did some more testing and it looks like things are as they should be. Thanks for the help confirming. Learned a lot today.