Beginner Seeking Advice: Optimizing NVMe Performance in Proxmox for VMs

simoncechacek
Jun 21, 2023
Hello Proxmox community,

I hope you all are doing well. I'm writing to seek your advice and guidance on an issue that's been giving me a bit of a challenge. I've just started exploring Proxmox, and while it's been an exciting journey so far, I could really use your collective wisdom on a specific point.

We've got a Gigabyte R272-Z34 server (https://www.gigabyte.com/Enterprise/Rack-Server/R272-Z34-rev-100) with an AMD EPYC 7H12 processor, 512GB RAM, and 24 SAMSUNG MZWLJ1T9HBJR-00007 P2 drives.

For some additional context, we've been using VMware for our other servers and we're considering switching to Proxmox with this new machine. So, I'm relatively new to this platform and still getting my bearings.

The goal is to get the best possible performance from these NVMe drives. We've set up a software RAID within Proxmox and initially tried RAID10, but we're open to other suggestions if they might work better.

I ran a write speed test with this command: dd if=/dev/zero of=/nvme/test1.img bs=5G count=1 oflag=dsync. The result was about 1.7GB/s when run directly in Proxmox SSH, but when the same test was performed inside a Linux VM, the speed dropped to about 833MB/s. I'm not sure this is the best test, but it does show a difference.

Now, I'm not even sure if 1.7GB/s is the best speed we could be getting directly in Proxmox, so I'm hoping to understand how we can optimize this setup further.

I'd really appreciate any advice on how to configure the RAID for better performance or general Proxmox configurations that could be beneficial. Any suggestions for more effective benchmarking tools or methods would also be incredibly helpful.

Thank you so much for your time and help. I'm eager to learn from your expertise.

Kind regards
Simon Cechacek
 
PS: I am also learning about CEPH. I'm an absolute beginner with it, so I have no idea whether it would be a good fit for us - we will run only one Proxmox server for now, but with all those 24 drives.

Just wanted to say that I am open even to that idea, if it would be a better solution.

We plan to host high-performance websites on that server, so mainly webserver and MySQL loads are expected.
 
Check out the pinned benchmark threads to get an idea of how to do benchmarks (fio instead of dd, for example) and what you can expect to lose in performance with each layer.

Hopefully we can produce new benchmark papers later this year!
 
Hopefully we can produce new benchmark papers later this year!
Would be nice to have some Samsung EVO vs Samsung QVO this time. Maybe then people will stop buying QLC SSDs for ZFS :)

PS: I am also learning about CEPH. I'm an absolute beginner with it, so I have no idea whether it would be a good fit for us - we will run only one Proxmox server for now, but with all those 24 drives.
Ceph needs 3+ nodes.

I ran a write speed test with this command: dd if=/dev/zero of=/nvme/test1.img bs=5G count=1 oflag=dsync. The result was about 1.7GB/s
Basically a useless test, as you are writing zeros and ZFS does block-level compression, compressing all those zeros away. Try "fio" or at least use "if=/dev/urandom".
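To illustrate why zeros make a meaningless benchmark payload on a compressing filesystem, here is a small analogy in Python. It uses the standard library's zlib, whereas ZFS actually uses LZ4/ZLE inline compression, but the effect on all-zero data is the same: almost nothing reaches the disk.

```python
import os
import zlib

# A 1 MiB block of zeros, like what `dd if=/dev/zero` produces
zeros = bytes(1024 * 1024)
# A 1 MiB block of random data, like `dd if=/dev/urandom`
random_data = os.urandom(1024 * 1024)

# Zeros collapse to a tiny fraction of their original size,
# so the storage barely has to write anything.
print(f"zeros:  {len(zeros)} -> {len(zlib.compress(zeros))} bytes")
# Random data is incompressible, so the full payload hits the disk.
print(f"random: {len(random_data)} -> {len(zlib.compress(random_data))} bytes")
```

So a dd-from-zeros result on a compressed ZFS dataset mostly measures how fast the CPU can squash zeros, not how fast the drives can write.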
 
Would be nice to have some Samsung EVO vs Samsung QVO this time. Maybe then people will stop buying QLC SSDs for ZFS :)
Hehe, well, so far only SSDs with PLP are planned, so no consumer stuff ;)
 
Hehe, well, so far only SSDs with PLP are planned, so no consumer stuff
Would still be nice to have at least one SMR HDD + CMR HDD + TLC consumer SSD + QLC consumer SSD for comparison, so people can see what to expect when not buying proper SSDs with PLP.
 
The goal is to get the best possible performance from these NVMe drives. We've set up a software RAID within Proxmox and initially tried RAID10, but we're open to other suggestions if they might work better.
You need to consider your options with respect to your SQL database load, and also the default blocksize of your SSDs. You probably have to carefully tune the settings on all layers to get the best out of it without wasting space or performance.


Would still be nice to have at least one SMR HDD + CMR HDD + TLC consumer SSD + QLC consumer SSD for comparison, so people can see what to expect when not buying proper SSDs with PLP.
and will hopefully reduce the forum threads about this ;)
 
Hi there!

I read the benchmark PDF and found this command that was used for the tests:

Code:
fio --ioengine=libaio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio

But that will erase my data, and it also isn't suited to testing the already-created ZFS pool. So can I please ask you to help with properly testing the pool? Also, after reading those graphs, I am still unsure whether the performance is good: you got around 1.5GB/s sequential write with 3 drives, so I expected higher performance from 24 drives in RAID 10, but maybe I am completely wrong :)

I also read more about CEPH and yes, that is nonsense for me with one node :).

I am sorry if anything from me does not make sense.
 
You need to consider your options with respect to your SQL database load, and also the default blocksize of your SSDs. You probably have to carefully tune the settings on all layers to get the best out of it without wasting space or performance.



and will hopefully reduce the forum threads about this ;)
The blocksize should be 4K. The type of load is mostly WordPress sites; I'm not sure what else I should monitor. Can you give me a hint on what to study and check? Thanks!
 
Well, if you have a RAID 10-like pool, you should be able to detach one disk out of one of the mirrors to do a benchmark of the disk itself. I would let it run for 300 or 600 seconds on the raw disk nowadays, because consumer SSDs especially will only show their true performance once the buffer is full, and that can take a while.

Also keep in mind that a bs of 4k will benchmark IOPS, while a larger bs (e.g. 4M) will benchmark bandwidth. With sync & direct enabled you test the worst-case scenario, especially with numjobs and iodepth set to 1 as well.

Increasing the iodepth to larger numbers will result in nicer numbers and might be closer to a real life scenario.

And once you know what you can expect from a single disk, you can go up layer by layer to see what the result will be for a single client. These layers are: storage (e.g. ZFS), virtualization (test inside the VM) and the VM's filesystem. So you could attach a disk image without a FS to the VM and benchmark it directly, to see how much the FS in the VM costs you :)
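As a concrete starting point, here is a sketch of a non-destructive fio job file that targets a test file on the mounted pool instead of a raw device, so no data gets erased. The directory, size, and runtime are placeholders to adjust for your setup, and on ZFS releases before OpenZFS 2.2 the direct=1 option may be rejected, in which case just drop that line.

```ini
; Hypothetical fio job file - adjust directory, size, and runtime for your setup.
; Run with: fio zfs-pool-test.fio
[global]
; mountpoint of the pool/dataset to test; fio creates its own test file here
directory=/nvme
size=10G
; long runtime so SSD write caches fill up and sustained speed shows
runtime=300
time_based
ioengine=libaio
sync=1
; drop this line if your ZFS version rejects O_DIRECT
direct=1
group_reporting

[4k-sync-write]
; bs=4k measures IOPS; switch to bs=4M to measure bandwidth instead
rw=randwrite
bs=4k
iodepth=1
numjobs=1
```

The same job can then be rerun inside the VM (against a file on the VM's disk) to compare layer by layer, as described above.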
 
Well, if you have a RAID 10-like pool, you should be able to detach one disk out of one of the mirrors to do a benchmark of the disk itself. I would let it run for 300 or 600 seconds on the raw disk nowadays, because consumer SSDs especially will only show their true performance once the buffer is full, and that can take a while.

Also keep in mind that a bs of 4k will benchmark IOPS, while a larger bs (e.g. 4M) will benchmark bandwidth. With sync & direct enabled you test the worst-case scenario, especially with numjobs and iodepth set to 1 as well.

Increasing the iodepth to larger numbers will result in nicer numbers and might be closer to a real life scenario.

And once you know what you can expect from a single disk, you can go up layer by layer to see what the result will be for a single client. These layers are: storage (e.g. ZFS), virtualization (test inside the VM) and the VM's filesystem. So you could attach a disk image without a FS to the VM and benchmark it directly, to see how much the FS in the VM costs you :)
That sounds reasonable. Can you point me somewhere I can learn more about how to test it? The command listed in the benchmark gives an invalid-parameter error. To be honest, I just don't understand fio well enough, so is there a quick way to test it or to understand it?

Thank you!
 
and will hopefully reduce the forum threads about this ;)
I don't think fewer people will spam the forums with "why is my ZFS performance so bad...?!" threads.
But it would be very useful to be able to quote some official benchmarks. Right now it looks like people put more trust in the manufacturers' datasheets, promising multiple GB/s of write performance, than in some random dudes in the forums explaining why the performance of SMR/QLC or even TLC without PLP is so bad. People still buy these because they're a bit cheaper, ignoring that sync IOPS performance would be 50-100 times better by spending only 20% more on a proper SSD.
 
Hello @simoncechacek , and welcome to the community.

Storage performance is a tricky topic. If you are sure that your storage is not the bottleneck, you will find that it is essential to understand your CPU architecture to get the most out of it from inside the VM. You may be interested in the KB article on optimizing storage performance.
The article covers several considerations that will help you minimize latency. Many of the same concepts will help you optimize for bandwidth. As you read the "Hardware Concepts" section, remember that your EPYC 7H12 has a 16x4 configuration (i.e., 4 cores per CCX, distributed across 8 CCDs).

You may also find this KB article helpful.
It explores the optimal disk configurations for your VM and quantifies the performance impacts of iothreads, aio, and iouring.

Let me know if you find this helpful and if there are any other specific topics you would like us to cover.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hello @simoncechacek , and welcome to the community.

Storage performance is a tricky topic. If you are sure that your storage is not the bottleneck, you will find that it is essential to understand your CPU architecture to get the most out of it from inside the VM. You may be interested in the KB article on optimizing storage performance.
The article covers several considerations that will help you minimize latency. Many of the same concepts will help you optimize for bandwidth. As you read the "Hardware Concepts" section, remember that your EPYC 7H12 has a 16x4 configuration (i.e., 4 cores per CCX, distributed across 8 CCDs).

You may also find this KB article helpful.
It explores the optimal disk configurations for your VM and quantifies the performance impacts of iothreads, aio, and iouring.

Let me know if you find this helpful and if there are any other specific topics you would like us to cover.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Just a quick note: all the storage is in the server, not connected via network :) We're using HBAs from Gigabyte that were delivered with the server.
 
Just a quick note: all the storage is in the server, not connected via network :) We're using HBAs from Gigabyte that were delivered with the server.
You are correct. The articles reference NVMe over fabrics devices. However, the architectural concepts are applicable. The core differences are in the interrupt handling, which is slightly more nuanced for fabric-attached storage.

Good Luck!


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Ok, so I found a tutorial on Ars Technica to better understand fio. And my results are already higher (around 4GB/s) at both the OS level and the VM level.

And it's way better than our old controller-based server with SATA SSDs.

NVME: https://paste.brcb.eu/atolaqevak.yaml

virtual on NVME: https://paste.brcb.eu/erilogygut.apache (worth noting that this is exactly the same VM with its VMDK migrated and imported into QEMU - not a VM made freshly on Proxmox, but a VMware VM)

virtual on old SSD: https://paste.brcb.eu/ekajekokez.apache

I am struggling to find more tests to check latency and read/write speeds, and a way to measure everything correctly. After I solve this, I will try to remove the RAID10, test only one drive, and then test smaller RAIDs and one big one again, with and without cache.

That's another thing I want to ask: should I enable caches on the VMs and enable compression on ZFS? If yes, which compression, given that the main target is speed and IOPS?

Thanks!
 
If compression, then the default LZ4. It has very low overhead, and especially with slow devices like HDDs it can increase performance, as the slowdown from compression may be smaller than the gain you get from needing to read/write less data. But with NVMe the bandwidth shouldn't be the bottleneck, and LZ4 compression might hurt more than it helps by increasing latencies.
 
If compression, then the default LZ4. It has very low overhead, and especially with slow devices like HDDs it can increase performance, as the slowdown from compression may be smaller than the gain you get from needing to read/write less data. But with NVMe the bandwidth shouldn't be the bottleneck, and LZ4 compression might hurt more than it helps by increasing latencies.
Okay, that's what I thought - leaving compression off for the NVMes.

Now I am stuck at the point where I have made some tests but am unsure whether those tests (posted above) are all right. After creating a suite of tests, I will go on to testing one drive and then the RAID options.
 
Ok, so I found a tutorial on Ars Technica to better understand fio. And my results are already higher (around 4GB/s) at both the OS level and the VM level.

And it's way better than our old controller-based server with SATA SSDs.

NVME: https://paste.brcb.eu/atolaqevak.yaml

virtual on NVME: https://paste.brcb.eu/erilogygut.apache (worth noting that this is exactly the same VM with its VMDK migrated and imported into QEMU - not a VM made freshly on Proxmox, but a VMware VM)

virtual on old SSD: https://paste.brcb.eu/ekajekokez.apache

I am struggling to find more tests to check latency and read/write speeds, and a way to measure everything correctly. After I solve this, I will try to remove the RAID10, test only one drive, and then test smaller RAIDs and one big one again, with and without cache.

That's another thing I want to ask: should I enable caches on the VMs and enable compression on ZFS? If yes, which compression, given that the main target is speed and IOPS?

Thanks!
So what did you end up doing to improve performance?
 
Hello,

I am currently stuck in the middle of migrations.

We purchased the GRAID RAID card, but after months of trying to get it working, we decided to postpone our Proxmox migration by a year, and we now run VMware with the SSDs split per VM and heavy backups.


I also found a hardware issue in the setup - one of the drives was only connecting at Gen3 x2 and was slowing down the whole ZFS pool.

I will try to set up ZFS again during the year.
 
