High IOPS in host, low IOPS in VM

bsinha

Member
May 5, 2022
We have 4 HP NVMe drives with the following specs:

Manufacturer: Hewlett Packard
Type: NVMe SSD
Part Number: LO0400KEFJQ
Best Use: Mixed-Use
4KB Random Read: 130000 IOPS
4KB Random Write: 39500 IOPS

Server used for Proxmox: HPE ProLiant DL380 Gen10. All the NVMe drives are connected directly to the motherboard's storage controller.

Proxmox VE is installed on 2 x 480 GB SSDs in RAID-1. We have a guest VM running Ubuntu 22.04 LTS with 40 GB RAM and 14 vCPUs.
The storage configuration for the VM is:

SCSI Controller: VirtIO SCSI Single
Bus/Device: VirtIO Block
IO Thread: checked
Async IO: io_uring
Cache: No cache
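For reference, those GUI settings should correspond to something like the following lines in /etc/pve/qemu-server/<vmid>.conf (the storage name, VMID and size below are only placeholders, not our actual values; "No cache" is the default and is not written out explicitly):

scsihw: virtio-scsi-single
virtio0: nvme-lvm:vm-101-disk-0,aio=io_uring,iothread=1,size=100G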

We have used the following commands for benchmarking:

On the Proxmox host:
fio --ioengine=psync --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=600 --time_based --name write_4k --filename=/dev/nvme1n1
Write IOPS: 86K

In the guest VM:
fio --ioengine=psync --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=600 --time_based --name write_4k --filename=/dev/sdb
Write IOPS: 16K

Note: The drive for the VM is on one of the NVMes and configured as LVM.
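(The LVM storage itself would be defined in /etc/pve/storage.cfg roughly like this; the storage ID and volume group name below are placeholders:)

lvm: nvme-lvm
        vgname nvme-vg
        content images
        shared 0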

VM Configuration: (screenshot)

What am I missing here that is causing such a big IOPS difference?
 
The best performance is always achieved with the "host" CPU type. I'd recommend testing with that.
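If you prefer the CLI over the GUI for that, changing the CPU type should be something along the lines of (the VMID is a placeholder):

qm set 101 --cpu host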

You may also want to look at some of the tips here: https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/#tuning-procedure
Not everything will apply, as your storage is local.

Performance tuning requires looking at the system as a whole; disk IO is not only about disk IO.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
I have changed the CPU type to 'host',
kept the storage controller as virtio-scsi-single,
and tried both aio=io_uring and aio=native.

The write IOPS went from 16K to 17.5K. Not a significant increase. There must be some other bottleneck that needs to be addressed. Does anything else come to mind?
 
Isn't there going to be overhead from the VM's /dev/sdb, which is in fact an LV (through the hypervisor) on that drive and not the raw block device?
Your host is writing directly to the raw /dev/nvme1n1, which it sees as a block device.

If you did a passthrough of the NVMe (PCIe?) to the VM, you could probably draw a fair comparison.
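As a rough sketch (the PCI address and VMID below are placeholders), that would mean identifying the drive's PCI address on the host and handing the whole device to the VM; note the host must not be using that drive for anything:

lspci -nn | grep -i -E "nvme|non-volatile"
qm set 101 --hostpci0 0000:5e:00.0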
 
Isn't there going to be overhead from the VM's /dev/sdb, which is in fact an LV (through the hypervisor) on that drive and not the raw block device?
Your host is writing directly to the raw /dev/nvme1n1, which it sees as a block device.
Is it normal that the host gets 86K write IOPS and the guest only 17K write IOPS with the same disk? It seems absurd. If the guest achieved at least 50K write IOPS, that would be fair enough.

I am still sure that I am missing something. Maybe some configuration on the VM side, or maybe I need to install some sort of storage driver in the guest VM. I do not know; I am just guessing. Any help would be appreciated.
 
Isn't there going to be overhead from the VM's /dev/sdb, which is in fact an LV (through the hypervisor) on that drive and not the raw block device?
Your host is writing directly to the raw /dev/nvme1n1, which it sees as a block device.
There is absolutely a penalty, but with the right tuning on the right hardware it can be minimized.
You can find the summary of our study here: https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/#summary
There is also overhead for LVM, but it shouldn't be that significant.

@bsinha performance is highly hardware-dependent, but there are no special hardware drivers to install. I'd recommend looking at the CPU/NUMA layout and possibly reducing the number of vCPUs. Unfortunately, the forum is not the optimal venue for such troubleshooting.
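For a quick look at the CPU/NUMA layout on the host, something like this is usually enough as a starting point:

lscpu | grep -i numa
numactl --hardware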

Good luck.



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Suggestions for analysis:

Just for testing, maybe try an alternate SCSI controller type, perhaps starting with the default (LSI 53C895A).
Try an alternate (newer/older) Ubuntu version in the VM.

Happy hunting.
 
Suggestions for analysis:

Just for testing, maybe try an alternate SCSI controller type, perhaps starting with the default (LSI 53C895A).
Try an alternate (newer/older) Ubuntu version in the VM.

Happy hunting.
Which Ubuntu version? Did you face the same problem before?
 
Did you face the same problem before?
I did not, but nor have I run your specific scenario/test. I believe you should try it (not very hard on what appears to be a testing-ground datacenter); you may learn something from the results.

One more thing. You mention above:
Is it normal that the host gets 86K write IOPS and the guest only 17K write IOPS with the same disk?
I don't think that is factually correct.
Your host test is on nvme1n1, but your RAW-DISK is nvme0n1.
They are different NVMes. Maybe something is up with one of them or their connectors.
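A quick way to double-check which physical drive is which on the host (nvme list comes from the nvme-cli package):

nvme list
lsblk -o NAME,MODEL,SERIAL,SIZE,TYPE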
 
I did not, but nor have I run your specific scenario/test. I believe you should try it (not very hard on what appears to be a testing-ground datacenter); you may learn something from the results.
Sure. Maybe I'll come up with a better result.
 
Just to continue from my edited post (see what I added to my last post): as a test scenario, you could try swapping the two disks in their connectors/enclosures for further testing. Probably a lot of HW/SW hassle, but again it could produce interesting results.

Sure. Maybe I'll come up with a better result.

Yes, we may all end up learning a lot. In my experience, during testing there are ALWAYS surprises, some good & some not so good.
 
I did not, but nor have I run your specific scenario/test. I believe you should try it (not very hard on what appears to be a testing-ground datacenter); you may learn something from the results.

One more thing. You mention above:

I don't think that is factually correct.
Your host test is on nvme1n1, but your RAW-DISK is nvme0n1.
They are different NVMes. Maybe something is up with one of them or their connectors.
They are of the same make and model. Also, I swapped the NVMes before, and only then did I create this post.

Finally, changing the storage controller does not change the IOPS count.
 
I did the same test on a host and an Ubuntu VM, but not on NVMes, just a crap server:

host: (pve 8.1 kernel 6.8.4)
Jobs: 1 (f=1): [W(1)][75.0%][w=1796KiB/s][w=449 IOPS][eta 00m:05s]

VM: Ubuntu 23.10
Jobs: 1 (f=1): [W(1)][75.0%][w=724KiB/s][w=181 IOPS][eta 00m:05s]

Seems to me like the same degradation; I would simply say that this is normal then.
Settings are all optimized: host CPU, VirtIO Block, noatime, and so on, the usual stuff we all do for max performance.

Cheers
 
I am wondering: is this what we should expect, or can we bring it closer to the raw disk IOPS? 20% degraded IOPS can be tolerated, but around 80% IOPS degradation is a disgrace.
 
I am wondering: is this what we should expect, or can we bring it closer to the raw disk IOPS? 20% degraded IOPS can be tolerated, but around 80% IOPS degradation is a disgrace.
I think there is really nothing we can do, tbh.
Not even the Proxmox team; they don't develop the virtio drivers or QEMU or KVM.

Maybe there are some hacky arguments, some parameters that you can add to your QEMU config for the VM, but that's above my knowledge.

You know, as a company you can solve these issues with money; I'm doing this in my company all the time. (As an admin in a bigger company.)
For example, our ERP application gets freakishly slow because it's proprietary crap and uses some sort of file-based database with no manual or anything on how to optimize it; it's a database developed in-house by the ERP vendor.
So the only way to speed up this crap is to buy new servers with the fastest U.3 datacenter SSDs and maximize the memory bandwidth by populating all channels, and the most lanes and channels are on Genoa... To make it redundant, we need two of these servers.
So we buy new servers so that this piece of crap ERP system runs faster xD

It's a bad comparison to QEMU/KVM/virtio, because they are amazing, but you cannot do much, same as me with that ERP system.

But don't forget that we tested only one VM, so multiple VMs will probably get 80-90% of raw speed; just one can't.
Maybe because of iodepth=1 in fio and iothread=1 in the VM config. We simply have to "parallelize" more.

Cheers :)

PS: I think what @bbgeek17 said is correct; we should try out the suggestions from Blockbridge.
 
FWIW, with a single Samsung SM963 480 GB: same 17k IOPS on the host (ext4 + LVM-Thin) and in a Windows VM.
With iodepth=32: host = 200k & ~860 MB/s, Windows VM = 95k & ~370 MB/s.
 
But don't forget that we tested only one VM, so multiple VMs will probably get 80-90% of raw speed; just one can't.
Maybe because of iodepth=1 in fio and iothread=1 in the VM config. We simply have to "parallelize" more.
If I do jobs=5 then I get around 80K write IOPS in the VM, which is equivalent to the host. That is what your suggestion is, right?
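(Assuming "jobs=5" here means --numjobs=5 in the same fio command as before; adding --group_reporting makes fio print the combined figure:)

fio --ioengine=psync --direct=1 --sync=1 --rw=write --bs=4K --numjobs=5 --iodepth=1 --group_reporting --runtime=600 --time_based --name write_4k --filename=/dev/sdb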
 
What am I missing here that is causing such a big IOPS difference?

This is simply the virtualisation of IOPS; it adds latency. (Look at your VM's CPU usage on the host; maybe you'll have one vCPU core at 100%.)
Do you really have applications doing more than 16,000 4k sync writes with iodepth=1?

Of course, it'll increase with parallel writes, and generally applications do not issue a sync for each write. (Maybe a database with transactions, if you do one transaction/sync for each insert into the database, but generally you group transactions.)


You should be able to improve iodepth=1 performance with a higher-frequency CPU.
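One simple way to check that during the benchmark is to watch the per-thread CPU usage of the VM's QEMU process on the host, for example (the VMID 101 is a placeholder):

top -H -p $(cat /var/run/qemu-server/101.pid)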
 
If I do jobs=5 then I get around 80K write IOPS in the VM, which is equivalent to the host. That is what your suggestion is, right?
Yeah, with one VM, one job, iodepth=1 and iothread=1, you are disabling parallelism as much as possible xD

For example, if you have a PVE node with many cores and just a few VMs (or, to be more precise, just a few virtual drives in total), you could assign more IO threads to the drives.

If I have, for example, 128 cores in the host and only 5 VMs, and each VM has only one disk, so 5 disks in total, this will lead to 5 IO threads in total.
But since a 128-core CPU is very slow in clock speed yet has a lot of unused cores, I would set something like iothread=5 on each disk, so I get 25 IO threads, which is way more balanced for that specific scenario.

With a 32-core CPU and 12 VMs/disks, I would leave iothread=1, unless one VM somehow needs much more disk performance. But I think there is a small penalty: with more IO threads the IO latency will probably increase slightly, but I don't think that is a concern.

------
With iodepth=1 you basically disable the IO queue, which makes your benchmark synchronous instead of asynchronous.
Synchronous just means each I/O request must complete before the next begins.
-> So the system sends only one IO operation to the NVMe firmware and waits for the "complete" reply from the firmware. Only after that does it send the next operation...

But NVMes, and especially the NVMe protocol itself, were designed to allow as many IO operations as possible at the same time, so the queues were greatly enlarged.
Queue = how many operations you can send at once to the NVMe.

Basically your system should send as many IO operations as possible to the NVMe firmware without waiting for the "complete" reply.
In practice it never goes quite that far, because at some point your system needs to know whether they were written or not...

So there are limits to iodepth, up to 64k or something like that.
iodepth=32 means that fio will queue up to 32 I/O operations to the NVMe drive before waiting for any to complete. This setup is considered asynchronous because it does not require each operation to complete before the next one is issued, which allows multiple operations to be in flight at the same time.

Some people don't really know the difference between synchronous and asynchronous operations, so I am writing this only for general understanding.
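To see the difference in practice, the same 4k write test can be run asynchronously with a deeper queue, for example (note there is no --sync=1 here, and the target device is just an example):

fio --ioengine=libaio --direct=1 --rw=write --bs=4K --numjobs=1 --iodepth=32 --runtime=60 --time_based --name write_4k_async --filename=/dev/sdb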
-----

In your scenario, you want to know why synchronous operations on the host are so much faster than on the guest.
So increasing the iodepth makes no sense for finding out why that happens, because that was not the question, right?
That's why I ignore all the other comments that are testing asynchronously.

And that's why I told you that with more "parallelism" you can get more IOPS; I meant, tbh, any form of parallelism in general.

What I'm wondering about is that with more jobs on the guest VM you are getting more IOPS.
Basically, what you are doing with more jobs is running more synchronous jobs.

So the NVMe still has to reply to every IO operation.
And directly on your host the replies are clearly a lot faster.

Could you run multiple jobs on the host as well? Like comparing the IO speed of 5 jobs on the host vs 5 jobs on the VM?
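Something like this on both sides would make it directly comparable; only the --filename differs (host device shown here, use the VM's disk inside the guest):

fio --ioengine=psync --direct=1 --sync=1 --rw=write --bs=4K --numjobs=5 --iodepth=1 --group_reporting --runtime=60 --time_based --name write_4k --filename=/dev/nvme1n1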

Otherwise it's a core-limit issue inside the VM, because nothing else explains why 5 jobs are faster than one.
EDIT: @spirit is right; I think the same.
 
