Disk Speeds inside a VM.

Going nuts over the same issues people describe above. On bare metal we get 2500+ IOPS, no problem, on Gen3 PCIe NVMe enterprise SSDs... in a VM, 700. Using an LXC container, the same tests get around 1500 IOPS, much better, and that would be totally fantastic if we could get that in our VMs. (Every VM has its own dedicated PCIe NVMe enterprise SSD attached to it; Dell R640, 512 GB RAM, dual 3 GHz Xeon Gold, 72 cores.)

Serenity now... I am at a loss. I just can't figure out why we can't even get 50 percent of bare metal. Maddeningly frustrating.

P.S. Got my hands on a Dell R750 with Gen4 NVMe SSDs. The host gets 11K+ IOPS while the guest gets about 2500, so that would work for our needs (we need above 1000 IOPS to support the DB application we are attempting to run), but server costs go from about $4k each to $12k for the config and storage we need using Gen4.
 
Serenity now... I am at a loss. I just can't figure out why we can't even get 50 percent of bare metal. Maddeningly frustrating.
In the bare-metal benchmark the data goes from the 8 CPU caches (it is running on 8 cores) over the PCIe bus to the NVMe.

In the VM the data is copied from the 8 CPU caches of the different cores to the cache of a single core for the virtual block device, because the virtio driver is (or at least was) single-threaded by default. Then another single core copies the data onward in the kernel running on bare metal. During these copies between CPU caches, slower RAM may be used.

Every VM has its own dedicated PCIe NVMe enterprise SSD attached to it
Using PCIe passthrough may help, but I do not know the impact of the IOMMU.
In both cases additional copies may be necessary if the PCIe HBA cannot reach the memory by DMA.

So run top -H on the host (it displays individual threads, and a single CPU core can only be utilized up to 100%) while running the benchmark in the VM, to see the real CPU usage or limits.
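For example (just a sketch of what to look at; the exact thread names differ by setup):

Code:
# on the Proxmox host, in a second shell, while the benchmark runs inside the VM
top -H
# watch for a single kvm / iothread / vhost thread sitting near 100% CPU --
# that one busy thread is the per-core limit described above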

Real applications do not usually have such high I/O requirements, as the data must be processed somewhere.
 
Disks passed through via USB or PCIe seem to work at full speed in the VM; the issue, for me at least, is that the OS disk running on LVM-thin is slow.
 
A block device is probably slower than PCIe passthrough due to the copying of data by the CPU. It is important to understand that PCIe passthrough means enabling the IOMMU and giving the NVMe to the VM as a PCIe device, not as the block device itself or as a block device created from an MBR or GPT partition on the NVMe, as suggested in the first results on Google.
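A rough sketch of how that looks on Proxmox (VM ID 101 and the PCI address 0000:03:00.0 are placeholders; check lspci for your device, use the q35 machine type for pcie=1, and note that a passed-through disk ties the VM to that host):

Code:
# /etc/default/grub: enable the IOMMU (Intel example), then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# find the PCI address of the NVMe
lspci -nn | grep -i nvme

# give the whole PCIe device (not a partition or block device) to VM 101
qm set 101 -hostpci0 0000:03:00.0,pcie=1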

According to my tests on Proxmox, LVM-thin is 50% slower when resizing the thin volume is involved.
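If you want to take that allocation overhead out of a benchmark, one common trick (an assumption on my side, not something I measured here) is to fill the virtual disk once inside the guest before testing, so the thin pool is already fully allocated:

Code:
# inside the VM, assuming the LVM-thin backed disk is /dev/sdb -- this destroys its contents
dd if=/dev/zero of=/dev/sdb bs=1M oflag=direct status=progress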
 
I was looking into using passthrough instead of virtio. Do you happen to have a guide on how to implement this properly? I've Google-fu'd it and it's confusing... Also, if you use direct disk passthrough you lose any kind of migration ability, correct? A certain application we are running requires IOPS over 1000 to 'pass' certification; if I can't show that, then Proxmox isn't going to be the answer to our manageability problem, and we'll go the bare metal or Docker-in-an-LXC route (which seems to work awesome, but I don't know what will happen if we do that for 20 machines, or if we will run into other issues). Thank you so much for taking the time to reply. Much appreciated.
 
I was looking into using passthrough instead of virtio. Do you happen to have a guide on how to implement this properly?
No, I have not. A few years ago I tested PCIe passthrough of a graphics card. After some BIOS updates I managed to get it working. I would only recommend it for machines with high I/O requirements and low availability requirements. "Premature optimization is the root of all evil" (Donald Knuth).
I've Google-fu'd it and it's confusing... Also, if you use direct disk passthrough you lose any kind of migration ability, correct?
For migration of running VMs you need shared storage for the images, something like a SAN, iSCSI, NAS, Ceph, or DRBD. Containers (Docker or LXC) are usually migrated by restarting. Fine for short-running systems, unusable for simple batch jobs that cannot be restarted.
A certain application we are running requires IOPS over 1000 to 'pass' certification
How are these IOPS measured? Is only one writing process allowed, is synchronous writing needed, or is caching allowed? 1000 is not a high number; a single SATA SSD can do that using multiple threads. Try fio with --numjobs=128.
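Something along these lines (a sketch only; /dev/vdb, the block size and runtime are placeholders to adapt, and writing to the raw device destroys its contents):

Code:
fio --name=iopstest --filename=/dev/vdb --ioengine=libaio --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=128 --runtime=60 --time_based --group_reporting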
As mentioned, virtio works with one thread per block device, so this may be a limit. If allowed by your application, you can use writeback instead of writethrough caching.
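On Proxmox the cache mode and an extra I/O thread are per-disk options; a hedged example (VM 101 and the storage/disk names are placeholders, and remember that writeback can lose cached data on a power failure):

Code:
# one iothread per disk plus writeback caching for an existing disk of VM 101
qm set 101 --scsihw virtio-scsi-single --scsi0 local-lvm:vm-101-disk-0,cache=writeback,iothread=1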
Network code has been async for a long time already. When using NFS (version 4.1 or 4.2, please) or iSCSI for storing the VM disk images, you will get better results by connecting the VM directly over the network to the storage instead of using disk images. You can also buy a NAS with a UPS, which allows using the write cache in the NAS (by exporting NFS async).
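For example (names are placeholders; async on the export is only safe with the NAS on a UPS):

Code:
# on the NAS, /etc/exports: let the NAS cache writes
/srv/vmdata  10.0.0.0/24(rw,async,no_subtree_check)

# inside the VM, mount the share directly over the network with NFS 4.2
mount -t nfs -o vers=4.2 nas.example.com:/srv/vmdata /mnt/data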
Assuming you want to store something like MySQL binlogs (transaction logs), 1000 IOPS (linear writes) to a single file, synchronized, are not easy, especially when HA is needed; 1 ms per synced write is not much time. Better-programmed apps switch SQL autocommit off and use transactions, so a sync on the binlog is only necessary at the end of a transaction. Tables and indices may also not need to be written synchronously.
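The idea in SQL terms, as a sketch (the table names are made up):

Code:
SET autocommit = 0;
START TRANSACTION;
INSERT INTO orders (id, total) VALUES (1, 9.99);              -- buffered, no sync yet
INSERT INTO order_items (order_id, sku) VALUES (1, 'A-100');
COMMIT;                                                       -- only here the binlog must be synced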
Generally I would not suggest running a high-performance SQL DB in a VM, only on bare metal or in a container.
If I can't show that, then Proxmox isn't going to be the answer to our manageability problem, and we'll go the bare metal or Docker-in-an-LXC route (which seems to work awesome, but I don't know what will happen if we do that for 20 machines, or if we will run into other issues). Thank you so much for taking the time to reply. Much appreciated.
Containers use fewer resources than VMs and offer nearly bare-metal performance. But there are other things to consider:
If you already have a "manageability problem", will additional tools help you or just make it more complicated?
Do you have HA requirements? Do you need HA at the OS level, or can your application provide it, e.g. DB logfile shipping? You asked about OS migration, but you use local disks?
How often do you need to patch your systems? A Docker container is stateless; it needs to be patched by a developer, and you need a pipeline for updates. It is not just running apt or yum and rebooting anymore.
There is no single right solution, as someone must be able to pay for it, and you must be able to understand it to keep it running.
 
Wow, that's quite the reply. Thank you. I even understand some of those words :).

Docker inside LXC containers: pros and cons, what do you know about it? Initial testing is INSANELY good... after some initial tweaking to prevent the LXC from pegging the 72 cores and immediately filling the swap, it's now running like a champ, ingesting about 100 Mbps with almost zero SSD latency and 50 percent CPU over 12 cores. I have about 15 other images running as standard VMs, all on separate PCIe NVMe SSDs, and they all seem good as well; I haven't noticed any decrease in performance at all. I'm monitoring queue, wait time, etc. via Zabbix and everything appears stable.
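For reference, the kind of limits I mean look roughly like this (container ID 200 and the numbers are placeholders, not my exact config):

Code:
# cap the container at 12 cores, 128 GB RAM and no swap
pct set 200 --cores 12 --memory 131072 --swap 0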

I do worry that I'm missing something; where is the 'payment' going to come from? Average CPU processing time is up from 10 percent to about 24 percent now, which is the only overall metric I've noticed increasing, but that seems reasonable. What do you think? Do you do any consulting work? I really would love a second pair of eyeballs.
 
I do worry that I'm missing something; where is the 'payment' going to come from? Average CPU processing time is up from 10 percent to about 24 percent now, which is the only overall metric I've noticed increasing, but that seems reasonable. What do you think? Do you do any consulting work? I really would love a second pair of eyeballs.
Thanks for the positive feedback. I am still learning LXC and do not know much more than what is written in the wiki. As it reduces the number of layers, it is much faster than para-virtualization. As long as you run it as "unprivileged", it is in the worst case as secure as running an application directly on the Linux OS, but actually more secure, as you can limit the resources used, like CPU, and hide other resources, like files. It is difficult to estimate the security risks: the security depends on the security of a single Linux kernel, so one security hole may be sufficient to get root access to the host, while in a VM an additional security hole in the para-virtualization is necessary. For apps not connected directly to the internet, I have seen no problems yet. (But a few years ago I also could not imagine security issues like Spectre.) There are a lot of reports about running Docker in LXC, including here in this forum.
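The setup usually reported for Docker inside LXC is an unprivileged container with nesting enabled, roughly like this (container ID 200 is a placeholder, and I have not verified every combination myself):

Code:
# /etc/pve/lxc/200.conf (excerpt)
unprivileged: 1
features: nesting=1,keyctl=1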

As long as CPU load increases together with DB transactions per unit of time, it is fine, and the CPUs are not waiting for I/O anymore.

Well, I did consulting for some years in my career, but not much in virtualization. A good consultant (good for his employer, not for his customer) would use virtualization, as this is best practice, sell you new hardware, and help you get it running, paid by the hour, of course. Most customers need consultants to support their opinions (a prophet has no honor in his own country) or as someone to blame when it fails. They do not want other opinions or suggestions for optimizations, as that would prove they did something wrong. This is still my favourite joke. So try to understand how the architecture works and where its limits are, and test yourself.

Merry Christmas and a happy new year.
 
Thank you!! All good stuff. The funny part is I AM the consultant and have been doing IT consulting for 25+ years (I started my business pretty young!!) :).

I am not at all averse to getting outside help. I believe there's nothing wrong with that, and I'm OK with not knowing everything in the world. The older I get, the more I learn, and the more I understand that I simply don't know everything, nor should I expect that of myself. I wish more IT consultants thought like that as well, because it's frustrating when someone "Dunning-Krugers" me... I know enough to recognize and respect brilliance in others, which is why I always question my own work and am happy to have qualified people look it over and provide feedback.

That said, initial testing is very promising. I have very granular monitoring set up via Zabbix and Graylog, and each of my VMs/containers has application-specific Prometheus/Grafana metrics as well, so if something is going on I'm pretty confident I'm going to see it, good or bad. My big concerns are just making sure I'm not taking resources away from my other VMs while providing near-bare-metal performance to this one write-intensive DB application... and, of course, security, which is much harder for me to evaluate, but it is in an unprivileged container and access is very restricted via multiple layers.

All I see are negatives about running Docker in an LXC, but no real specifics, so I am going to follow your advice and keep testing myself. I may very well write up a guide for this industry, provide my config, and see if others out there want to replicate it and test on their own. That may get me the external second-pair-of-eyeballs feedback I'm looking for. It's a very young industry I'm working in for this particular project, and I'm basically writing the how-to guide on the fly... Thank you so much again for taking the time to write your detailed responses. Very much appreciated. Best to you!!
 