LVM disk latency issue

PatrickD25 · Sep 23, 2024

Context : We want to run MySQL databases on Proxmox (kvm).
Problem : MySQL is quite slow on kvm. Compared to native speeds or Xen speeds, it is up to 5x the latency.
Setup : 3-node Proxmox cluster. On 1 node, we add a dedicated SSD disk for Mysql, and use LVM to mount it in the VM.

VM disk info :
scsi3: vm-data-lvm:vm-104-disk-2,iothread=1,size=150G
scsihw: virtio-scsi-single
cache: no-cache
Thick provisioning.

Tool used to test :
ioping -S64M -L -s4k -W -c 10 .
Latency Results in VM on XFS : (ext4 is about 200us higher average)
min/avg/max/mdev = 562.4 us / 663.2 us / 742.1 us / 51.1 us

Same LV, mounted on the host :
min/avg/max/mdev = 120.5 us / 149.5 us / 183.3 us / 20.9 us

Tests in a Xen VM is about as fast as native speeds.

I was expecting something like 5% to 10% slower/higher latency, but this is a bit extreme

For info : We tested a LOT of settings. XFS/EXT4, cache settings, but could not get close to native speeds.
We also tested with sysbench using oltp_write_only and the results were similar.

Is this something we should expect from kvm ?
Anyone seeing better results ?

I guess not knowing if this is normal or not is a big issue here. Any help would be greatly appreciated !

Max Carrara · Sep 25, 2024

Could you perhaps share your VM's config? You can do so via the qm config <vmid> command.

PatrickD25 · Sep 25, 2024

boot: order=scsi0;ide2;net0
cores: 8
cpu: host
ide2: local:iso/debian-12-amd64-20240403.iso,media=cdrom,size=1225200K
memory: 65536
meta: creation-qemu=8.1.5,ctime=1717009155
name: <server>.<domain>.org
net0: virtio=BC:24:11:8C:C9:4F,bridge=vmbr0,tag=281
numa: 0
ostype: l26
scsi0: vm-data:vm-104-disk-0,iothread=1,size=250G
scsi1: vm-data-lvm:vm-104-disk-0,iothread=1,size=500G
scsi2: vm-data-lvm:vm-104-disk-1,iothread=1,size=150G
scsi3: vm-data-lvm:vm-104-disk-2,iothread=1,size=150G
scsihw: virtio-scsi-single
smbios1: uuid=9cf95732-a0f7-4945-8e45-566ffe2b77ef
sockets: 2
unused0: vm-data:vm-104-disk-1
vmgenid: f16debc8-340b-4c60-81da-efb84bbfb2d7

Max Carrara · Sep 26, 2024

Hmm, I don't see anything wrong with your config. I'm curious though, is there are reason for you to have two sockets? Do you happen to have two NUMA nodes on your host? If so, you can try activating NUMA for your VM.

If that's not the case though, can you try setting the number of sockets to one? Unless you really need two sockets, that is. It shouldn't make a performance difference, but since you've already been pretty thorough, it's worth fiddling with IMO.

Could you also share some more details about your underlying storage? Do you have HW RAID or anything of the sort?

Also, do you spot anything in your VM's or host's logs? dmesg, journalctl -x, etc.

PatrickD25 · Sep 26, 2024

The SSD is a hardware RAID (Dell server). But the tests were done using the same LV mounted in VM and mounted directly on the host.
As for Numa, we did some tests and saw no improvements. It would probably make a difference at high IO on network.

The same mount point, but in a LXC container proves to be as fast as host, but we would prefer using kvm.

What about on your side ? Do you have access to a Proxmox host with a local LV on a SSD drive ? Can you see if you get the same difference between VM and Host mounted on your side ?

PatrickD25 · Sep 26, 2024

I found this thread on reddit. No answer to it... But same issue?
https://www.reddit.com/r/Proxmox/comments/11yxiyo/disk_performanceiops_surprising/

Max Carrara · Sep 27, 2024

PatrickD25 said:
The SSD is a hardware RAID (Dell server). But the tests were done using the same LV mounted in VM and mounted directly on the host.
As for Numa, we did some tests and saw no improvements. It would probably make a difference at high IO on network.

The same mount point, but in a LXC container proves to be as fast as host, but we would prefer using kvm.

What about on your side ? Do you have access to a Proxmox host with a local LV on a SSD drive ? Can you see if you get the same difference between VM and Host mounted on your side ?

Very interesting, thanks for letting me know!

PatrickD25 said:
What about on your side ? Do you have access to a Proxmox host with a local LV on a SSD drive ? Can you see if you get the same difference between VM and Host mounted on your side ?

I'll see what I can do. I'll report back once I got some results.

Since you said you already tried changing the cache settings, have there been any noticeable differences between the different cache modes (none, writethrough, writeback, directsync)?

Also, have you tried something other than LVM yet, e.g. ZFS? Do note that ZFS isn't compatible with HW RAID at all, so it's best to set your HW RAID card to JBOD mode if you do decide to try it (and then mirror your RAID configuration in ZFS, naturally). I'm not sure if you're experienced with ZFS, but there's a recordsize option that you can change for individual volumes, so you can e.g. set it to 16K to mirror the (default) page size of MySQL. That way you should theoretically be able to eke out some extra performance. Just an idea I wanted to share.

Max Carrara · Sep 30, 2024

PatrickD25 said:
What about on your side ? Do you have access to a Proxmox host with a local LV on a SSD drive ? Can you see if you get the same difference between VM and Host mounted on your side ?

So, I've done some extensive testing on one of our servers (wiped it beforehand and set up PVE from scratch) and my findings pretty much align with yours.

What I've tested were the following constellations on a pretty decent NVME with fio:

Host

pve-manager/8.2.7/3e0176e6bb2ade3b (running kernel: 6.8.12-2-pve)

Bare ext4
ext4 on LVM
Bare xfs
xfs on LVM
ZFS (single disk, dataset with primarycache=metadata, compression=on, recordsize=128K)

Guest

6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux

Code:

# qm config 100
agent: 1
boot: order=scsi0;ide2;net0
cores: 8
cpu: host
ide2: none,media=cdrom
memory: 32768
meta: creation-qemu=9.0.2,ctime=1727455008
name: deb-bench-test
net0: virtio=BC:24:11:9D:4E:B8,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-100-disk-0,iothread=1,size=32G,ssd=1
scsi1: lvm-bench-vm:vm-100-disk-0,cache=writeback,iothread=1,size=200G,ssd=1
scsi2: zfs-bench-vm-single:vm-100-disk-0,cache=writeback,iothread=1,size=200G,ssd=1
scsi3: lvm-bench-vm:vm-100-disk-1,cache=writeback,iothread=1,size=200G,ssd=1
scsi4: zfs-bench-vm-single:vm-100-disk-1,cache=writeback,iothread=1,size=200G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=fa27540b-2ff2-470a-9fdc-e52ad4301f97
sockets: 1

ext4 on LVM (non-thin) storage
xfs on LVM (non-thin) storage
ext4 on ZFS (primarycache=all, compression=on, volblocksize=16K)
xfs on ZFS (primarycache=all, compression=on, volblocksize=16K)

For guests In general, IOPS (and bandwidth) are down by quite a bit for smaller reads and writes, when compared to the same FS being used on the host; however, both IOPS and bandwidth seem to be equal to the same filesystem on the host for larger reads and writes. This is kind of what I'd expect anyways.

What's also to be expected is that ZFS performance is tanking quite a bit compared to using it on the host, as there's no way for its ARC to properly cache anything in the benchmarks I've made. There's probably also some minor overhead due to compression. (And it's ZFS on a single disk, which is not something you usually wanna do anyways).

Either way, I digress; I've found that IO latencies are quite a bit higher than on the host itself, which aligns with your findings. For comparison:

ioping -S64M -L -s4k -W -q on host (counts: 10, 100, 1000):

Code:

--- /mnt/bench/ext4 (ext4 /dev/nvme0n1p1 245.0 GiB) ioping statistics ---
9 requests completed in 654.2 us, 36 KiB written, 13.8 k iops, 53.7 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 69.1 us / 72.7 us / 78.9 us / 3.07 us

--- /mnt/bench/ext4 (ext4 /dev/nvme0n1p1 245.0 GiB) ioping statistics ---
99 requests completed in 7.68 ms, 396 KiB written, 12.9 k iops, 50.4 MiB/s
generated 100 requests in 1.65 min, 400 KiB, 1 iops, 4.04 KiB/s
min/avg/max/mdev = 68.0 us / 77.5 us / 102.4 us / 7.16 us

--- /mnt/bench/ext4 (ext4 /dev/nvme0n1p1 245.0 GiB) ioping statistics ---
999 requests completed in 79.9 ms, 3.90 MiB written, 12.5 k iops, 48.8 MiB/s
generated 1 k requests in 16.7 min, 3.91 MiB, 1 iops, 4.00 KiB/s
min/avg/max/mdev = 43.6 us / 80.0 us / 143.2 us / 11.7 us

ioping -S64M -L -s4k -W -q in VM (counts: 10, 100, 1000):

Code:

--- /mnt/bench/lvm-ext4 (ext4 /dev/sdb1 195.8 GiB) ioping statistics ---
9 requests completed in 3.72 ms, 36 KiB written, 2.42 k iops, 9.46 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 286.4 us / 413.0 us / 476.2 us / 55.5 us

--- /mnt/bench/lvm-ext4 (ext4 /dev/sdb1 195.8 GiB) ioping statistics ---
99 requests completed in 41.8 ms, 396 KiB written, 2.37 k iops, 9.24 MiB/s
generated 100 requests in 1.65 min, 400 KiB, 1 iops, 4.04 KiB/s
min/avg/max/mdev = 233.2 us / 422.7 us / 571.7 us / 47.5 us

--- /mnt/bench/lvm-ext4 (ext4 /dev/sdb1 195.8 GiB) ioping statistics ---
999 requests completed in 418.8 ms, 3.90 MiB written, 2.38 k iops, 9.32 MiB/s
generated 1 k requests in 16.7 min, 3.91 MiB, 1 iops, 4.00 KiB/s
min/avg/max/mdev = 182.7 us / 419.2 us / 658.9 us / 55.4 us

I believe that this is very much due to the virtualization overhead; though, specifically for latency, I cannot say if this has always been the case or not, perhaps some of the more experienced users can weigh in here. In my personal workloads latency was never really an issue (low latency was never really a requirement / I never really hit any points where latency ended up mattering).

I'm not sure what Xen does in particular to make it that fast; perhaps it just passes any IO through directly to LVM, or it does whatever it wants and tells you it's done with whatever operation you give it (kind of like setting cache=unsafe on a disk in PVE, perhaps?). I think the best bet would be to just use a CT if you can, if you really want to minimise the overhead. Alternativel you could also try mounting some kind of network storage inside the VM, if you happen to have one that's particularly fast.

Also, there are a couple extra resources I dug up that might be of interest to you:

Passing a physical disk through to a VM -- from our docs: https://pve.proxmox.com/wiki/Passthrough_Physical_Disk_to_Virtual_Machine_(VM)
- You could e.g. pass through your SSD directly, but you'll miss out on a bunch of features like being able to back up your disk via PVE directly
Low Latency Storage Optimizations for Proxmox, KVM, & QEMU -- by Blockbridge: https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/
Optimizing Proxmox: iothreads, aio, & io_uring -- by Blockbridge: https://kb.blockbridge.com/technote/proxmox-aio-vs-iouring/

You might want to look into optimising MySQL itself, if you haven't already. Specifically, if there's a way to reduce IOPS by using larger RW ops instead of a bunch of smaller ones, you might want to give that a shot.

I hope that helps!

bbgeek17 · Sep 30, 2024

Hi PatrickD25,

Here's where I would start:

- Determine what NUMA node your HBA is connected to.
- Set an affinity policy to bind your VCPUs to said NUMA node.
- Make sure you have at least one free physical CPU core on the NUMA node.
- Ensure that the disk scheduler for your block device on bare metal is set to "noop"
- Ensure that the disk scheduler for your QEMU block device is set to "noop".
- Test the various rq_affinity settings for the block device on bare metal.

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

guruevi · Sep 30, 2024

Xen (at least the paid version) does cache reads and if I'm not mistaken write caching too, it kind of assumes you are on 'professional' (proprietary) hardware with BBU or networked storage.

There are differences in speed using a raw device, vs an LVM device also thin provisioned vs thick provisioned. It is hard to make an exact comparison between products without also evaluating all the underlying assumptions of each 'default' installation, some which may not seem related to storage (eg. C-states, interrupt handling, buffers etc) which may improve certain workloads in some conditions and make others worse (in case of CPU exploits, which Xen "works around" because actually fixing them requires massive changes).

I personally saw huge improvements by simply disabling C-states in the UEFI of our servers as this article states, and the power consumption isn't all that different:
https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/
If you trust your guests, disable CPU exploit mitigations, that will also boost performance.

PatrickD25 · Oct 7, 2024

I am setting up a new server for tests. In the meantime, would it be possible for you @guruevi and @bbgeek17 to run the same ioping command and give me your results ? Blockbridge claims "low latency", but I could not find any numbers. And since the increase is pretty significant, a decrease of a couple of % would not be enough.

Is there a known way to investigate deeper ? Contact the KVM devs maybe ? I'm not sure how to do that...

We are looking at the possibility to use CT instead of kvm, but lxc comes with it's own quirks, and we would need to test that properly.

guruevi · Oct 7, 2024

Using Ceph in the VM:
min/avg/max/mdev = 748.4 us / 875.9 us / 1.50 ms / 169.2 us
in the Host:
min/avg/max/mdev = 738.8 us / 939.8 us / 2.30 ms / 282.0 us

Using ZFS on the host EXT4 in the VM
min/avg/max/mdev = 128.3 us / 213.6 us / 907.0 us / 209.7 us
Using ZFS on the host (native file)
min/avg/max/mdev = 40.8 us / 47.4 us / 57.4 us / 5.05 us

Note that ZFS has lots of overhead in this 'worst case scenario' and this is a very simplistic benchmark. I have been able to reach 200k+ IOPS on my VMs with proper benchmarking (deeper queue depths etc) which real applications would be able to do. A single thread, single writer, 1QD, there is a lot that needs to be translated and flushed on every call and I don't see 100-150 microseconds being a huge deal.

bbgeek17 · Oct 9, 2024

Here's the data reported in our low-latency KB article. The hardware and software are now previous generation, but the results are still valid. Performance might be slightly faster on more modern hardware.
All measurements are taken at the block device in the guest or on the host using FIO (not ioping). This is NVMe/TCP on 25G networking.
512B:

host: 20.5us
guest (optimized) 31.6us
guest (non-optimized) 52.4us

4K:

host: 22.3us
guest (optimized) 33.5us
guest (non-optimized) 54.1us

8K:

host: 24.7us
guest (optimized) 36.7us
guest (non-optimized) 57.9us

16K

host: 36.9us
guest (optimized) 42.1us
guest (non-optimized) 69.5us

32K

host: 42.9us
guest (optimized) 50.0us
guest (non-optimized) 77.9us

Summary: A guest can achieve QD1 I/O latency within roughly 10 microseconds of bare metal by optimizing both the host and guest. But, hardware matters!

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

PatrickD25 · Dec 10, 2024

Took us a bit of time, but we found a solution to the issue.

First, we added "processor.max_cstate=1 intel_idle.max_cstate=1" to the Host. It helped a lot.

Second, we installed tuned, and set the host profile to virtual-host and guests to virtual-guest.

Third, for our servers with Perc RAID controllers with a battery backed cache, the best settings are :
* Direct Sync
* IO_uring
* SSD emulation
* Virtio SCSI single
* IO thread

The use of XFS also makes things a bit faster for MySQL.

With these settings, we can reach very close to hardware performance.

Thanks for the assistance.

mfpck · Jan 24, 2025

Hi,

fundamentals

M

Thread 'HW RAID (dell/lsi) SETTINGS SSD ?'

Jan 15, 2025

Hello,

Historical I learned that write back and read ahead are the way to go with an bbu.

Because I see a few performance irregularities so I was reading a lot and it may seem that the recommend settings are write-through and no read ahead when using ssds ?

On top of that if I change the settings this way it seem to activate lsi/dells fast path feature....

Running here dell poweredge with enterprise ssds using h730 raid controller currently not read ahead and write back.

Does aynbody benchmarkt it or is able to clarify ?

Let' talk and test about it ;-)

Thx & Best

same boat here, mongodb on Windows ;-(

M

Thread 'Bloody windows performance issues'

Jan 14, 2025

Hi,

Running pve on a dell r330 with an hw. h730 and 8 enterprise ssds as a raid 10 (no read ahead and write back) - Xfs, lvm:thin and raw disk images.

For my Windows guests I followed this guide;
https://pve.proxmox.com/wiki/Windows_10_guest_best_practices

and this:
https://pve.proxmox.com/wiki/Performance_Tweaks

Btw. Linux guests are fine even with default settings.

Because I observer general performance issues on Windows guests but basicaly regarding disk operations I may change the follwoing settings:
hw raid to write through = fastpath ?
virt harddisk to default, so...

What are your hw. raid volume settings, also write thorugh and no read ahead (fastpath) ?

Greets and Best

Search

Search

LVM disk latency issue

PatrickD25

New Member

Max Carrara

Well-Known Member

PatrickD25

New Member

Max Carrara

Well-Known Member

PatrickD25

New Member

PatrickD25

New Member

Max Carrara

Well-Known Member

Max Carrara

Well-Known Member

bbgeek17

Distinguished Member

guruevi

Well-Known Member

PatrickD25

New Member

guruevi

Well-Known Member

bbgeek17

Distinguished Member

PatrickD25

New Member

mfpck

New Member

Thread 'HW RAID (dell/lsi) SETTINGS SSD ?'

Thread 'Bloody windows performance issues'

We value your privacy