LVM disk latency issue

PatrickD25

New Member
May 13, 2024
Context: We want to run MySQL databases on Proxmox (KVM).
Problem: MySQL is quite slow on KVM. Compared to native or Xen, latency is up to 5x higher.
Setup: 3-node Proxmox cluster. On one node, we added a dedicated SSD for MySQL and use LVM to expose it to the VM.

VM disk info :
scsi3: vm-data-lvm:vm-104-disk-2,iothread=1,size=150G
scsihw: virtio-scsi-single
cache: no-cache
Thick provisioning.

Tool used to test :
ioping -S64M -L -s4k -W -c 10 .
Latency results in the VM on XFS (ext4 averages about 200 us higher):
min/avg/max/mdev = 562.4 us / 663.2 us / 742.1 us / 51.1 us

Same LV, mounted on the host :
min/avg/max/mdev = 120.5 us / 149.5 us / 183.3 us / 20.9 us

Tests in a Xen VM are about as fast as native.

I was expecting something like 5% to 10% higher latency, but this is a bit extreme.
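To put a number on the gap, here is a minimal Python sketch that parses the two ioping summary lines quoted above and compares the averages. The function name `parse_ioping_summary` is made up for illustration; it only assumes the `min/avg/max/mdev = ...` format shown in this post.

```python
import re

# Multipliers to convert ioping's time units to microseconds.
_UNIT_TO_US = {"ns": 1e-3, "us": 1.0, "ms": 1e3, "s": 1e6}


def parse_ioping_summary(line):
    """Parse a 'min/avg/max/mdev = ...' line into a dict of microseconds."""
    _, _, values = line.partition("=")
    fields = re.findall(r"([\d.]+)\s*(ns|us|ms|s)\b", values)
    if len(fields) != 4:
        raise ValueError(f"unexpected ioping summary: {line!r}")
    names = ("min", "avg", "max", "mdev")
    return {n: float(v) * _UNIT_TO_US[u] for n, (v, u) in zip(names, fields)}


# The two result lines from this post (VM on XFS vs. same LV on the host).
vm = parse_ioping_summary("min/avg/max/mdev = 562.4 us / 663.2 us / 742.1 us / 51.1 us")
host = parse_ioping_summary("min/avg/max/mdev = 120.5 us / 149.5 us / 183.3 us / 20.9 us")

ratio = vm["avg"] / host["avg"]
extra_us = vm["avg"] - host["avg"]
print(f"VM avg latency is {ratio:.1f}x the host's ({extra_us:.1f} us extra per request)")
```

With these inputs the average in the VM is a bit over 4x the host's, which is where the "up to 5x" figure comes from.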

For info : We tested a LOT of settings. XFS/EXT4, cache settings, but could not get close to native speeds.
We also tested with sysbench using oltp_write_only and the results were similar.

Is this something we should expect from KVM?
Is anyone seeing better results?

Not knowing whether this is normal is a big part of the problem here. Any help would be greatly appreciated!
 
Could you perhaps share your VM's config? You can do so via the qm config <vmid> command.
 
boot: order=scsi0;ide2;net0
cores: 8
cpu: host
ide2: local:iso/debian-12-amd64-20240403.iso,media=cdrom,size=1225200K
memory: 65536
meta: creation-qemu=8.1.5,ctime=1717009155
name: <server>.<domain>.org
net0: virtio=BC:24:11:8C:C9:4F,bridge=vmbr0,tag=281
numa: 0
ostype: l26
scsi0: vm-data:vm-104-disk-0,iothread=1,size=250G
scsi1: vm-data-lvm:vm-104-disk-0,iothread=1,size=500G
scsi2: vm-data-lvm:vm-104-disk-1,iothread=1,size=150G
scsi3: vm-data-lvm:vm-104-disk-2,iothread=1,size=150G
scsihw: virtio-scsi-single
smbios1: uuid=9cf95732-a0f7-4945-8e45-566ffe2b77ef
sockets: 2
unused0: vm-data:vm-104-disk-1
vmgenid: f16debc8-340b-4c60-81da-efb84bbfb2d7
 
Hmm, I don't see anything wrong with your config. I'm curious though, is there a reason for you to have two sockets? Do you happen to have two NUMA nodes on your host? If so, you can try activating NUMA for your VM.

If that's not the case though, can you try setting the number of sockets to one? Unless you really need two sockets, that is. It shouldn't make a performance difference, but since you've already been pretty thorough, it's worth fiddling with IMO.

Could you also share some more details about your underlying storage? Do you have HW RAID or anything of the sort?

Also, do you spot anything in your VM's or host's logs? dmesg, journalctl -x, etc.
 
The SSD is behind a hardware RAID controller (Dell server). But the tests were done using the same LV, once mounted in the VM and once mounted directly on the host.
As for NUMA, we did some tests and saw no improvements. It would probably make a difference at high network IO.

The same mount point in an LXC container proves to be as fast as the host, but we would prefer using KVM.

What about on your side? Do you have access to a Proxmox host with a local LV on an SSD drive? Can you see if you get the same difference between VM and host?
 

Very interesting, thanks for letting me know!



I'll see what I can do. I'll report back once I have some results.

Since you said you already tried changing the cache settings, have there been any noticeable differences between the different cache modes (none, writethrough, writeback, directsync)?

Also, have you tried something other than LVM yet, e.g. ZFS? Do note that ZFS isn't meant to run on top of HW RAID, so it's best to set your HW RAID card to JBOD/HBA mode if you do decide to try it (and then recreate your RAID layout in ZFS, naturally). I'm not sure if you're experienced with ZFS, but there's a recordsize option that you can change for individual datasets (volblocksize for zvols, which is what PVE uses for VM disks), so you can e.g. set it to 16K to match the (default) InnoDB page size of MySQL. That way you should theoretically be able to eke out some extra performance. Just an idea I wanted to share.
 

So, I've done some extensive testing on one of our servers (wiped it beforehand and set up PVE from scratch) and my findings pretty much align with yours.

What I tested were the following configurations on a pretty decent NVMe drive with fio:

Host

pve-manager/8.2.7/3e0176e6bb2ade3b (running kernel: 6.8.12-2-pve)
  1. Bare ext4
  2. ext4 on LVM
  3. Bare xfs
  4. xfs on LVM
  5. ZFS (single disk, dataset with primarycache=metadata, compression=on, recordsize=128K)
Guest

6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Code:
# qm config 100
agent: 1
boot: order=scsi0;ide2;net0
cores: 8
cpu: host
ide2: none,media=cdrom
memory: 32768
meta: creation-qemu=9.0.2,ctime=1727455008
name: deb-bench-test
net0: virtio=BC:24:11:9D:4E:B8,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-100-disk-0,iothread=1,size=32G,ssd=1
scsi1: lvm-bench-vm:vm-100-disk-0,cache=writeback,iothread=1,size=200G,ssd=1
scsi2: zfs-bench-vm-single:vm-100-disk-0,cache=writeback,iothread=1,size=200G,ssd=1
scsi3: lvm-bench-vm:vm-100-disk-1,cache=writeback,iothread=1,size=200G,ssd=1
scsi4: zfs-bench-vm-single:vm-100-disk-1,cache=writeback,iothread=1,size=200G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=fa27540b-2ff2-470a-9fdc-e52ad4301f97
sockets: 1
  1. ext4 on LVM (non-thin) storage
  2. xfs on LVM (non-thin) storage
  3. ext4 on ZFS (primarycache=all, compression=on, volblocksize=16K)
  4. xfs on ZFS (primarycache=all, compression=on, volblocksize=16K)


For guests in general, IOPS (and bandwidth) are down by quite a bit for smaller reads and writes compared to the same FS on the host; for larger reads and writes, however, both IOPS and bandwidth are about equal to what the same filesystem achieves on the host. This is kind of what I'd expect anyway.

What's also to be expected is that ZFS performance is tanking quite a bit compared to using it on the host, as there's no way for its ARC to properly cache anything in the benchmarks I've made. There's probably also some minor overhead due to compression. (And it's ZFS on a single disk, which is not something you usually wanna do anyways).

Either way, I digress; I've found that IO latencies are quite a bit higher than on the host itself, which aligns with your findings. For comparison:

ioping -S64M -L -s4k -W -q on host (counts: 10, 100, 1000):
Code:
--- /mnt/bench/ext4 (ext4 /dev/nvme0n1p1 245.0 GiB) ioping statistics ---
9 requests completed in 654.2 us, 36 KiB written, 13.8 k iops, 53.7 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 69.1 us / 72.7 us / 78.9 us / 3.07 us

--- /mnt/bench/ext4 (ext4 /dev/nvme0n1p1 245.0 GiB) ioping statistics ---
99 requests completed in 7.68 ms, 396 KiB written, 12.9 k iops, 50.4 MiB/s
generated 100 requests in 1.65 min, 400 KiB, 1 iops, 4.04 KiB/s
min/avg/max/mdev = 68.0 us / 77.5 us / 102.4 us / 7.16 us

--- /mnt/bench/ext4 (ext4 /dev/nvme0n1p1 245.0 GiB) ioping statistics ---
999 requests completed in 79.9 ms, 3.90 MiB written, 12.5 k iops, 48.8 MiB/s
generated 1 k requests in 16.7 min, 3.91 MiB, 1 iops, 4.00 KiB/s
min/avg/max/mdev = 43.6 us / 80.0 us / 143.2 us / 11.7 us

ioping -S64M -L -s4k -W -q in VM (counts: 10, 100, 1000):
Code:
--- /mnt/bench/lvm-ext4 (ext4 /dev/sdb1 195.8 GiB) ioping statistics ---
9 requests completed in 3.72 ms, 36 KiB written, 2.42 k iops, 9.46 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 286.4 us / 413.0 us / 476.2 us / 55.5 us

--- /mnt/bench/lvm-ext4 (ext4 /dev/sdb1 195.8 GiB) ioping statistics ---
99 requests completed in 41.8 ms, 396 KiB written, 2.37 k iops, 9.24 MiB/s
generated 100 requests in 1.65 min, 400 KiB, 1 iops, 4.04 KiB/s
min/avg/max/mdev = 233.2 us / 422.7 us / 571.7 us / 47.5 us

--- /mnt/bench/lvm-ext4 (ext4 /dev/sdb1 195.8 GiB) ioping statistics ---
999 requests completed in 418.8 ms, 3.90 MiB written, 2.38 k iops, 9.32 MiB/s
generated 1 k requests in 16.7 min, 3.91 MiB, 1 iops, 4.00 KiB/s
min/avg/max/mdev = 182.7 us / 419.2 us / 658.9 us / 55.4 us
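The IOPS figures in these outputs follow directly from the average latency: ioping issues one request at a time (queue depth 1), so throughput is simply the inverse of the per-request latency. A small Python sketch of that arithmetic, using the 1000-request averages from above (the helper name `qd1_iops` is just for illustration):

```python
def qd1_iops(avg_latency_us):
    """At queue depth 1 only one request is in flight, so IOPS = 1 / latency."""
    return 1_000_000 / avg_latency_us


# Average latencies taken from the 1000-request runs above.
for label, avg_us in [("host ext4", 80.0), ("VM ext4 on LVM", 419.2)]:
    print(f"{label}: {avg_us} us avg -> {qd1_iops(avg_us):,.0f} IOPS")
```

This lines up with the roughly 12.5 k and 2.38 k iops that ioping reports for the two runs. It also shows why a single-threaded, synchronous writer feels the added latency so directly, while workloads with deeper queues can hide much of it.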

I believe that this is very much due to the virtualization overhead; though, specifically for latency, I cannot say if this has always been the case or not, perhaps some of the more experienced users can weigh in here. In my personal workloads latency was never really an issue (low latency was never really a requirement / I never really hit any points where latency ended up mattering).

I'm not sure what Xen does in particular to make it that fast; perhaps it just passes any IO through directly to LVM, or it does whatever it wants and tells you it's done with whatever operation you give it (kind of like setting cache=unsafe on a disk in PVE, perhaps?). I think the best bet would be to just use a CT if you can, if you really want to minimise the overhead. Alternatively, you could also try mounting some kind of network storage inside the VM, if you happen to have one that's particularly fast.

Also, there are a couple extra resources I dug up that might be of interest to you:
You might want to look into optimising MySQL itself, if you haven't already. Specifically, if there's a way to reduce IOPS by using larger RW ops instead of a bunch of smaller ones, you might want to give that a shot.

I hope that helps!
 
Hi PatrickD25,

Here's where I would start:

- Determine what NUMA node your HBA is connected to.
- Set an affinity policy to bind your VCPUs to said NUMA node.
- Make sure you have at least one free physical CPU core on the NUMA node.
- Ensure that the disk scheduler for your block device on bare metal is set to "none" (called "noop" on older, non-multiqueue kernels).
- Ensure that the disk scheduler for your QEMU block device in the guest is set to "none" (or "noop") as well.
- Test the various rq_affinity settings for the block device on bare metal.
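To check the scheduler items in the list above, the active scheduler can be read from sysfs, where the kernel marks it with square brackets. Below is a minimal Python sketch, assuming the usual `/sys/block/<device>/queue/scheduler` layout; the function names are made up for illustration. It works the same way on the host and inside the guest.

```python
from pathlib import Path


def active_scheduler(sysfs_text):
    """Return the scheduler shown in brackets, e.g. '[none] mq-deadline' -> 'none'."""
    tokens = sysfs_text.split()
    for token in tokens:
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some devices list a single entry without brackets.
    return tokens[0] if len(tokens) == 1 else None


def report_schedulers(sys_block=Path("/sys/block")):
    """Print the active I/O scheduler for every block device."""
    for sched_file in sorted(sys_block.glob("*/queue/scheduler")):
        device = sched_file.parts[-3]
        print(f"{device}: {active_scheduler(sched_file.read_text())}")


if __name__ == "__main__":
    report_schedulers()
```

Changing the scheduler is done by writing the desired name into the same file as root; that part is left out here on purpose, since it alters system state.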


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Xen (at least the paid version) does cache reads and if I'm not mistaken write caching too, it kind of assumes you are on 'professional' (proprietary) hardware with BBU or networked storage.

There are differences in speed between a raw device and an LVM device, and between thin and thick provisioning. It is hard to make an exact comparison between products without also evaluating all the underlying assumptions of each 'default' installation, some of which may not seem related to storage (e.g. C-states, interrupt handling, buffers etc.) and which may improve certain workloads in some conditions and make others worse (as in the case of CPU exploits, which Xen "works around" because actually fixing them requires massive changes).

I personally saw huge improvements by simply disabling C-states in the UEFI of our servers as this article states, and the power consumption isn't all that different:
https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/
If you trust your guests, disable CPU exploit mitigations, that will also boost performance.
 
I am setting up a new server for tests. In the meantime, would it be possible for you @guruevi and @bbgeek17 to run the same ioping command and give me your results ? Blockbridge claims "low latency", but I could not find any numbers. And since the increase is pretty significant, a decrease of a couple of % would not be enough.

Is there a known way to investigate deeper ? Contact the KVM devs maybe ? I'm not sure how to do that...

We are looking at the possibility of using CTs instead of KVM, but LXC comes with its own quirks, and we would need to test that properly.
 
Using Ceph, in the VM:
min/avg/max/mdev = 748.4 us / 875.9 us / 1.50 ms / 169.2 us
Using Ceph, on the host:
min/avg/max/mdev = 738.8 us / 939.8 us / 2.30 ms / 282.0 us

Using ZFS on the host, ext4 in the VM:
min/avg/max/mdev = 128.3 us / 213.6 us / 907.0 us / 209.7 us
Using ZFS on the host (native file):
min/avg/max/mdev = 40.8 us / 47.4 us / 57.4 us / 5.05 us

Note that ZFS has lots of overhead in this 'worst case scenario' and this is a very simplistic benchmark. I have been able to reach 200k+ IOPS on my VMs with proper benchmarking (deeper queue depths etc.), which is what real applications would be able to do. With a single thread, a single writer, and QD1, there is a lot that needs to be translated and flushed on every call, and I don't see 100-150 microseconds being a huge deal.
 
Here's the data reported in our low-latency KB article. The hardware and software are now previous generation, but the results are still valid. Performance might be slightly faster on more modern hardware.
All measurements are taken at the block device in the guest or on the host using FIO (not ioping). This is NVMe/TCP on 25G networking.
512B:
  • host: 20.5 us
  • guest (optimized): 31.6 us
  • guest (non-optimized): 52.4 us
4K:
  • host: 22.3 us
  • guest (optimized): 33.5 us
  • guest (non-optimized): 54.1 us
8K:
  • host: 24.7 us
  • guest (optimized): 36.7 us
  • guest (non-optimized): 57.9 us
16K:
  • host: 36.9 us
  • guest (optimized): 42.1 us
  • guest (non-optimized): 69.5 us
32K:
  • host: 42.9 us
  • guest (optimized): 50.0 us
  • guest (non-optimized): 77.9 us
Summary: A guest can achieve QD1 I/O latency within roughly 10 microseconds of bare metal by optimizing both the host and guest. But, hardware matters!
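For reference, the added latency per block size can be worked out from the figures above. This is just arithmetic on the reported numbers, sketched in Python; the names `RESULTS` and `added_latency` are illustrative.

```python
# (block size, host us, optimized guest us, non-optimized guest us), as listed above.
RESULTS = [
    ("512B", 20.5, 31.6, 52.4),
    ("4K", 22.3, 33.5, 54.1),
    ("8K", 24.7, 36.7, 57.9),
    ("16K", 36.9, 42.1, 69.5),
    ("32K", 42.9, 50.0, 77.9),
]


def added_latency(rows):
    """Map block size -> (optimized, non-optimized) guest latency added over the host."""
    return {
        size: (round(opt - host, 1), round(raw - host, 1))
        for size, host, opt, raw in rows
    }


for size, (opt, raw) in added_latency(RESULTS).items():
    print(f"{size:>4}: optimized +{opt} us, non-optimized +{raw} us")
```

The optimized guest adds between about 5 and 12 us depending on block size, while the non-optimized guest adds a fairly constant 32 to 35 us.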


 
Took us a bit of time, but we found a solution to the issue.

First, we added "processor.max_cstate=1 intel_idle.max_cstate=1" to the kernel command line on the host. It helped a lot.
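Kernel parameters like these only take effect after a reboot, and how they are added depends on the bootloader (on PVE typically GRUB or systemd-boot via proxmox-boot-tool). A quick way to confirm they are active is to look at `/proc/cmdline`. A minimal Python sketch, with `missing_params` as an illustrative helper name:

```python
# The two parameters mentioned above.
REQUIRED = ("processor.max_cstate=1", "intel_idle.max_cstate=1")


def missing_params(cmdline, required=REQUIRED):
    """Return the required kernel parameters that are absent from a command line."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        missing = missing_params(f.read())
    print("all set" if not missing else f"missing: {' '.join(missing)}")
```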

Second, we installed tuned, and set the host profile to virtual-host and guests to virtual-guest.

Third, for our servers with PERC RAID controllers with a battery-backed cache, the best settings are:
* Cache: Direct sync
* Async IO: io_uring
* SSD emulation
* SCSI controller: VirtIO SCSI single
* IO thread
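Expressed as disk options, the settings above correspond to cache=directsync, aio=io_uring, ssd=1 and iothread=1 on the disk, plus scsihw set to virtio-scsi-single. The Python sketch below only builds and prints a `qm set` command for review and does not run anything; the VM ID, slot, and volume are taken from the config earlier in this thread, and the helper names are made up for illustration.

```python
import shlex


def tuned_disk_spec(volume, size):
    """Build a PVE disk option string with the settings from the list above."""
    options = {
        "cache": "directsync",  # Direct sync
        "aio": "io_uring",      # io_uring
        "ssd": "1",             # SSD emulation
        "iothread": "1",        # IO thread
        "size": size,
    }
    return volume + "," + ",".join(f"{key}={value}" for key, value in options.items())


def qm_set_command(vmid, slot, volume, size):
    """Return (not execute) the qm command that would apply the settings."""
    return shlex.join([
        "qm", "set", str(vmid),
        "--scsihw", "virtio-scsi-single",
        f"--{slot}", tuned_disk_spec(volume, size),
    ])


print(qm_set_command(104, "scsi3", "vm-data-lvm:vm-104-disk-2", "150G"))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; the same options can also be set per disk in the web UI.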

The use of XFS also makes things a bit faster for MySQL.

With these settings, we can reach very close to hardware performance.

Thanks for the assistance.
 
