ZFS+VM/LXC disk performance benchmarking, part 3 (ZFS in LXC=weird?)

sevimo

Mar 2, 2025
Continuing my testing of disk performance (see previous parts at https://forum.proxmox.com/threads/zfs-vm-lxc-disk-performance-benchmarking-part-1-zfs-slow.166701/ and https://forum.proxmox.com/threads/z...nchmarking-part-2-zfs-in-vm-very-slow.166705/), so I won't repeat the background.

I saved the best for last: in part 3 I'll focus on tests in LXC that I simply cannot explain. So here we go :)

Test 1 is the same as in the last thread, 10G of 4k writes at iodepth=1 with sync=1:

IOPS:
raw partition, host:    84,000
zvol, host:             18,000
zvol, VM:                6,800
zvol, VM, sync=0:       10,000
zvol, LXC:              18,000
zvol, LXC, sync=0:     145,000
LVM, LXC:               55,000

Code:
fio --filename=/dev/zvol/b-zmirror1-nvme01/vm-101-disk-0 --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 --numjobs=1 --sync=1

OK, so sync=1 with zvol in LXC is still very slow, as already mentioned. LVM in LXC does much better, with reasonable numbers. But LXC on zvol with sync=0 shows performance that is almost double the raw performance on the host. WTH?! Not exactly realistic, so presumably some kind of caching is in play, but it only manifests in LXC: not on the host and not under VM.
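One way to check the caching hypothesis would be to watch what actually reaches the pool while fio runs in the container; if the device-level write rate is far below fio's reported rate, something above the pool is absorbing the writes. A sketch I haven't run yet (pool name as in my setup):
Code:
# on the host, in parallel with the fio run inside the container
zpool iostat -v b-zmirror1-nvme01 1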

If it were just extra performance under unsafe conditions (no sync), I wouldn't mind. But some other tests show massive _declines_ in performance under LXC only, and not under VM or directly on the host.

Test 2 is similar to test 1, but with random 4k access and iodepth=64 instead of 1:
Code:
fio --filename=/dev/zvol/b-zmirror1-nvme01/vm-101-disk-0 --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=randwrite --bs=4k --direct=1 --iodepth=64 --numjobs=1 --sync=1

This is where it gets really weird:

IOPS:
raw partition, host:   255,000
zvol, host:             82,000
zvol, host, sync=0:     82,000
LVM, host:             226,000
LVM, VM:               140,000
zvol, VM:               64,000
LVM, LXC:              130,000
zvol, LXC:               3,000 ?!

So higher iodepth generates more performance: LVM on the host is almost as good as raw, and zvol on the host is still notably worse, but comparatively better than in test 1, with "only" about a 3x deterioration. Interestingly, sync on or off no longer impacts performance at all. LVM on the host shows very little degradation, and even under VM it is pretty decent, at less than 2x degradation. zvol under VM did worse than on the host, but not as dramatically as in part 2, so that is also kind of expected.

So far, so good.

So LVM under LXC did a little worse than under VM, which is a bit surprising, but it's close enough to call it even. However, zvol under LXC in this test makes no sense to me. Barely 3k IOPS, when the host can do 255k raw?? That's 85x slower! This can't be right, something is off here. It's still almost 20x slower than even the same zvol under VM. I have similar results for random read workloads, so perhaps the issue is with deep iodepths? But the iodepth=1 test 1 shows other weird patterns, see above.
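If I get to it, one variation that might show whether it's the async submission path that collapses under LXC would be to generate the same parallelism with 64 synchronous jobs instead of libaio at iodepth=64. A sketch (target path is illustrative):
Code:
fio --filename=/root/fio-test.dat --ioengine=psync --size=10G --time_based --runtime=60 --group_reporting --name=cc1 --rw=randwrite --bs=4k --direct=1 --numjobs=64 --sync=1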

I am not quite sure what to make of these results. It's somewhat similar to the issue in my previous thread (https://forum.proxmox.com/threads/sync-writes-to-zfs-zvol-disk-are-not-sync-under-pve.163066/), which turned out to be an actual regression in ZFS code that has since been fixed. Perhaps these results point to something similar? Or is there another explanation? I am quite perplexed.


Config data:
VM uname -a (Debian 12 live):
Code:
Linux fiotest 6.1.0-29-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.123-1 (2025-01-02) x86_64 GNU/Linux

LXC uname -a (Debian 12):
Code:
Linux fiopct 6.8.12-10-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-10 (2025-04-18T07:39Z) x86_64 GNU/Linux

pveversion -v
Code:
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8: 6.8.12-10
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph: 19.2.1-pve3
ceph-fuse: 19.2.1-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

ZFS properties:
Code:
Exceeded message limit, but nothing overly exciting here: sync=standard, compression=on
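
For reference, the relevant properties can be pulled with something like (pool/volume name as in my setup):
Code:
zfs get sync,compression,volblocksize,primarycache b-zmirror1-nvme01/vm-101-disk-0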

cat /etc/pve/qemu-server/101.conf
Code:
agent: 1
boot: order=ide2;net0
cores: 4
cpu: x86-64-v2-AES
ide2: local:iso/debian-live-12.9.0-amd64-standard.iso,media=cdrom,size=1499968K
memory: 2048
meta: creation-qemu=9.0.2,ctime=1740711583
name: fiotest
net0: virtio=BC:24:11:0F:05:39,bridge=vmbr0,firewall=1,tag=11
numa: 0
ostype: l26
scsi0: b-lvm-thk-nvme01:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=11G,ssd=1
scsi1: b-zmirror1-nvme01:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=11G,ssd=1
scsi2: ztest:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=111G,ssd=1
scsi3: b-lvm-thn-nvme01:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=111G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7e99887f-d252-43b2-9d91-7027cb2f84c8
sockets: 1
vmgenid: 7b4eb2f9-ab24-4264-ac43-d9602341249b

cat /etc/pve/lxc/105.conf
Code:
arch: amd64
cores: 4
features: nesting=1
hostname: fiopct
memory: 2048
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=BC:24:11:7D:CF:EC,ip=dhcp,tag=11,type=veth
ostype: debian
rootfs: b-zmirror1-nvme01:subvol-105-disk-0,mountoptions=discard;lazytime,size=120G
swap: 0
unprivileged: 1
 
Just a minor technicality:
What do you mean by zvol in LXC? Do you mean ZFS dataset in LXC?
(Zvol is a block device, which needs another filesystem to be used with files, e.g. ZVOL+ext4)
 
Just a minor technicality:
What do you mean by zvol in LXC? Do you mean ZFS dataset in LXC?
(Zvol is a block device, which needs another filesystem to be used with files, e.g. ZVOL+ext4)
You're right; as I mentioned, my original plan was to benchmark as close to the same setup as possible across host/VM/LXC, but I was unable to pass a writable block device to LXC. I understand that it is possible (is it?), but I didn't spend much time on this, given that ultimately most containers I am interested in will be accessing the root disk only (so indeed a dataset/subvol in ZFS), not mounting a separate block device. So for LXC testing, fio uses a file on the root disk.
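In other words, the LXC runs used the same fio parameters against a file target, along the lines of (exact path illustrative):
Code:
fio --filename=/root/fio-test.dat --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 --numjobs=1 --sync=1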

While this is indeed a difference that will account for some of the performance impact, I wouldn't expect it to make _that much_ of a difference. Am I wrong? If someone can tell me how to pass a block device from the host (e.g., /dev/zvol/b-zmirror1-nvme01/vm-101-disk-0) to LXC so that fio can run on it directly, I can rerun the tests and determine whether it's indeed a zvol vs. dataset issue, or still an LXC issue. I don't think mount points work with raw block devices in /dev; I tried this and was unable to start the LXC with the MP in the config...
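The only approach I can think of would be raw LXC keys in /etc/pve/lxc/105.conf, which I haven't managed to verify; the zvol's major:minor would come from ls -l /dev/zd*, and an unprivileged container may still be refused access due to the uid mapping:
Code:
# untested sketch; 230:0 stands in for the zvol's actual major:minor
lxc.cgroup2.devices.allow: b 230:0 rwm
lxc.mount.entry: /dev/zd0 dev/zd0 none bind,create=file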
 
While this is indeed a difference that will account for some of the performance impact, I wouldn't expect it to make _that much_ of a difference. Am I wrong?
Read and write amplification is a huge difference for a 4K block stored with a 16K volblocksize.
If someone can tell me how to pass a block device from the host (e.g., /dev/zvol/b-zmirror1-nvme01/vm-101-disk-0) to LXC so that fio can run on it directly, I can rerun the tests and determine whether it's indeed a zvol vs. dataset issue, or still an LXC issue. I don't think mount points work with raw block devices in /dev; I tried this and was unable to start the LXC with the MP in the config...
This is only scientifically relevant: no one in the real world would use such a setup, and it is not directly possible in PVE, which should say enough.
 
Read and write amplification is a huge difference for a 4K block stored with a 16K volblocksize.
I assume that would only be material if the disk itself is the bottleneck. The disk is clearly capable of much higher speeds, as evidenced by the raw-partition performance on the host. Just for kicks, I created additional zvols with volblocksize=4k and volblocksize=128k and reran the test on the host. The differences from 16k are fairly insignificant, I'd say within the margin of error. If anything, the 128k volblocksize got slightly better IOPS at 20k, but that's close enough to the 16k case of 18k IOPS, while the device itself is capable of at least 84k IOPS. 18k vs. 20k is the kind of impact I'd expect from that sort of tuning (measurable, but not necessarily material).
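For reference, the extra zvols were created along these lines (names illustrative; volblocksize can only be set at creation time):
Code:
zfs create -s -V 11G -o volblocksize=4k b-zmirror1-nvme01/fio-vbs4k
zfs create -s -V 11G -o volblocksize=128k b-zmirror1-nvme01/fio-vbs128k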
 
I assume that would only be material if the disk itself is the bottleneck.
No, if the data is uncached and you want to write synchronously, you need to read a 16K block, change 4K, and write the 16K back. That is per se much slower than just writing a 4K block. In most tests the block is already in cache, so you will not see a big difference between writing 4K or 16K to disk.

To make benchmarking even more complicated, you would need to disable block caching entirely (primarycache=metadata) in order not to test the ARC. Otherwise you would not have a real-world scenario but would be benchmarking the cache. Getting scientifically correct values out of benchmarks is complicated in general and needs a lot of multi-tier analysis. This is a very deep rabbit hole.
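Concretely, something like this (dataset name illustrative):
Code:
zfs set primarycache=metadata b-zmirror1-nvme01/vm-101-disk-0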
 
No, if the data is uncached and you want to write synchronously, you need to read a 16K block, change 4K, and write the 16K back. That is per se much slower than just writing a 4K block. In most tests the block is already in cache, so you will not see a big difference between writing 4K or 16K to disk.

To make benchmarking even more complicated, you would need to disable block caching entirely (primarycache=metadata) in order not to test the ARC. Otherwise you would not have a real-world scenario but would be benchmarking the cache. Getting scientifically correct values out of benchmarks is complicated in general and needs a lot of multi-tier analysis. This is a very deep rabbit hole.

Just for science, I disabled the ARC, and the results (on the host) stayed pretty much exactly the same. This is not really surprising: most of the testing here is sync writes, and ARC or not, sync writes return only after physically writing to the ZIL (which happens to be on the same drive here).

Everything that you're describing makes sense, and definitely plays a role when disk performance is the limiting factor. In my tests it most definitely is not, and that's the observation I am making here: something else limits ZFS performance in these tests to a level far below what the physical device can do. I am not sure what that is.
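If I dig further, one thing worth looking at would be per-vdev latency and request-size histograms during the runs, to see where the time actually goes. A sketch (pool name as in my setup):
Code:
zpool iostat -w b-zmirror1-nvme01 5   # latency histograms
zpool iostat -r b-zmirror1-nvme01 5   # request size histograms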