ZFS+VM/LXC disk performance benchmarking, part 1 (ZFS=slow?)

sevimo

New Member
Mar 2, 2025
I have resumed disk performance testing (see also the initial thread at https://forum.proxmox.com/threads/sync-writes-to-zfs-zvol-disk-are-not-sync-under-pve.163066/), and stumbled on some more non-intuitive results. I'll group them into separate threads, as the symptoms are different.

So overall I am running a bunch of different fio tests, comparing on-host performance with the same tests inside a VM and an LXC container. The focus was on ZFS, but I also ran some tests using LVM and thin LVM to try to pinpoint issues. Most fio tests are done against block devices, except the LXC ones, which write to a file on the root disk (backed by such a block device). I was unable to pass a block device directly to a container, though I did not try very hard. Tests are done on an Intel P4510 NVMe SSD (4TB).

So in part 1 I'll focus on tests that seem to show that ZFS zvols are unreasonably slow, even without any virtualization.

Test 1 is the same as in the last thread, 10G of 4k writes at iodepth=1 with sync=1:

IOPS:
raw partition, host: 84,000
zvol, host: 18,000
zvol, VM: 6,800
zvol, LXC: 18,000
LVM, host: 84,000
zvol, host, sync=0: 48,000

Code:
fio --filename=/dev/zvol/b-zmirror1-nvme01/vm-101-disk-0 --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 --numjobs=1 --sync=1
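Since iodepth=1 serializes the IOs, these IOPS numbers map directly onto average per-IO latency (1e6 / IOPS microseconds). A quick back-of-the-envelope conversion of the raw/zvol/VM numbers above:

```shell
# At iodepth=1 each IO completes before the next is issued,
# so latency_us = 1e6 / IOPS.
for iops in 84000 18000 6800; do
  awk -v i="$iops" 'BEGIN { printf "%d IOPS -> %.1f us/IO\n", i, 1e6 / i }'
done
```

That is roughly 12 us per IO on the raw partition versus ~56 us per IO on the zvol (host) and ~147 us in the VM, which puts the extra per-IO cost in perspective.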

Let's ignore the VM and LXC numbers for now, and just look at performance on the host. What I see is that even in the absence of any virtualization, using zvols drops performance in such workloads by a factor of 5! I understand that ZFS is not exactly a speed demon, but that drop sounds excessive (and it gets worse under VMs). I don't see any unusually high CPU usage, so it doesn't seem to be CPU-bound.

Now, I am not sure whether this is how ZFS results are supposed to look? Perhaps it is just what it is, and I simply need to adjust my expectations of ZFS performance.
There is some indication that it might be the case: https://forum.proxmox.com/threads/high-iops-in-host-low-iops-in-vm.145268/
The IOPS numbers reported there are almost the same (though they didn't test a zvol on the host), so perhaps that's normal, or I should say expected? Just for reference, LVM runs at pretty much host speed under the same test conditions.
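For concreteness, here are the slowdown ratios implied by the IOPS table above (a rough calculation, using the numbers exactly as reported):

```shell
# Slowdown of zvol-backed tests relative to the raw partition on the host,
# using the IOPS figures from the table above.
awk 'BEGIN {
  printf "raw host vs zvol host: %.1fx\n", 84000 / 18000;
  printf "raw host vs zvol VM:   %.1fx\n", 84000 / 6800;
}'
```

So the host-side zvol penalty is already close to 5x, and the VM stacks up to over 12x versus the raw partition.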

Even with sync=0 the test runs almost twice as slow on ZFS compared to a raw partition (or LVM) with sync=1! Random read mode exhibits similar slowdowns, so it is not restricted to writes.

Bottom line: is a several-times slowdown on zvols with certain workloads expected?

Config data:
VM uname -a (Debian 12 live):
Code:
Linux fiotest 6.1.0-29-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.123-1 (2025-01-02) x86_64 GNU/Linux

LXC uname -a (Debian 12):
Code:
Linux fiopct 6.8.12-10-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-10 (2025-04-18T07:39Z) x86_64 GNU/Linux

pveversion -v
Code:
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8: 6.8.12-10
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph: 19.2.1-pve3
ceph-fuse: 19.2.1-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

ZFS properties:
Code:
Exceeded message limit, but nothing overly exciting here: sync=standard, compression=on

cat /etc/pve/qemu-server/101.conf
Code:
agent: 1
boot: order=ide2;net0
cores: 4
cpu: x86-64-v2-AES
ide2: local:iso/debian-live-12.9.0-amd64-standard.iso,media=cdrom,size=1499968K
memory: 2048
meta: creation-qemu=9.0.2,ctime=1740711583
name: fiotest
net0: virtio=BC:24:11:0F:05:39,bridge=vmbr0,firewall=1,tag=11
numa: 0
ostype: l26
scsi0: b-lvm-thk-nvme01:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=11G,ssd=1
scsi1: b-zmirror1-nvme01:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=11G,ssd=1
scsi2: ztest:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=111G,ssd=1
scsi3: b-lvm-thn-nvme01:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=111G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7e99887f-d252-43b2-9d91-7027cb2f84c8
sockets: 1
vmgenid: 7b4eb2f9-ab24-4264-ac43-d9602341249b

cat /etc/pve/lxc/105.conf
Code:
arch: amd64
cores: 4
features: nesting=1
hostname: fiopct
memory: 2048
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=BC:24:11:7D:CF:EC,ip=dhcp,tag=11,type=veth
ostype: debian
rootfs: b-zmirror1-nvme01:subvol-105-disk-0,mountoptions=discard;lazytime,size=120G
swap: 0
unprivileged: 1
 
Different issue, but for those interested in how CPU-intensive ZFS is, I ran a max sequential write (on the host, not sync'ed) like this:

Code:
fio --filename=/dev/mapper/b--lvm--thk--nvme01-vm--101--disk--0 --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=1M --direct=1 --iodepth=64 --numjobs=1 --sync=0

A raw partition and thick LVM had no issues maxing out disk bandwidth at ~2.7GB/s, and LVM's CPU utilization never went above 5%. ZFS was capped at 100% CPU and only reached about ~1.6GB/s. This is on '8 x Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (1 Socket)' - not exactly a high-performance CPU, but the difference in utilization is pretty stark. Nothing else was running on the host (pre- and post-test CPU utilization <1%).
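A rough efficiency comparison from those numbers, taking the reported utilization figures at face value (they are approximations from the post, not measured here):

```shell
# Throughput delivered per percentage point of CPU, from the figures above:
# LVM ~2700 MB/s at ~5% CPU, ZFS ~1600 MB/s at 100% CPU.
awk 'BEGIN {
  printf "LVM: %.0f MB/s per CPU%%\n", 2700 / 5;
  printf "ZFS: %.0f MB/s per CPU%%\n", 1600 / 100;
}'
```

Even allowing for imprecise utilization readings, that is more than an order of magnitude difference in CPU cost per byte written.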
 
ZFS has a lot of overhead because of all its features. It also requires a lot of sync writes, but your drives (guessing from another post) should be able to handle that. Personally, I don't mind a 50% performance loss in exchange for the checksumming and other features. Feel free not to use ZFS. Maybe test BTRFS, which has a lot of the same features? I'm considering switching myself.
 
My intent is not to dissuade anyone from using ZFS, and that includes not dissuading myself :) I like ZFS a lot, and will likely use it in many cases, as my workloads are not commonly constrained by disk IO. However, I was taken aback by the performance differences on _some_ workloads. As I haven't seen definitive statements like "sync'ed low-queue-depth workloads might be an order of magnitude slower on ZFS", I assumed (and still hope) that it is something I am doing wrong, and that it can be significantly improved. 50% I could accept, but I am looking at 5-10x in some cases. That's not something that can be solved by simply throwing better resources (faster disks/CPU) at it.

And BTRFS is probably the last system I would consider, given its history and stability issues (real or perceived) :) If what I am seeing for zvols is typical, the answer will be to use ZFS where disk IO is not a constraint at all (a lot of use cases), and where IO matters, to stick with something else - perhaps thick LVM for maximum performance; even thin LVM might be acceptable. But on a case-by-case basis.

I will also need to see whether this performance difference persists when downgrading to SATA SSDs, or whether it only shows up with very fast NVMe drives.
 
There have been YouTube videos about Linux not keeping up with lots of (event-driven) NVMe drives. I'm not sure what kind of ZFS vdev configuration you use; RAIDz1/2/3 is really unsuitable for VMs. But then again, your host-side performance tests don't even include VM overhead. Maybe you're on to something real that ZFS on Linux can fix.
 
The vdev config is just a single disk. I planned to run on 2-disk mirrors, but figured I'd start with fewer variables by using a single disk.
 
What is the make and model of the single drive?

EDIT: Sorry, I got mixed up with another thread and I did not realize you already said this.
 
What I see is that even in the absence of any virtualization, using zvols drops performance in such workloads by a factor of 5!
If you factor in your volblocksize, you will have even more fun. The benchmarked 4K stuff is actually not very relatable to the real world in most cases. PVE does not use 4K as the volblocksize, in favor of better compression and better "real life performance", so you will see read and write amplification.

I already commented on another thread of yours that you mix up ZVOL (block device for VM) and a normal ZFS dataset (filesystem for LXC), which have different goals and different implementations, e.g. the dataset has no comparable zvolblocksize, it has a recordsize which limits its maximum size. Larger recordsize will be better compressible and therefore much better in real world performance, yet also bad in synthetic fio tests.