ZFS+VM/LXC disk performance benchmarking, part 1 (ZFS=slow?)

sevimo

New Member
Mar 2, 2025
I have resumed disk performance testing (see also the initial thread at https://forum.proxmox.com/threads/sync-writes-to-zfs-zvol-disk-are-not-sync-under-pve.163066/), and stumbled on some more non-intuitive results. I'll group them into separate threads, as the symptoms are different.

So overall I am running a bunch of different fio tests, comparing on-host performance with the same tests inside a VM and an LXC container. The focus was on ZFS, but I also ran some tests using LVM and thin LVM to try to pinpoint issues. Most fio tests are done against block devices, except the LXC ones, which write to a file on the root disk (backed by such a block device). I was unable to pass a block device directly to a container, though I did not try very hard. Tests are done on an Intel P4510 NVMe SSD (4TB).

So in part 1 I'll focus on tests that seem to show that ZFS zvols are unreasonably slow, even without any virtualization.

Test 1 is the same as in the last thread, 10G of 4k writes at iodepth=1 with sync=1:

IOPS:
raw partition, host: 84,000
zvol, host: 18,000
zvol, VM: 6,800
zvol, LXC: 18,000
LVM, host: 84,000
zvol, host, sync=0: 48,000

Code:
fio --filename=/dev/zvol/b-zmirror1-nvme01/vm-101-disk-0 --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 --numjobs=1 --sync=1
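Since iodepth=1 serializes the IOs, these IOPS numbers map directly onto average per-IO latency (1e6 / IOPS microseconds). A quick back-of-the-envelope conversion of the raw/zvol/VM numbers above:

```shell
# At iodepth=1 each IO completes before the next is issued,
# so latency_us = 1e6 / IOPS.
for iops in 84000 18000 6800; do
  awk -v i="$iops" 'BEGIN { printf "%d IOPS -> %.1f us/IO\n", i, 1e6 / i }'
done
```

That is roughly 12 us per IO on the raw partition versus ~56 us per IO on the zvol (host) and ~147 us in the VM, which puts the extra per-IO cost in perspective.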

Let's ignore the VM and LXC numbers for now, and just look at performance on the host. What I see is that even in the absence of any virtualization, using zvols drops performance in such workloads by a factor of 5! I understand that ZFS is not exactly a speed demon, but that drop sounds excessive (and it gets worse under VMs). I don't see any unusually high CPU usage, so it doesn't seem to be CPU-bound.

Now, I am not sure whether this is how ZFS results are supposed to look? Perhaps it is just what it is, and I simply need to adjust my expectations of ZFS performance.
There is some indication that it might be the case: https://forum.proxmox.com/threads/high-iops-in-host-low-iops-in-vm.145268/
The IOPS numbers reported there are almost the same (though they didn't test a zvol on the host), so perhaps that's normal, or I should say expected? Just for reference, LVM runs at pretty much host speed under the same test conditions.
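For concreteness, here are the slowdown ratios implied by the IOPS table above (a rough calculation, using the numbers exactly as reported):

```shell
# Slowdown of zvol-backed tests relative to the raw partition on the host,
# using the IOPS figures from the table above.
awk 'BEGIN {
  printf "raw host vs zvol host: %.1fx\n", 84000 / 18000;
  printf "raw host vs zvol VM:   %.1fx\n", 84000 / 6800;
}'
```

So the host-side zvol penalty is already close to 5x, and the VM stacks up to over 12x versus the raw partition.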

Even with sync=0 the test runs almost twice as slow on ZFS compared to a raw partition (or LVM) with sync=1! Random read mode exhibits similar slowdowns, so it is not restricted to writes.

Bottom line: is a several-times slowdown on zvols with certain workloads expected?

Config data:
VM uname -a (Debian 12 live):
Code:
Linux fiotest 6.1.0-29-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.123-1 (2025-01-02) x86_64 GNU/Linux

LXC uname -a (Debian 12):
Code:
Linux fiopct 6.8.12-10-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-10 (2025-04-18T07:39Z) x86_64 GNU/Linux

pveversion -v
Code:
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8: 6.8.12-10
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph: 19.2.1-pve3
ceph-fuse: 19.2.1-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

ZFS properties:
Code:
Exceeded message limit, but nothing overly exciting here: sync=standard, compression=on

cat /etc/pve/qemu-server/101.conf
Code:
agent: 1
boot: order=ide2;net0
cores: 4
cpu: x86-64-v2-AES
ide2: local:iso/debian-live-12.9.0-amd64-standard.iso,media=cdrom,size=1499968K
memory: 2048
meta: creation-qemu=9.0.2,ctime=1740711583
name: fiotest
net0: virtio=BC:24:11:0F:05:39,bridge=vmbr0,firewall=1,tag=11
numa: 0
ostype: l26
scsi0: b-lvm-thk-nvme01:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=11G,ssd=1
scsi1: b-zmirror1-nvme01:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=11G,ssd=1
scsi2: ztest:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=111G,ssd=1
scsi3: b-lvm-thn-nvme01:vm-101-disk-0,aio=io_uring,backup=0,cache=none,discard=on,iothread=1,replicate=0,size=111G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7e99887f-d252-43b2-9d91-7027cb2f84c8
sockets: 1
vmgenid: 7b4eb2f9-ab24-4264-ac43-d9602341249b

cat /etc/pve/lxc/105.conf
Code:
arch: amd64
cores: 4
features: nesting=1
hostname: fiopct
memory: 2048
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=BC:24:11:7D:CF:EC,ip=dhcp,tag=11,type=veth
ostype: debian
rootfs: b-zmirror1-nvme01:subvol-105-disk-0,mountoptions=discard;lazytime,size=120G
swap: 0
unprivileged: 1
 
Different issue, but for those interested in how CPU-intensive ZFS is, I ran a max sequential write (on the host, not sync'ed) like this:

Code:
fio --filename=/dev/mapper/b--lvm--thk--nvme01-vm--101--disk--0 --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=1M --direct=1 --iodepth=64 --numjobs=1 --sync=0

A raw partition and thick LVM had no issues maxing out disk bandwidth at ~2.7GB/s, and LVM's CPU utilization never went above 5%. ZFS was capped at 100% CPU and only reached about ~1.6GB/s. This is on '8 x Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (1 Socket)' - not exactly a high-performance CPU, but the difference in utilization is pretty stark. Nothing else was running on the host (pre- and post-test CPU utilization <1%).
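A rough efficiency comparison from those numbers, taking the reported utilization figures at face value (they are approximations from the post, not measured here):

```shell
# Throughput delivered per percentage point of CPU, from the figures above:
# LVM ~2700 MB/s at ~5% CPU, ZFS ~1600 MB/s at 100% CPU.
awk 'BEGIN {
  printf "LVM: %.0f MB/s per CPU%%\n", 2700 / 5;
  printf "ZFS: %.0f MB/s per CPU%%\n", 1600 / 100;
}'
```

Even allowing for imprecise utilization readings, that is more than an order of magnitude difference in CPU cost per byte written.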
 
ZFS has a lot of overhead because of all its features. It also requires a lot of sync writes, but your drives (guessing from another post) should be able to handle that. Personally, I don't mind a 50% performance loss in exchange for the checksumming and other features. Feel free not to use ZFS. Maybe test BTRFS, which has a lot of the same features? I'm considering switching myself.
 
My intent is not to dissuade anyone from using ZFS, and that includes not dissuading myself :) I like ZFS a lot, and will likely use it in many cases, as my workloads are not commonly constrained by disk IO. However, I was taken aback by the performance differences on _some_ workloads. As I haven't seen definitive statements like "sync'ed low-queue-depth workloads might be an order of magnitude slower on ZFS", I assumed (and still hope) that it is something I am doing wrong, and that it can be significantly improved. 50% I could accept, but I am looking at 5-10x in some cases. That's not something that can be solved by simply throwing better resources (faster disks/CPU) at it.

And BTRFS is probably the last system I would consider, given its history and stability issues (real or perceived) :) If what I am seeing for zvols is typical, the answer will be to use ZFS where disk IO is not a constraint at all (a lot of use cases), and where IO matters, to stick with something else - perhaps thick LVM for maximum performance; even thin LVM might be acceptable. But on a case-by-case basis.

I will also need to see whether this performance difference persists when downgrading to SATA SSDs, or whether it only shows up with very fast NVMe drives.
 
There have been YouTube videos about Linux not keeping up with lots of (event-driven) NVMe drives. I'm not sure what kind of ZFS vdev configuration you use; RAIDz1/2/3 is really unsuitable for VMs. But then again, your host-side performance tests don't even include VM overhead. Maybe you're on to something real that ZFS on Linux can fix.
 
The vdev config is just a single disk. I planned to run on 2-disk mirrors, but figured I'd start with fewer variables by using a single disk.
 
What is the make and model of the single drive?

EDIT: Sorry, I got mixed up with another thread and I did not realize you already said this.
 
What I see is that even in the absence of any virtualization, using zvols drops performance in such workloads by a factor of 5!
If you factor in your volblocksize, you will have even more fun. The benchmarked 4K stuff is actually not very relatable to the real world in most cases. PVE does not use 4K as the volblocksize, in favor of better compression and better "real life performance", so you will see read and write amplification.

I already commented on another thread of yours that you mix up ZVOL (block device for VM) and a normal ZFS dataset (filesystem for LXC), which have different goals and different implementations, e.g. the dataset has no comparable zvolblocksize, it has a recordsize which limits its maximum size. Larger recordsize will be better compressible and therefore much better in real world performance, yet also bad in synthetic fio tests.