Extremely slow performance on ZFS

jollel

New Member
Feb 24, 2025
Norway
Hi everyone,

I've searched the forum endlessly and can't find anything that works for me. I’m struggling with extremely poor random read/write performance on my ZFS pool (rpool) in Proxmox, despite using enterprise SSDs and trying multiple optimizations. I’m an amateur when it comes to ZFS tuning, so I’d really appreciate any guidance!

System setup:

Proxmox Version: 8.3.3
Storage: 2x Kingston DC600M 480GB SATA SSDs (2 days old)
Server: HPE DL20 Gen9 (no HW RAID-controller)

FIO benchmark results:

Direct SSD performance (fio test on /dev/sda)

read: IOPS=91.3k, BW=357MiB/s write: IOPS=39.2k, BW=153MiB/s

ZFS pool performance:

read: IOPS=2686, BW=10.5MiB/s write: IOPS=1161, BW=4.6MiB/s
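For context on the numbers: bandwidth is just IOPS times block size, so the raw-device figures are consistent with a 4 KiB random-I/O job (the exact fio job file wasn't posted, so the block size here is an assumption):

```shell
# BW (MiB/s) = IOPS * block size (KiB) / 1024
# Assumes the fio runs used 4 KiB blocks -- not stated in the post.
iops_read=91300   # raw-device read IOPS from the post
bs_kib=4          # assumed fio block size in KiB
echo "$((iops_read * bs_kib / 1024)) MiB/s"   # prints 356 MiB/s, matching the reported ~357
```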


Things I’ve tried so far:

"sync=disabled" → No effect. "recordsize=16K" → No improvement. primarycache=metadata → No difference. atime=off → No change. logbias=throughput → No noticeable impact. zfs_txg_timeout=30 → No significant effect. ashift=12 → default. Ensured SSD alignment → Using 4K blocks. Confirmed VM caching is set to Write Back in Proxmox.

Tests done inside a Windows Server 2025 VM (with plenty of RAM and CPU available) show even lower performance. It's so poor that just opening the Control Panel can sometimes take 5-8 seconds, and if several simple operations run inside the VM at the same time, everything freezes until they finish.

Why is ZFS so much slower than direct SSD performance? What else can I tweak to improve performance?
I’m completely out of ideas at this point. Any help is highly appreciated!

Thanks in advance!
 
The first thing I notice is that this looks like a really slow SSD: only 357 MiB/s reads and 153 MiB/s writes? Is that sequential?

I'll give some thoughts on what you've tinkered with.

sync=disabled just forces sync writes to be handled as async. It has no other impact; writes are async by default anyway.
recordsize is a maximum block size; actual blocks vary dynamically between the size set via ashift and recordsize. A lower recordsize also makes it harder to get good compression: if recordsize matches ashift, you can't compress anything other than pure zeroes.
logbias: if you don't have a SLOG device, always keep it on latency; throughput will increase fragmentation.
A longer txg timeout can speed up async writes, provided you have enough free dirty cache to absorb them and the SSD itself is the bottleneck. If this had helped, it would have been somewhat masking the issue, since you would be writing to RAM instead of the disk.
ashift is the minimum block I/O size and should ideally match the underlying storage. 4K is usually the best value; even on SSDs with internal page sizes larger than 8K, the firmware will be optimised for 4K. ashift=12 gives you that.
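A quick way to confirm what the pool is actually using (a sketch; assumes OpenZFS 0.8 or later, with rpool as in the post):

```shell
# Read the pool's ashift property (12 means 4 KiB minimum block I/O).
zpool get ashift rpool

# Or pull it from the pool configuration via zdb.
zdb -C rpool | grep ashift
```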
For ZFS-backed VM disks you should use the default Proxmox cache setting of "No cache". ZFS has its own built-in dirty cache which isn't affected by the Linux page-cache system; if you also allow write caching through the page cache, you are double-caching writes.
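In Proxmox that means setting the disk's cache mode back to the default. As a sketch (the VM ID 100, disk scsi0, and volume name are placeholders for your own setup):

```shell
# Switch an existing virtual disk back to the default "No cache" mode.
# 100 is a placeholder VM ID; scsi0 is the disk in question.
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none
```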
Setting primarycache (ARC) to metadata-only won't speed things up and can actually hurt performance. The reasons to use metadata-only are either to avoid double read caching (e.g. InnoDB buffer pools), or, if you have a constrained ARC, to prioritise its usage for specific datasets only.
 