zfs 2.3.0

I had a bit of time to look at this again.

TL;DR - there were optimizations both from Direct IO and from other areas.

Taking one of our PVE8 hosts with the same hardware and updating the kernel via apt install proxmox-kernel-6.14 (the same build as the latest PVE9 one), performance did increase by about 20% for:
Code:
rm /dev/zvol/rpool/data/test.file
fio --filename=/dev/zvol/rpool/data/test.file --name=sync_randrw --rw=randrw --bs=4M --direct=0 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --size=8G --loops=64 --group_reporting

But I don't really see a difference with this command when toggling direct between 0 and 1:
Code:
fio --filename=/dev/zvol/rpool/data/test.file --name=sync_randrw --rw=randrw --bs=4M --direct=0 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --size=8G --loops=64 --group_reporting

So it could very well be that other optimizations were at play.
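For anyone reproducing this: a quick sanity check that the newer kernel and matching ZFS module are actually active after the upgrade (package name as above, commands are generic) could look like:
Code:
# install the newer kernel on the PVE8 host and reboot into it
apt update && apt install proxmox-kernel-6.14
reboot

# afterwards, confirm the running kernel and the loaded ZFS module version
uname -r
zfs version
pveversion -v | grep -Ei 'kernel|zfs'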

Then focusing on a command to expose the difference between direct=0 and direct=1, I do see a 1.24x difference on PVE9 with identical hardware:
7422 MiB/s with ARC
9274 MiB/s with direct


Direct:
Code:
fio --name=directio-benefit \
    --filename=/dev/zvol/rpool/data/test.file \
    --rw=write \
    --bs=1M \
    --size=4G \
    --ioengine=libaio \
    --direct=1 \
    --numjobs=4 \
    --iodepth=64 \
    --runtime=60 \
    --loops=64 \
    --group_reporting


Cached/ARC:
Code:
fio --name=cached-comparison \
    --filename=/dev/zvol/rpool/data/test.file \
    --rw=write \
    --bs=1M \
    --size=4G \
    --ioengine=libaio \
    --direct=0 \
    --numjobs=4 \
    --iodepth=64 \
    --runtime=60 \
    --loops=64 \
    --group_reporting

So now I'm fairly sure the feature is in PVE9 and it does respond to being enabled. Actual performance gains highly depend on the workload and hardware. In the case above, we're using a ZFS RAID10 array with Samsung PM1743 7.68 TB drives (8 disks) on a Dell R6725 (dual socket) server.
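As a side note: if I remember correctly, OpenZFS 2.3 also exposes a per-dataset direct property (standard/always/disabled) that controls whether O_DIRECT requests are honored, so you can check or force it on the dataset used in the tests above (property name and values quoted from memory):
Code:
zfs get direct rpool/data
# optionally force Direct IO regardless of what the application requests
zfs set direct=always rpool/data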
 
So, in conclusion:
PVE8 + default kernel updated to the 6.14 kernel gives +20% on the 4M, 1 job / iodepth=1 randrw test, while direct=0 vs 1 makes no difference.
PVE9 + default 6.14 is 24% faster with direct=1 vs direct=0 on the sequential 1M, 4 jobs / iodepth=64 write test.
So ... going from the PVE8 default to the PVE9 default and using direct=1 theoretically gives something like a possible 100% * 1.2 * 1.24 ≈ 149% of the original performance ... or maybe not :-) Either way, a PVE update is worthwhile for both features and performance !!
:)
 
I had a raidz1 vdev with 2x8TB disks, giving 8TB of usable storage.
I expanded the vdev by adding 2 new disks, expecting to end up with 24TB = 3x8TB data + 1x8TB parity, using these commands:
Code:
zpool attach hdd-pool raidz1-0 /dev/disk/by-id/disk3
zpool attach hdd-pool raidz1-0 /dev/disk/by-id/disk4
But Proxmox reports that it has 16TB, so each new disk added only 4TB instead of 8TB.
[Attachment: photo_2025-08-22_17-07-45.jpg]
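(For reference, the state of the expansion and the per-vdev sizes can be checked with something like the following generic commands; output omitted here.)
Code:
zpool status -v hdd-pool        # shows whether the raidz expansion has finished
zpool list -v hdd-pool          # per-vdev SIZE / ALLOC / FREE
zfs list -o space hdd-pool      # available / used as reported by ZFS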

I didn't have a lot of data on ZFS, so I moved almost all data from this pool to an ext4 hard drive and then wrote it back to ZFS.
After moving the data I got the following output:
zfs list hdd-pool output:
Code:
NAME       USED  AVAIL  REFER  MOUNTPOINT
hdd-pool  1.08T  13.4T   112K  /hdd-pool
zpool list hdd-pool output:
Code:
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
hdd-pool  29.1T  2.15T  27.0T        -         -     2%     7%  1.00x    ONLINE  -

This means I have 1.08T of data and it occupies 2x the space, just like before.
It looks like it still writes data using a 1D+1P stripe (1 data + 1 parity)
instead of 3D+1P.

Can this be resolved, or do I have to move all my volumes, datasets and data to another drive, destroy this pool and recreate it from scratch with 4 disks?
 
TL;DR
zfs list after a vdev expansion shows incorrect numbers, and the Proxmox UI uses these values.


After I expanded the vdev from 1D+1P to 3D+1P and rewrote all my data to occupy the full stripe width, zfs list shows used, referenced and written values that are 1.5x lower than they should be (compressed data + overhead).

So, taking the numbers from the post above: 1.08T USED is the incorrect value; the real value is 1.08 * 1.5 = 1.62T.
That means 1.62T of final data has to be written as 3D+1P. ZFS spreads 1.62T across 3 data disks, i.e. 1.62T / 3 = 0.54T per disk, plus 0.54T of parity. 3 * 0.54T + 1 * 0.54T = 2.16T, which is the number we see as ALLOC.

I recalculated this for different datasets and volumes (taking logicalreferenced and refcompressratio to work out how much data needs to be spread across the disks), and the result is always the same: USED always shows 1.5 times less than the real value.
And because of that I always see ALLOC = 2 * USED instead of ALLOC = 1.33 * USED.
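(For anyone who wants to redo that calculation: the raw numbers can be pulled per dataset with something like the command below; hdd-pool/some-dataset is just a placeholder name.)
Code:
# -p prints exact parsable values instead of rounded human-readable ones
zfs get -p used,referenced,logicalreferenced,refcompressratio hdd-pool/some-dataset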
 