ZFS for storage: 4K random write/read values are very low

eyup51 · New Member · Nov 1, 2024
Hello everyone, I'm new to the forum.

I'm configuring a new server and plan to share storage via NFS. My setup will run Windows/Linux VMs on BL460c nodes, and I anticipate that 4K random write/read performance will significantly impact VM performance. I have a few questions and issues, so I'd appreciate any insights or experiences you can share.

Server Specifications for NFS File Server:
  • HPE DL380 Gen10 Server
  • 2x Gold 6150 CPUs (2.7-3.7 GHz)
  • 512GB or 1024GB 2400 MHz RAM
  • 8x 8TB Intel SSD DC P4510 U.2 NVMe SSDs (configured in ZFS RAID10)
  • TrueNAS Core

The core issue is a major performance drop that has left me uncertain about using ZFS. Despite running multiple fio tests based on forum and ChatGPT guidance, my results consistently show poor 4K random write performance, almost as if the disk is performing at one-tenth its capability. Here’s an example result:

  • Performance on ext4 or XFS: write IOPS = 189k, bandwidth = 738 MiB/s
  • Performance on ZFS: write IOPS = 28.3k, bandwidth = 110 MiB/s

Sample Test Results:
  • CPU: Ryzen 7900X
  • RAM: 192GB
  • Disks: 2x 4TB Nextorage SSD NE1N4TB
  • ZFS Configuration: Mirror (RAID1)
  • Block Size: 16K
  • Sync: Standard
  • Compression: LZ4
  • Ashift: 12
  • Atime: Off
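
For reference, a two-disk pool with the properties above could be created roughly as follows; the pool name and device paths are placeholders, not taken from the original setup:

Code:
# create a two-way mirror aligned to 4K sectors (assumed device paths)
zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1
# match the dataset properties listed above
zfs set recordsize=16k tank
zfs set compression=lz4 tank
zfs set atime=off tank
zfs set sync=standard tank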

Test Parameters Used:

Code:
fio --name=test --size=4G --filename=tempfile --bs=4K --rw=randwrite --ioengine=sync --numjobs=64 --iodepth=32 --runtime=60 --group_reporting

Results:

Code:
write: IOPS=4665, BW=18.2MiB/s (19.1MB/s)(1094MiB/60012msec); 0 zone resets
clat (usec): min=4, max=94195, avg=13710.46, stdev=5785.48
lat (usec): min=4, max=94195, avg=13710.57, stdev=5785.39
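
Note: fio's sync ioengine ignores --iodepth, so the run above was effectively queue depth 1 per job. A minimal async variant for comparison might look like this (the target path is a placeholder; --direct=1 is left out because O_DIRECT handling on ZFS varies by version):

Code:
fio --name=zfs-randwrite --filename=/tank/tempfile --size=4G --bs=4k \
    --rw=randwrite --ioengine=libaio --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting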


Questions:
  1. Why does 4K random write/read performance drop so drastically as soon as I use ZFS?
  2. If I add an SLOG or ZIL device, would it help improve these values? Since I'm already using NVMe drives, is an additional NVMe SLOG necessary? What percentage of improvement could I realistically expect?
  3. In a live environment with Proxmox QEMU virtualization, would low 4K random write/read values affect general VM performance (e.g., browsing) on Windows and Linux VMs?
  4. Proxmox documentation suggests that with RAIDZ2, I might only achieve the IOPS of a single disk. Given that ZFS on a single disk seems to perform 10x slower, would RAIDZ2 inherit this reduction?
  5. The specs of the P4510 U.2 NVMe list up to 637,000 IOPS for reads and 139,000 IOPS for writes. The source I linked shows 190,000 IOPS on XFS. With an 8-disk ZFS RAID10 setup, is it technically feasible to achieve 400K 4K random write IOPS?
  6. On the server I'm preparing, no VMs will run locally; it will only share storage via TrueNAS. Should RAM be 512GB or 1024GB?

Link to similar issue: extremely poor performance for ZFS 4k randwrite on NVMe compared to XFS

Thanks in advance for any guidance or experience you can share!
 
Hi. Here are my 2 cents:

1. Simpler file systems don't have to do the extra work that ZFS does.
2. An SLOG only helps with sync writes. It can also reduce wear on your primary NVMe (without an SLOG and with sync=standard, sync data is written twice to the same disk), but I don't expect a performance improvement from it here (a quick way to check the relevant settings is sketched after this list).
3. By default a virtual disk presents 512-byte sectors (for TRIM and similar), so a single guest write can turn into 8x writes against the host's 4K disks.
4. Within a disk group, ZFS waits for the slowest disk to finish writing; it doesn't matter whether it's raidz, raidz2, or raidz3. You can improve this by combining groups, e.g. raidz2 + raidz2 + raidz2 in one pool. Just follow the recommendations for how many disks to use per raidzX.
5. See answer 4.
6. The bigger the ZFS ARC, the fewer disk operations will be needed over time.
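
A minimal inspection/tuning sketch for the settings mentioned above, with placeholder pool and dataset names (not taken from this thread):

Code:
# inspect the properties that matter for VM write behaviour
zfs get sync,recordsize,compression,atime tank/vmstore
# zvol-backed VM disks fix volblocksize at creation time, e.g. 16K:
zfs create -V 100G -o volblocksize=16k tank/vmstore/vm-100-disk-0
# benchmarking only: disable sync writes to see how much of the gap is ZIL overhead
zfs set sync=disabled tank/vmstore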
 
Thanks for the reply.

So, if I configure 8 P4510 drives in a ZFS RAID10 (four mirror vdevs), I would get roughly the IOPS of 4 mirrors as in the example test, i.e. about 28.3K x 4 ≈ 113K 4K write IOPS. However, if I set up these 8 P4510 drives in a single RAIDZ2, my 4K random write IOPS would only be about 28.3K.

This is truly a substantial IOPS loss. It leads me to conclude that while ZFS is excellent for data integrity and security, its 4K write performance is incredibly poor. ARC and L2ARC can greatly improve read performance through caching, and ZFS offers good bandwidth and write performance for large-block files. But if we're considering ZFS for VM infrastructure or for performance-demanding systems like SQL databases, it doesn't seem like the right choice at all.
 
Keep in mind that in raidzX your data is split across the disks and multiplied by parity. For example, in a raidz2 with 6 drives (2 of them parity), IOPS counts as 4 x the slowest disk's IOPS.
But also keep in mind that ZFS is a COW (copy-on-write) system: what software (fio, SQL, ...) issues as random writes does not reach the disk as in-place random writes.
 
If you want to use it like a regular RAID10, for example raidz2 + raidz2, you have to count IOPS as the first raidz2 group's IOPS + the second raidz2 group's IOPS.
 
For 8 disks, the fault tolerance in RAID10 is up to 4 disks (one per mirror). If I combine my 8 disks into 2 separate raidz2 groups, my fault tolerance is still up to 4 disks, but will I get higher IOPS than RAID10?
 
If you want to use 8 disks in 2 raidz2 groups, it will look something like this:

Code:
zfs_pool
    raidz2-0
        disk-1
        disk-2
        disk-3
        disk-4
    raidz2-1
        disk-5
        disk-6
        disk-7
        disk-8

In each raidz2 group, 2 disks can die. In the best case you can survive a total of 4 disks lost, but keep in mind: if 3 disks are lost in a single group, the whole pool is lost.
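
A hedged sketch of how that two-group layout could be created; the pool name "zfs_pool" follows the diagram above, and the device paths are placeholders:

Code:
zpool create zfs_pool \
    raidz2 /dev/disk/by-id/disk-1 /dev/disk/by-id/disk-2 /dev/disk/by-id/disk-3 /dev/disk/by-id/disk-4 \
    raidz2 /dev/disk/by-id/disk-5 /dev/disk/by-id/disk-6 /dev/disk/by-id/disk-7 /dev/disk/by-id/disk-8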

I suggest you use this calculator https://wintelguy.com/zfs-calc.pl to compare disk combinations against usable space. You can't always use 100% of the raw disk space.


As for IOPS, you can estimate it as the number of groups x the IOPS of the slowest group.
 
