ZFS for storage: 4K random write/read values are very low

eyup51 · New Member · Nov 1, 2024
Hello everyone, I'm new to the forum.

I'm configuring a new server and plan to share storage via NFS. My setup will run Windows/Linux VMs on BL460c nodes, and I anticipate that 4K random write/read performance will significantly impact VM performance. I have a few questions and issues, so I'd appreciate any insights or experiences you can share.

Server Specifications for NFS File Server:
  • HPE DL380 Gen10 Server
  • 2x Gold 6150 CPUs (2.7-3.7 GHz)
  • 512GB or 1024GB 2400 MHz RAM
  • 8x 8TB Intel SSD DC P4510 U.2 NVMe SSDs (configured in ZFS RAID10)
  • TrueNAS Core

The core issue is a major performance drop that has left me uncertain about using ZFS. Despite running multiple fio tests based on forum and ChatGPT guidance, my results consistently show poor 4K random write performance, almost as if the disk is performing at one-tenth its capability. Here’s an example result:

  • Performance on ext4 or XFS: write IOPS = 189k, bandwidth = 738 MiB/s
  • Performance on ZFS: write IOPS = 28.3k, bandwidth = 110 MiB/s

Sample Test Results:
  • CPU: Ryzen 7900X
  • RAM: 192GB
  • Disks: 2x 4TB Nextorage SSD NE1N4TB
  • ZFS Configuration: Mirror (RAID1)
  • Block Size: 16K
  • Sync: Standard
  • Compression: LZ4
  • Ashift: 12
  • Atime: Off
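
For reference, a two-disk pool with the properties above could be created roughly as follows; the pool name and device paths are placeholders, not taken from the original setup:

Code:
# create a two-way mirror aligned to 4K sectors (assumed device paths)
zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1
# match the dataset properties listed above
zfs set recordsize=16k tank
zfs set compression=lz4 tank
zfs set atime=off tank
zfs set sync=standard tank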

Test Parameters Used:

Code:
fio --name=test --size=4G --filename=tempfile --bs=4K --rw=randwrite --ioengine=sync --numjobs=64 --iodepth=32 --runtime=60 --group_reporting

Results:

Code:
write: IOPS=4665, BW=18.2MiB/s (19.1MB/s)(1094MiB/60012msec); 0 zone resets
clat (usec): min=4, max=94195, avg=13710.46, stdev=5785.48
lat (usec): min=4, max=94195, avg=13710.57, stdev=5785.39
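
Note: fio's sync ioengine ignores --iodepth, so the run above was effectively queue depth 1 per job. A minimal async variant for comparison might look like this (the target path is a placeholder; --direct=1 is left out because O_DIRECT handling on ZFS varies by version):

Code:
fio --name=zfs-randwrite --filename=/tank/tempfile --size=4G --bs=4k \
    --rw=randwrite --ioengine=libaio --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting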


Questions:
  1. Why does 4K random write/read performance drop so drastically as soon as I use ZFS?
  2. If I add an SLOG or ZIL device, would it help improve these values? Since I'm already using NVMe drives, is an additional NVMe SLOG necessary? What percentage of improvement could I realistically expect?
  3. In a live environment with Proxmox QEMU virtualization, would low 4K random write/read values affect general VM performance (e.g., browsing) on Windows and Linux VMs?
  4. Proxmox documentation suggests that with RAIDZ2, I might only achieve the IOPS of a single disk. Given that ZFS on a single disk seems to perform 10x slower, would RAIDZ2 inherit this reduction?
  5. The specs of the P4510 U.2 NVMe list up to 637,000 IOPS for reads and 139,000 IOPS for writes. The source I linked shows 190,000 IOPS on XFS. With an 8-disk ZFS RAID10 setup, is it technically feasible to achieve 400K 4K random write IOPS?
  6. On the server I'm preparing, no VMs will run locally; it will only share storage via TrueNAS. Should RAM be 512GB or 1024GB?

Link to similar issue: extremely poor performance for ZFS 4k randwrite on NVMe compared to XFS

Thanks in advance for any guidance or experience you can share!
 
Hi. Here are my 2 cents:

1. Simpler file systems don't have to do the extra work that ZFS does.
2. An SLOG only helps with sync writes. It can also reduce wear on your primary NVMe (without an SLOG and with sync=standard, sync data is written twice to the same disk), but I don't expect a performance improvement from it here (a quick way to check the relevant settings is sketched after this list).
3. By default a virtual disk presents 512-byte sectors (for TRIM and similar), so a single guest write can turn into 8x writes against the host's 4K disks.
4. Within a disk group, ZFS waits for the slowest disk to finish writing; it doesn't matter whether it's raidz, raidz2, or raidz3. You can improve this by combining groups, e.g. raidz2 + raidz2 + raidz2 in one pool. Just follow the recommendations for how many disks to use per raidzX.
5. See answer 4.
6. The bigger the ZFS ARC, the fewer disk operations will be needed over time.
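
A minimal inspection/tuning sketch for the settings mentioned above, with placeholder pool and dataset names (not taken from this thread):

Code:
# inspect the properties that matter for VM write behaviour
zfs get sync,recordsize,compression,atime tank/vmstore
# zvol-backed VM disks fix volblocksize at creation time, e.g. 16K:
zfs create -V 100G -o volblocksize=16k tank/vmstore/vm-100-disk-0
# benchmarking only: disable sync writes to see how much of the gap is ZIL overhead
zfs set sync=disabled tank/vmstore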
 
Thanks for the reply.

So, if I configure 8 P4510 drives in a ZFS RAID10 (four mirror vdevs), I would get roughly the IOPS of 4 mirrors as in the example test, i.e. about 28.3K x 4 ≈ 113K 4K write IOPS. However, if I set up these 8 P4510 drives in a single RAIDZ2, my 4K random write IOPS would only be about 28.3K.

This is truly a substantial IOPS loss. It leads me to conclude that while ZFS is excellent for data integrity and security, its 4K write performance is incredibly poor. ARC and L2ARC can greatly improve read performance through caching, and ZFS offers good bandwidth and write performance for large-block files. But if we're considering ZFS for VM infrastructure or for performance-demanding systems like SQL databases, it doesn't seem like the right choice at all.
 
Keep in mind that in raidzX your data is split across the disks and multiplied by parity. For example, in a raidz2 with 6 drives (2 of them parity), IOPS counts as 4 x the slowest disk's IOPS.
But also keep in mind that ZFS is a COW (copy-on-write) system: what software (fio, SQL, ...) issues as random writes does not reach the disk as in-place random writes.
 
If you want to use it like a regular RAID10, for example raidz2 + raidz2, you have to count IOPS as the first raidz2 group's IOPS + the second raidz2 group's IOPS.
 
For 8 disks, the fault tolerance in RAID10 is up to 4 disks (one per mirror). If I combine my 8 disks into 2 separate raidz2 groups, my fault tolerance is still up to 4 disks, but will I get higher IOPS than RAID10?
 
If you want to use 8 disks in 2 raidz2 groups, it will look something like this:

Code:
zfs_pool
    raidz2-0
        disk-1
        disk-2
        disk-3
        disk-4
    raidz2-1
        disk-5
        disk-6
        disk-7
        disk-8

In each raidz2 group, 2 disks can die. In the best case you can survive a total of 4 disks lost, but keep in mind: if 3 disks are lost in a single group, the whole pool is lost.
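
A hedged sketch of how that two-group layout could be created; the pool name "zfs_pool" follows the diagram above, and the device paths are placeholders:

Code:
zpool create zfs_pool \
    raidz2 /dev/disk/by-id/disk-1 /dev/disk/by-id/disk-2 /dev/disk/by-id/disk-3 /dev/disk/by-id/disk-4 \
    raidz2 /dev/disk/by-id/disk-5 /dev/disk/by-id/disk-6 /dev/disk/by-id/disk-7 /dev/disk/by-id/disk-8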

I suggest you use this calculator https://wintelguy.com/zfs-calc.pl to compare disk combinations against usable space. You can't always use 100% of the raw disk space.


As for IOPS, you can estimate it as the number of groups x the IOPS of the slowest group.
 
