[SOLVED] High IO Delay

I don't have a particularly write-heavy environment, which is why the crappy disks I currently have have been working fine for the last 3 months (save for the occasional spike).

Thanks, everyone, for all the feedback!
I'll update when I get them up and running.
 
UPDATE:
I got the new Intel SSDs today, and it seems that was indeed the issue. The IO delay problems are now gone. Thanks for all the help!
It seems the new drives are SATA. Is that correct? All the recommendations seem to be for NVMe drives.
 
It seems the new drives are SATA. Is that correct? All the recommendations seem to be for NVMe drives.
Wasn't aware of that, but the SATA drives do the job. NVMe probably would've been overkill. I think the main problem is just that my drives didn't have PLP, so sync write performance sucks.
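
If anyone wants to see the PLP effect on their own hardware, a small synchronous-write benchmark with fio makes it visible. This is only a rough sketch; the target path and job parameters below are placeholders to adapt, not values from this thread:
Code:
# 4k synchronous single-queue writes - the pattern that exposes missing PLP.
# Point --filename at a file on the pool you want to test (placeholder path).
fio --name=sync-write-test \
    --filename=/zfs/fio-testfile --size=1G \
    --rw=write --bs=4k --ioengine=psync --sync=1 \
    --numjobs=1 --runtime=60 --time_based

Drives without PLP typically drop to a few hundred IOPS in this kind of test because every write has to reach the flash, while enterprise drives can acknowledge sync writes from their protected cache.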
 
Wasn't aware of that, but the SATA drives do the job. NVMe probably would've been overkill. I think the main problem is just that my drives didn't have PLP, so sync write performance sucks.

These seem to be MLC. I am suffering the same fate: I have a cluster with Samsung EVO drives, also consumer grade, and the same thing is happening. Any sustained write, like migrating VMs or changing ZFS storage, makes the machine unusable until the write is complete. I guess we will try a couple of these, since SATA will be easier than NVMe for the machines we are currently using.

As in your situation, this setup works fine most of the time... it's only the big syncs that hurt.

Thanks for responding so quickly.
 
Hello everyone,
We are experiencing similar issues with one of our Proxmox nodes, and I am wondering if adding vdevs would help to decrease the IO delay.

We are running a pool with the following configuration:
Code:
zfs
  mirror-0
    ata-Samsung_SSD_860_EVO_1TB
    ata-Samsung_SSD_860_EVO_1TB
  mirror-1
    ata-Samsung_SSD_860_EVO_1TB
    ata-Samsung_SSD_870_EVO_1TB
  mirror-2
    ata-Samsung_SSD_870_EVO_1TB
    ata-Samsung_SSD_870_EVO_1TB
  mirror-3
    ata-CT2000BX500SSD1_2TB
    ata-CT2000BX500SSD1_2TB

The system hosts Windows 10/11 VMs for testing purposes; they are usually used for a single test and then purged. The VMs become slow/laggy as soon as larger files are copied to one of them. Currently we run 13 VMs in parallel and handle fairly large amounts of test data. CPU and RAM are fine.
Are SATA enterprise SSDs suitable, and are they my only option? Or could I go the route of adding another mirror vdev to reduce the IO delay?
If enterprise SSDs are the only way, are any of these good options?
- Intel DC S4500
- Samsung PM893
- Samsung PM863
- Micron 5400 Pro

Thanks
Max
 
Short answer: yes, these drives are suitable. An alternative (if the budget is tight) would be used enterprise drives. Besides Samsung, Intel and Micron there are also Kingston's DC drives. For 13 VMs I would prefer a striped mirror setup (aka RAID10). Depending on your workload you could add 2x NVMe with PLP as a mirrored SLOG to your striped mirror.
 
From your replies I take away the following:
- the BX500s are useless for our application
- the Samsung EVOs are not great but do the trick?
- upgrading to Kingston DC, Samsung PM-series or Intel DC S4500 would be a good option

What would a SLOG do for me? I assume that in an all-flash array NVMe is the only sensible choice for it, since it is much faster than the SATA SSDs? That will be a bit challenging as I am running out of PCIe lanes :D

Cheers
Max
 
In production use I would never choose consumer SSDs for anything. They are a totally different beast and not suitable for ZFS loads. It's not only about PLP (power loss protection) but also the much larger cache and how writes are handled. A setup of enterprise SSDs in a striped mirror can be sufficient, but it is not a powerhouse compared to an equivalent setup of NVMe drives.

A SLOG attached to a ZFS RAID does the following:

- In ZFS, synchronous writes are first written to the ZFS Intent Log (ZIL) before being committed to the main pool
- The SLOG (Separate Log device) is an optional, dedicated device for storing this ZIL
- It doesn’t store data permanently — it’s a short-term landing zone for synchronous writes to ensure they are safely committed in case of a crash or power loss
- Workloads that generate many small synchronous writes (databases, NFS/SMB with sync=always, VM workloads) benefit the most

Why mirrored?

- The SLOG is a single point of failure for synchronous write integrity. If you have only one SLOG device and it fails, you risk losing the in-flight synchronous writes should the host also crash or lose power
- A mirrored SLOG provides redundancy for the ZIL and ensures that even if one NVMe dies, the other still contains valid, crash-consistent logs

Performance gains with SLOG in a RAID1+0 of SSDs:

- All synchronous writes still wait for ZIL commit before returning ACK to the client
- Without a dedicated SLOG, the ZIL is stored on the main pool → each sync write hits your RAID10 vdevs twice (once for the ZIL, once for the actual data)
- With a dedicated SLOG the ZIL writes happen on the NVMe mirror instead of on the main pool → freeing the RAID10 from that extra write burden
- Main pool can focus on normal data writes, improving throughput
- Latency is reduced because the NVMe SLOG completes the ZIL commit much faster
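
To make the SLOG part concrete: attaching a mirrored log vdev is a single command. A minimal sketch, assuming the pool is called zfs and using placeholder device names (substitute the stable /dev/disk/by-id paths of your actual NVMe drives):
Code:
# Add two NVMe drives (with PLP) as a mirrored SLOG to the existing pool "zfs"
zpool add zfs log mirror \
  /dev/disk/by-id/nvme-EXAMPLE_DRIVE_1 \
  /dev/disk/by-id/nvme-EXAMPLE_DRIVE_2

# Verify that the log vdev shows up next to the data mirrors
zpool status zfs

If it turns out not to help, a log vdev can be removed again with zpool remove, so trying it is fairly low-risk.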
 
In production use I would never choose consumer SSDs for anything. They are a totally different beast and not suitable for ZFS loads. It's not only about PLP (power loss protection) but also the much larger cache and how writes are handled. A setup of enterprise SSDs in a striped mirror can be sufficient, but it is not a powerhouse compared to an equivalent setup of NVMe drives.

A SLOG attached to a ZFS RAID does the following:

- In ZFS, synchronous writes are first written to the ZFS Intent Log (ZIL) before being committed to the main pool
- The SLOG (Separate Log device) is an optional, dedicated device for storing this ZIL
- It doesn’t store data permanently — it’s a short-term landing zone for synchronous writes to ensure they are safely committed in case of a crash or power loss
- Workloads that generate many small synchronous writes (databases, NFS/SMB with sync=always, VM workloads) benefit the most

Why mirrored?

- The SLOG is a single point of failure for synchronous write integrity. If you have only one SLOG device and it fails, you risk losing the in-flight synchronous writes should the host also crash or lose power
- A mirrored SLOG provides redundancy for the ZIL and ensures that even if one NVMe dies, the other still contains valid, crash-consistent logs

Performance gains with SLOG in a RAID1+0 of SSDs:

- All synchronous writes still wait for ZIL commit before returning ACK to the client
- Without a dedicated SLOG, the ZIL is stored on the main pool → each sync write hits your RAID10 vdevs twice (once for the ZIL, once for the actual data)
- With a dedicated SLOG the ZIL writes happen on the NVMe mirror instead of on the main pool → freeing the RAID10 from that extra write burden
- Main pool can focus on normal data writes, improving throughput
- Latency is reduced because the NVMe SLOG completes the ZIL commit much faster
Thank you very much! That answered the question for me.
One more thing I could not figure out yet: how large does the SLOG SSD need to be? It looks like a decent size such as 512G would be plenty. Is that correct?

Cheers
Max
 
Yes, SLOG devices don't need to be big. The SLOG only holds the ZIL data of roughly the last 5 seconds before it is committed to the main pool. Even for high-end storage, 64GB is more than enough.
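
As a rough back-of-the-envelope check (the throughput number below is an illustrative assumption, not a measurement from this pool):
Code:
# The SLOG only has to hold the sync writes of the transaction groups in flight:
#   size ≈ max sync write throughput × txg interval (zfs_txg_timeout, default 5 s) × ~3 txgs
# Assuming a 10 GbE ingest path (~1.25 GB/s):
#   1.25 GB/s × 5 s × 3 ≈ 19 GB
# So a 16-64 GB device or partition is plenty; a 512G SSD would mostly sit idle.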
 
Hello everyone,
We are experiencing similar issues with one of our Proxmox nodes, and I am wondering if adding vdevs would help to decrease the IO delay.

We are running a pool with the following configuration:
Code:
zfs
  mirror-0
    ata-Samsung_SSD_860_EVO_1TB
    ata-Samsung_SSD_860_EVO_1TB
  mirror-1
    ata-Samsung_SSD_860_EVO_1TB
    ata-Samsung_SSD_870_EVO_1TB
  mirror-2
    ata-Samsung_SSD_870_EVO_1TB
    ata-Samsung_SSD_870_EVO_1TB
  mirror-3
    ata-CT2000BX500SSD1_2TB
    ata-CT2000BX500SSD1_2TB

The system hosts Windows 10/11 VMs for testing purposes; they are usually used for a single test and then purged. The VMs become slow/laggy as soon as larger files are copied to one of them. Currently we run 13 VMs in parallel and handle fairly large amounts of test data. CPU and RAM are fine.
Are SATA enterprise SSDs suitable, and are they my only option? Or could I go the route of adding another mirror vdev to reduce the IO delay?
If enterprise SSDs are the only way, are any of these good options?
- Intel DC S4500
- Samsung PM893
- Samsung PM863
- Micron 5400 Pro

Thanks
Max

Were those drives ever trimmed?
What is the output of these commands?
Code:
zpool list -v
Code:
zpool status -t *pool_name*
example: zpool status -t rpool
 
Were those drives ever trimmed?
What is the output of these commands?
Code:
zpool list -v
Code:
zpool status -t *pool_name*
example: zpool status -t rpool
It doesn't matter whether these have been trimmed or not. With this kind of workload, consumer drives in ZFS mirrors won't get meaningfully faster either way.
 
It doesn't matter whether these have been trimmed or not. With this kind of workload, consumer drives in ZFS mirrors won't get meaningfully faster either way.
I know. I just asked because I was having super high IO delay and after trimming it got a lot better. Not enterprise-SSD territory, but still...
Anyway, the drives were never trimmed (1 year of use).
 
Code:
zpool list -v
NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zfs                                              4.53T  1.89T  2.64T        -         -    28%    41%  1.00x    ONLINE  -
  mirror-0                                        928G   803G   125G        -         -    50%  86.5%      -    ONLINE
    ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844429M   932G      -      -        -         -      -      -      -    ONLINE
    ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844377K   932G      -      -        -         -      -      -      -    ONLINE
  mirror-1                                        928G   456G   472G        -         -    35%  49.1%      -    ONLINE
    ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844464P   932G      -      -        -         -      -      -      -    ONLINE
    ata-Samsung_SSD_870_EVO_1TB_S6PUNM0T618251X   932G      -      -        -         -      -      -      -    ONLINE
  mirror-2                                        928G   414G   514G        -         -    31%  44.6%      -    ONLINE
    ata-Samsung_SSD_870_EVO_1TB_S6PUNX0RC03088K   932G      -      -        -         -      -      -      -    ONLINE
    ata-Samsung_SSD_870_EVO_1TB_S626NJ0R162083M   932G      -      -        -         -      -      -      -    ONLINE
  mirror-3                                       1.81T   264G  1.55T        -         -    14%  14.2%      -    ONLINE
    ata-CT2000BX500SSD1_2237E665E4D3             1.82T      -      -        -         -      -      -      -    ONLINE
    ata-CT2000BX500SSD1_2237E665E63F             1.82T      -      -        -         -      -      -      -    ONLINE

Code:
zpool status -t zfs
  pool: zfs
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:44:18 with 0 errors on Sun Aug 10 01:08:20 2025
config:

        NAME                                             STATE     READ WRITE CKSUM
        zfs                                              ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844429M  ONLINE       0     0     0  (untrimmed)
            ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844377K  ONLINE       0     0     0  (untrimmed)
          mirror-1                                       ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844464P  ONLINE       0     0     0  (untrimmed)
            ata-Samsung_SSD_870_EVO_1TB_S6PUNM0T618251X  ONLINE       0     0     0  (untrimmed)
          mirror-2                                       ONLINE       0     0     0
            ata-Samsung_SSD_870_EVO_1TB_S6PUNX0RC03088K  ONLINE       0     0     0  (untrimmed)
            ata-Samsung_SSD_870_EVO_1TB_S626NJ0R162083M  ONLINE       0     0     0  (untrimmed)
          mirror-3                                       ONLINE       0     0     0
            ata-CT2000BX500SSD1_2237E665E4D3             ONLINE       0     0     0  (untrimmed)
            ata-CT2000BX500SSD1_2237E665E63F             ONLINE       0     0     0  (untrimmed)

errors: No known data errors

Looks like they were not trimmed yet - whoopsie.

I guess I should enable autotrim then?

Max
 
Edit: not sure if being in a mirror is a problem. It’s better to wait for other people’s input on this.
But,

You can do it manually with
Code:
zpool trim *pool-name*

and check the progress with
Code:
zpool status -t *pool-name*

You will see improvement.
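
Regarding the autotrim question: ZFS also has a pool-level autotrim property that trims freed blocks continuously instead of in one big pass. A minimal sketch, assuming the pool name zfs from your output:
Code:
# Trim freed blocks continuously as they are released
zpool set autotrim=on zfs
# Check the current setting
zpool get autotrim zfs

Running a full zpool trim occasionally is still considered good practice even with autotrim enabled.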
 
About autotrim, if you

Code:
cat /etc/cron.d/zfsutils-linux

You will see that TRIM and SCRUB are already scheduled, but in your case only SCRUB is being executed.
I have one node where TRIM was never executed either, but I could not figure out exactly why. It has to do with the script that is executed, which checks whether TRIM is supported.
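
If someone wants to dig into why the monthly job skips a pool, these are the places to look (the helper path is how Debian's zfsutils-linux package ships it; treat it as an assumption and adjust for your system):
Code:
# Show the scheduled TRIM/SCRUB cron entries
cat /etc/cron.d/zfsutils-linux
# Show the helper the TRIM job calls and the conditions it checks before trimming a pool
cat /usr/lib/zfs-linux/trim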
 
About autotrim, if you

Code:
cat /etc/cron.d/zfsutils-linux

You will see that TRIM and SCRUB are already scheduled, but in your case only SCRUB is being executed.
I have one node where TRIM was never executed either, but I could not figure out exactly why. It has to do with the script that is executed, which checks whether TRIM is supported.
You are right. The script is supposed to run TRIM once a month, but that does not happen. I will have to investigate why it does not work.
For now I was able to trigger TRIM manually, which seemingly helped. I will still change the SSDs to enterprise grade as soon as possible and, if possible, add the NVMe SSDs for a SLOG.

Thank you very much everybody for helping out!

Max