[SOLVED] High IO Delay

I don't have a particularly write-heavy environment, which is why the crappy disks I currently have have been working fine for the last 3 months (save for the occasional spike).

Thanks, everyone, for all the feedback!
I'll update when I get them up and running.
 
UPDATE:
I got the new Intel SSDs today, and it seems that was indeed the issue. The IO delay problems are now gone. Thanks for all the help!
It seems the new drives are SATA. Is that correct? All the recommendations seem to be for NVMe drives.
 
It seems the new drives are SATA. Is that correct? All the recommendations seem to be for NVMe drives.
Wasn't aware of that, but the SATA drives do the job. NVMe probably would've been overkill. I think the main problem is just that my drives didn't have PLP, so sync write performance sucks.
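
If anyone wants to see the PLP effect on their own hardware, a small synchronous-write benchmark with fio makes it visible. This is only a rough sketch; the target path and job parameters below are placeholders to adapt, not values from this thread:
Code:
# 4k synchronous single-queue writes - the pattern that exposes missing PLP.
# Point --filename at a file on the pool you want to test (placeholder path).
fio --name=sync-write-test \
    --filename=/zfs/fio-testfile --size=1G \
    --rw=write --bs=4k --ioengine=psync --sync=1 \
    --numjobs=1 --runtime=60 --time_based

Drives without PLP typically drop to a few hundred IOPS in this kind of test because every write has to reach the flash, while enterprise drives can acknowledge sync writes from their protected cache.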
 
Wasn't aware of that, but the SATA drives do the job. NVMe probably would've been overkill. I think the main problem is just that my drives didn't have PLP, so sync write performance sucks.

These seem to be MLC. I am suffering the same fate: I have a cluster with Samsung EVO drives, also consumer grade, and the same thing is happening. Any sustained write, like migrating VMs or changing ZFS storage, makes the machine unusable until the write is complete. I guess we will try a couple of these, since SATA will be easier than NVMe for the machines we are currently using.

As in your situation, this setup works fine most of the time... it's only the big syncs that hurt.

Thanks for responding so quickly.
 
Hello everyone,
We are experiencing similar issues with one of our Proxmox nodes, and I am wondering if adding vdevs would help to decrease the IO delay.

We are running a pool with the following configuration:
Code:
zfs
  mirror-0
    ata-Samsung_SSD_860_EVO_1TB
    ata-Samsung_SSD_860_EVO_1TB
  mirror-1
    ata-Samsung_SSD_860_EVO_1TB
    ata-Samsung_SSD_870_EVO_1TB
  mirror-2
    ata-Samsung_SSD_870_EVO_1TB
    ata-Samsung_SSD_870_EVO_1TB
  mirror-3
    ata-CT2000BX500SSD1_2TB
    ata-CT2000BX500SSD1_2TB

The system hosts Windows 10/11 VMs for testing purposes; they are usually used for a single test and then purged. The VMs become slow/laggy as soon as larger files are copied to one of them. Currently we run 13 VMs in parallel and handle fairly large amounts of test data. CPU and RAM are fine.
Are SATA enterprise SSDs suitable, and are they my only option? Or could I go the route of adding another mirror vdev to reduce the IO delay?
If enterprise SSDs are the only way, are any of these good options?
- Intel DC S4500
- Samsung PM893
- Samsung PM863
- Micron 5400 Pro

Thanks
Max
 
Short answer: yes, these drives are suitable. An alternative (if the budget is tight) would be used enterprise drives. Besides Samsung, Intel and Micron there are also Kingston's DC drives. For 13 VMs I would prefer a striped mirror setup (aka RAID10). Depending on your workload you could add 2x NVMe with PLP as a mirrored SLOG to your striped mirror.
 
From your replies I take away the following:
- the BX500s are useless for our application
- the Samsung EVOs are not great but do the trick?
- upgrading to Kingston DC, Samsung PM-series or Intel DC S4500 would be a good option

What would a SLOG do for me? I assume that in an all-flash array NVMe is the only sensible choice for it, since it is much faster than the SATA SSDs? That will be a bit challenging as I am running out of PCIe lanes :D

Cheers
Max
 
In production use I would never choose consumer SSDs for anything. They are a totally different beast and not suitable for ZFS loads. It's not only about PLP (power loss protection) but also the much larger cache and how writes are handled. A setup of enterprise SSDs in a striped mirror can be sufficient, but it is not a powerhouse compared to an equivalent setup of NVMe drives.

A SLOG attached to a ZFS RAID does the following:

- In ZFS, synchronous writes are first written to the ZFS Intent Log (ZIL) before being committed to the main pool
- The SLOG (Separate Log device) is an optional, dedicated device for storing this ZIL
- It doesn’t store data permanently — it’s a short-term landing zone for synchronous writes to ensure they are safely committed in case of a crash or power loss
- Workloads that generate many small synchronous writes (databases, NFS/SMB with sync=always, VM workloads) benefit the most

Why mirrored?

- The SLOG is a single point of failure for synchronous write integrity. If you have only one SLOG device and it fails, you risk losing the in-flight synchronous writes should the host also crash or lose power
- A mirrored SLOG provides redundancy for the ZIL and ensures that even if one NVMe dies, the other still contains valid, crash-consistent logs

Performance gains with SLOG in a RAID1+0 of SSDs:

- All synchronous writes still wait for ZIL commit before returning ACK to the client
- Without a dedicated SLOG, the ZIL is stored on the main pool → each sync write hits your RAID10 vdevs twice (once for the ZIL, once for the actual data)
- With a dedicated SLOG the ZIL writes happen on the NVMe mirror instead of on the main pool → freeing the RAID10 from that extra write burden
- Main pool can focus on normal data writes, improving throughput
- Latency is reduced because the NVMe SLOG completes the ZIL commit much faster
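
To make the SLOG part concrete: attaching a mirrored log vdev is a single command. A minimal sketch, assuming the pool is called zfs and using placeholder device names (substitute the stable /dev/disk/by-id paths of your actual NVMe drives):
Code:
# Add two NVMe drives (with PLP) as a mirrored SLOG to the existing pool "zfs"
zpool add zfs log mirror \
  /dev/disk/by-id/nvme-EXAMPLE_DRIVE_1 \
  /dev/disk/by-id/nvme-EXAMPLE_DRIVE_2

# Verify that the log vdev shows up next to the data mirrors
zpool status zfs

If it turns out not to help, a log vdev can be removed again with zpool remove, so trying it is fairly low-risk.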
 
In production use I would never choose consumer SSDs for anything. They are a totally different beast and not suitable for ZFS loads. It's not only about PLP (power loss protection) but also the much larger cache and how writes are handled. A setup of enterprise SSDs in a striped mirror can be sufficient, but it is not a powerhouse compared to an equivalent setup of NVMe drives.

A SLOG attached to a ZFS RAID does the following:

- In ZFS, synchronous writes are first written to the ZFS Intent Log (ZIL) before being committed to the main pool
- The SLOG (Separate Log device) is an optional, dedicated device for storing this ZIL
- It doesn’t store data permanently — it’s a short-term landing zone for synchronous writes to ensure they are safely committed in case of a crash or power loss
- Workloads that generate many small synchronous writes (databases, NFS/SMB with sync=always, VM workloads) benefit the most

Why mirrored?

- The SLOG is a single point of failure for synchronous write integrity. If you have only one SLOG device and it fails, you risk losing the in-flight synchronous writes should the host also crash or lose power
- A mirrored SLOG provides redundancy for the ZIL and ensures that even if one NVMe dies, the other still contains valid, crash-consistent logs

Performance gains with SLOG in a RAID1+0 of SSDs:

- All synchronous writes still wait for ZIL commit before returning ACK to the client
- Without a dedicated SLOG, the ZIL is stored on the main pool → each sync write hits your RAID10 vdevs twice (once for the ZIL, once for the actual data)
- With a dedicated SLOG the ZIL writes happen on the NVMe mirror instead of on the main pool → freeing the RAID10 from that extra write burden
- Main pool can focus on normal data writes, improving throughput
- Latency is reduced because the NVMe SLOG completes the ZIL commit much faster
Thank you very much! That answered the question for me.
One more thing I could not figure out yet: how large does the SLOG SSD need to be? It looks like a decent size such as 512G would be plenty. Is that correct?

Cheers
Max
 
Yes, SLOG devices don't need to be big. The SLOG only holds the ZIL data of roughly the last 5 seconds before it is committed to the main pool. Even for high-end storage, 64GB is more than enough.
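
As a rough back-of-the-envelope check (the throughput number below is an illustrative assumption, not a measurement from this pool):
Code:
# The SLOG only has to hold the sync writes of the transaction groups in flight:
#   size ≈ max sync write throughput × txg interval (zfs_txg_timeout, default 5 s) × ~3 txgs
# Assuming a 10 GbE ingest path (~1.25 GB/s):
#   1.25 GB/s × 5 s × 3 ≈ 19 GB
# So a 16-64 GB device or partition is plenty; a 512G SSD would mostly sit idle.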
 
Hello everyone,
We are experiencing similar issues with one of our Proxmox nodes, and I am wondering if adding vdevs would help to decrease the IO delay.

We are running a pool with the following configuration:
Code:
zfs
  mirror-0
    ata-Samsung_SSD_860_EVO_1TB
    ata-Samsung_SSD_860_EVO_1TB
  mirror-1
    ata-Samsung_SSD_860_EVO_1TB
    ata-Samsung_SSD_870_EVO_1TB
  mirror-2
    ata-Samsung_SSD_870_EVO_1TB
    ata-Samsung_SSD_870_EVO_1TB
  mirror-3
    ata-CT2000BX500SSD1_2TB
    ata-CT2000BX500SSD1_2TB

The system hosts Windows 10/11 VMs for testing purposes; they are usually used for a single test and then purged. The VMs become slow/laggy as soon as larger files are copied to one of them. Currently we run 13 VMs in parallel and handle fairly large amounts of test data. CPU and RAM are fine.
Are SATA enterprise SSDs suitable, and are they my only option? Or could I go the route of adding another mirror vdev to reduce the IO delay?
If enterprise SSDs are the only way, are any of these good options?
- Intel DC S4500
- Samsung PM893
- Samsung PM863
- Micron 5400 Pro

Thanks
Max

Were those drives ever trimmed?
What is the output of these commands?
Code:
zpool list -v
Code:
zpool status -t *pool_name*
example: zpool status -t rpool
 
Were those drives ever trimmed?
What is the output of these commands?
Code:
zpool list -v
Code:
zpool status -t *pool_name*
example: zpool status -t rpool
It doesn't matter whether these have been trimmed or not. With this kind of workload, consumer drives in ZFS mirrors won't get meaningfully faster either way.
 
It doesn't matter whether these have been trimmed or not. With this kind of workload, consumer drives in ZFS mirrors won't get meaningfully faster either way.
I know. I just asked because I was having super high IO delay and after trimming it got a lot better. Not enterprise-SSD territory, but still...
Anyway, the drives were never trimmed (1 year of use).
 
Code:
zpool list -v
NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zfs                                              4.53T  1.89T  2.64T        -         -    28%    41%  1.00x    ONLINE  -
  mirror-0                                        928G   803G   125G        -         -    50%  86.5%      -    ONLINE
    ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844429M   932G      -      -        -         -      -      -      -    ONLINE
    ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844377K   932G      -      -        -         -      -      -      -    ONLINE
  mirror-1                                        928G   456G   472G        -         -    35%  49.1%      -    ONLINE
    ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844464P   932G      -      -        -         -      -      -      -    ONLINE
    ata-Samsung_SSD_870_EVO_1TB_S6PUNM0T618251X   932G      -      -        -         -      -      -      -    ONLINE
  mirror-2                                        928G   414G   514G        -         -    31%  44.6%      -    ONLINE
    ata-Samsung_SSD_870_EVO_1TB_S6PUNX0RC03088K   932G      -      -        -         -      -      -      -    ONLINE
    ata-Samsung_SSD_870_EVO_1TB_S626NJ0R162083M   932G      -      -        -         -      -      -      -    ONLINE
  mirror-3                                       1.81T   264G  1.55T        -         -    14%  14.2%      -    ONLINE
    ata-CT2000BX500SSD1_2237E665E4D3             1.82T      -      -        -         -      -      -      -    ONLINE
    ata-CT2000BX500SSD1_2237E665E63F             1.82T      -      -        -         -      -      -      -    ONLINE

Code:
zpool status -t zfs
  pool: zfs
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:44:18 with 0 errors on Sun Aug 10 01:08:20 2025
config:

        NAME                                             STATE     READ WRITE CKSUM
        zfs                                              ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844429M  ONLINE       0     0     0  (untrimmed)
            ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844377K  ONLINE       0     0     0  (untrimmed)
          mirror-1                                       ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K844464P  ONLINE       0     0     0  (untrimmed)
            ata-Samsung_SSD_870_EVO_1TB_S6PUNM0T618251X  ONLINE       0     0     0  (untrimmed)
          mirror-2                                       ONLINE       0     0     0
            ata-Samsung_SSD_870_EVO_1TB_S6PUNX0RC03088K  ONLINE       0     0     0  (untrimmed)
            ata-Samsung_SSD_870_EVO_1TB_S626NJ0R162083M  ONLINE       0     0     0  (untrimmed)
          mirror-3                                       ONLINE       0     0     0
            ata-CT2000BX500SSD1_2237E665E4D3             ONLINE       0     0     0  (untrimmed)
            ata-CT2000BX500SSD1_2237E665E63F             ONLINE       0     0     0  (untrimmed)

errors: No known data errors

Looks like they were not trimmed yet - whoopsie.

I guess I should enable autotrim then?

Max
 
Edit: not sure if being in a mirror is a problem. It’s better to wait for other people’s input on this.
But,

You can do it manually with
Code:
zpool trim *pool-name*

and check the progress with
Code:
zpool status -t *pool-name*

You will see improvement.
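
Regarding the autotrim question: ZFS also has a pool-level autotrim property that trims freed blocks continuously instead of in one big pass. A minimal sketch, assuming the pool name zfs from your output:
Code:
# Trim freed blocks continuously as they are released
zpool set autotrim=on zfs
# Check the current setting
zpool get autotrim zfs

Running a full zpool trim occasionally is still considered good practice even with autotrim enabled.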
 
About autotrim, if you

Code:
cat /etc/cron.d/zfsutils-linux

You will see that TRIM and SCRUB are already scheduled, but in your case only SCRUB is being executed.
I have one node where TRIM was never executed either, but I could not figure out exactly why. It has to do with the script that is executed, which checks whether TRIM is supported.
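
If someone wants to dig into why the monthly job skips a pool, these are the places to look (the helper path is how Debian's zfsutils-linux package ships it; treat it as an assumption and adjust for your system):
Code:
# Show the scheduled TRIM/SCRUB cron entries
cat /etc/cron.d/zfsutils-linux
# Show the helper the TRIM job calls and the conditions it checks before trimming a pool
cat /usr/lib/zfs-linux/trim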
 
About autotrim, if you

Code:
cat /etc/cron.d/zfsutils-linux

You will see that TRIM and SCRUB are already scheduled, but in your case only SCRUB is being executed.
I have one node where TRIM was never executed either, but I could not figure out exactly why. It has to do with the script that is executed, which checks whether TRIM is supported.
You are right. The script is supposed to run TRIM once a month, but that does not happen. I will have to investigate why it does not work.
For now I was able to trigger TRIM manually, which seemingly helped. I will still change the SSDs to enterprise grade as soon as possible and, if possible, add the NVMe SSDs for a SLOG.

Thank you very much everybody for helping out!

Max