ZFS SSD High IO Wait

BaronVonChickenPants

Hello Team,
I'm trying to get to the bottom of underwhelming disk performance with ZFS, particularly random spikes of high IO wait.
I know this topic has been done to death. I've tried tuning parameters, I have CPU and RAM to spare, and I'm on SSD disks, but I just can't work out why the whole server chokes.

Dell R730, 2 x Xeon E5-2698 v4, 512 GB RAM, with an LSI 3008 running IT-mode firmware (H330?)
2 ZFS arrays:
3 x Dell 400 GB SAS SSD - RAIDZ1
4 x Crucial 4 TB SATA SSD - RAIDZ2

The server consists mostly of containers with low load and a few Windows VMs, one of which is an MS Exchange server with 5 active users.
I have disabled swap on all Windows VMs and ensured sufficient memory allocation.
Almost all servers reside on the SATA Z2 array.

Initially I thought the issue may have been high activity of small database writes for logs from the self-hosted UniFi and UISP servers, so these were moved to the SAS Z1 array, but there was no noticeable change in performance.

The first screenshot shows the idle server with random high IO spikes; the last 3 screenshots show a Windows update running on the Exchange server bringing the whole server to a standstill while the CPUs are almost idle.

Thanks for any advice you can offer.
 

Attachments

  • Screenshot 2025-09-12 141744.png (67.2 KB)
  • Screenshot 2025-09-15 150817.png (174.8 KB)
  • Screenshot 2025-09-15 152547.png (206 KB)
  • Screenshot 2025-09-15 152555.png (209.7 KB)
Almost all servers reside on the SATA Z2 array.

I would try again with "Enterprise Class SSDs" with PLP / "Power-Loss Protection".

...and with mirrors, not RAIDZ2 - as RAIDZ2 gives you only the IOPS of a single device.

Of course SATA is massively slower than PCIe --> if possible, switch technology...

Disclaimer: just random thoughts... before my first coffee :-)
 
It's not only the bottleneck of IOPS but also:

- Parity must be recalculated and rewritten for every small change

- Sync-write-heavy applications (e.g., databases) suffer massively under RAIDZ

RAIDZ should only be used for "cold" storage. For Windows VMs - especially Exchange - you should use striped mirrors on enterprise drives (as @UdoB already mentioned).

Other performance problems can arise if your volblocksize doesn't match the VM's cluster size. For example, Windows formats system disks at 4k by default; if your dataset/volblocksize is set to 128k or 256k, you'll end up with massive overhead (write amplification) - in combination with RAIDZ, a performance killer. If NVMe is out of scope as the main storage type, consider using 2x small NVMe drives as a mirrored SLOG for a striped-mirror setup on SATA enterprise drives.
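If you want to check for such a mismatch, a quick sketch (the storage ID "SSD-Z2" and zvol name "vm-100-disk-0" are placeholders for your own names):

Code:
# Show the volblocksize of an existing VM disk (zvol):
zfs get volblocksize SSD-Z2/vm-100-disk-0

# volblocksize is fixed when a zvol is created; to use e.g. 16k for new
# disks, set the block size on the Proxmox storage definition and then
# recreate or move the disks:
pvesm set SSD-Z2 --blocksize 16k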
 
Considering how much better enterprise drives do at small and sync writes, I'd rather use them for the SLOG. It's probably not needed if the main drives are DC ones already; I'm not sure normal NVMe drives as a SLOG would help that much to accelerate a DC-drive pool.
@BaronVonChickenPants I have some IO troubleshooting tips here you might be interested in.
Would you mind sharing lsblk -o+FSTYPE,MODEL and zpool status so we can see your storage layout and what models those drives are?
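If you can catch one of the spikes live, per-vdev latency usually shows which device is stalling. A sketch (5-second interval, adjust to taste):

Code:
# Per-vdev IO statistics with latency columns, refreshed every 5 s
# (-v per-vdev breakdown, -l latency, -y skip the since-boot summary):
zpool iostat -vly 5

# Device-level utilisation and average wait times (sysstat package):
iostat -x 5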
 
I had no idea the performance difference was so significant.

Code:
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS FSTYPE     MODEL
loop0      7:0    0   500G  0 loop             ext4
sda        8:0    0 372.6G  0 disk             zfs_member PX05SMB040Y
├─sda1     8:1    0  1007K  0 part
├─sda2     8:2    0     1G  0 part             vfat
└─sda3     8:3    0 371.6G  0 part             zfs_member
sdb        8:16   0 372.6G  0 disk             zfs_member PX05SMB040Y
├─sdb1     8:17   0  1007K  0 part
├─sdb2     8:18   0     1G  0 part             vfat
└─sdb3     8:19   0 371.6G  0 part             zfs_member
sdc        8:32   0 372.6G  0 disk             zfs_member PX05SMB040Y
├─sdc1     8:33   0  1007K  0 part
├─sdc2     8:34   0     1G  0 part             vfat
└─sdc3     8:35   0 371.6G  0 part             zfs_member
sdd        8:48   0   3.6T  0 disk                        CT4000BX500SSD1
├─sdd1     8:49   0   3.6T  0 part             zfs_member
└─sdd9     8:57   0     8M  0 part
sde        8:64   0   3.6T  0 disk                        CT4000BX500SSD1
├─sde1     8:65   0   3.6T  0 part             zfs_member
└─sde9     8:73   0     8M  0 part
sdf        8:80   0   3.6T  0 disk                        CT4000BX500SSD1
├─sdf1     8:81   0   3.6T  0 part             zfs_member
└─sdf9     8:89   0     8M  0 part
sdg        8:96   0   3.6T  0 disk                        CT4000BX500SSD1
├─sdg1     8:97   0   3.6T  0 part             zfs_member
└─sdg9     8:105  0     8M  0 part

Code:
  pool: SSD-Z2
 state: ONLINE
  scan: scrub repaired 0B in 02:20:25 with 0 errors on Sun Sep 14 02:44:32 2025
config:

        NAME                                  STATE     READ WRITE CKSUM
        SSD-Z2                                ONLINE       0     0     0
          raidz2-0                            ONLINE       0     0     0
            ata-CT4000BX500SSD1_2509E9ABAFC2  ONLINE       0     0     0
            ata-CT4000BX500SSD1_2509E9ABAEB2  ONLINE       0     0     0
            ata-CT4000BX500SSD1_2509E9ABAFAC  ONLINE       0     0     0
            ata-CT4000BX500SSD1_2509E9ABA89E  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:03:40 with 0 errors on Sun Sep 14 00:27:56 2025
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz1-0                        ONLINE       0     0     0
            scsi-350000397bc882aa5-part3  ONLINE       0     0     0
            scsi-350000397bc882aa9-part3  ONLINE       0     0     0
            scsi-350000397bc882a8d-part3  ONLINE       0     0     0

errors: No known data errors
 
I have used NVMe for SLOG in the past, but from what I've read adding NVMe to the R730 is tricky; only specific drives are supported in specific configurations.

I was hoping some half-decent SATA SSDs would be good enough, but it would appear not.

Edit: Upon further reading it seems this is only an issue if you want to boot from NVMe; SLOG or VM storage would be fine with generic PCIe-to-NVMe adaptors.
 
RAIDz2 is not at all like hardware RAID6 (with a BBU). Since you only have 4 drives, why not change it to a stripe of mirrors (which is like RAID10)? You'll improve the IOPS a lot.
However, BX500 drives are a poor choice with ZFS due to the QLC flash and might give you write errors due to time-outs during sustained writes. Search for QLC on the forum and read about all the people who have a hard time believing they wasted their money (and the tips about enterprise drives with PLP).
Adding a SLOG and/or L2ARC usually does not help, but a special device sometimes might - though probably not in your case, with all SSDs already.

EDIT: I assumed BX500 4TB used QLC (like other BX500) but it looks like it's TLC instead (although it's not clearly stated on Crucial's website).
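For what it's worth, the rebuild itself would look roughly like this sketch (it destroys the pool, so migrate the guests off and back afterwards; device paths taken from your zpool status output):

Code:
# WARNING: destroys the pool and everything on it - back up / migrate first.
zpool destroy SSD-Z2

# Recreate as a stripe of two mirrors (RAID10-like); ashift=12 assumes 4k sectors.
zpool create -o ashift=12 SSD-Z2 \
  mirror /dev/disk/by-id/ata-CT4000BX500SSD1_2509E9ABAFC2 /dev/disk/by-id/ata-CT4000BX500SSD1_2509E9ABAEB2 \
  mirror /dev/disk/by-id/ata-CT4000BX500SSD1_2509E9ABAFAC /dev/disk/by-id/ata-CT4000BX500SSD1_2509E9ABA89E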
 
The theory was double redundancy against hardware failure, but the IO performance is such a punishment that I'd be better off moving to 2 separate mirrors.

Please use code blocks rather than quotes so the formatting is preserved.
Noted - I started with code, then second-guessed myself and changed it to quotes.

Back to the drawing board to figure out the best way forward....
 
A SLOG on an NVMe mirror gives you benefits if your VM storage is set to sync=always (commits are written much faster).
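A minimal sketch of that setup (the NVMe device paths and the zvol name are placeholders):

Code:
# Attach a mirrored SLOG to the existing pool:
zpool add SSD-Z2 log mirror /dev/nvme0n1 /dev/nvme1n1

# Route all writes for a VM disk through the ZIL (and thus the SLOG):
zfs set sync=always SSD-Z2/vm-100-disk-0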
 
OK, so if I were to start again with, for example, 4 x Micron 7450 3.84TB for high performance, and keep the existing disks for less demanding uses:

What would be the best practice for configuring the Micron array?
My instinct says RAIDZ1: I don't need the speed of a striped mirror, and I "shouldn't" need the redundancy of Z2 with enterprise hardware.
Or should I just have 2 mirror arrays?
 
Just out of curiosity, please post the output of

Code:
zpool list -v

and

Code:
zpool status -t
 
Code:
 zpool list -v
NAME                                   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SSD-Z2                                14.5T  2.29T  12.3T        -         -    26%    15%  1.00x    ONLINE  -
  raidz2-0                            14.5T  2.29T  12.3T        -         -    26%  15.8%      -    ONLINE
    ata-CT4000BX500SSD1_2509E9ABAFC2  3.64T      -      -        -         -      -      -      -    ONLINE
    ata-CT4000BX500SSD1_2509E9ABAEB2  3.64T      -      -        -         -      -      -      -    ONLINE
    ata-CT4000BX500SSD1_2509E9ABAFAC  3.64T      -      -        -         -      -      -      -    ONLINE
    ata-CT4000BX500SSD1_2509E9ABA89E  3.64T      -      -        -         -      -      -      -    ONLINE
rpool                                 1.09T   391G   721G        -         -    22%    35%  1.00x    ONLINE  -
  raidz1-0                            1.09T   391G   721G        -         -    22%  35.1%      -    ONLINE
    scsi-350000397bc882aa5-part3       372G      -      -        -         -      -      -      -    ONLINE
    scsi-350000397bc882aa9-part3       372G      -      -        -         -      -      -      -    ONLINE
    scsi-350000397bc882a8d-part3       372G      -      -        -         -      -      -      -    ONLINE

Code:
zpool status -t

  pool: SSD-Z2
 state: ONLINE
  scan: scrub repaired 0B in 02:20:25 with 0 errors on Sun Sep 14 02:44:32 2025
config:

        NAME                                  STATE     READ WRITE CKSUM
        SSD-Z2                                ONLINE       0     0     0
          raidz2-0                            ONLINE       0     0     0
            ata-CT4000BX500SSD1_2509E9ABAFC2  ONLINE       0     0     0  (trim unsupported)
            ata-CT4000BX500SSD1_2509E9ABAEB2  ONLINE       0     0     0  (trim unsupported)
            ata-CT4000BX500SSD1_2509E9ABAFAC  ONLINE       0     0     0  (trim unsupported)
            ata-CT4000BX500SSD1_2509E9ABA89E  ONLINE       0     0     0  (trim unsupported)

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:03:40 with 0 errors on Sun Sep 14 00:27:56 2025
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz1-0                        ONLINE       0     0     0
            scsi-350000397bc882aa5-part3  ONLINE       0     0     0  (untrimmed)
            scsi-350000397bc882aa9-part3  ONLINE       0     0     0  (untrimmed)
            scsi-350000397bc882a8d-part3  ONLINE       0     0     0  (untrimmed)

errors: No known data errors
 
Yeah, you never trimmed the SSDs.
Trimming will help a little with the IO wait, but as many have mentioned already, nothing beats enterprise drives.
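For reference, trimming is a one-liner per pool (note your SSD-Z2 members report "trim unsupported" above, so this may only do anything on rpool):

Code:
# One-off TRIM of all free space in the pool:
zpool trim rpool

# Have ZFS issue TRIMs automatically as blocks are freed:
zpool set autotrim=on rpool

# Watch progress / per-device support:
zpool status -t rpool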
 
And I have to ask before committing the money: with the following in mind, what advantages am I actually going to see from enterprise hardware vs high-end desktop TLC like the FireCuda, 990 Pro, or T500?

- The R730xd is limited to PCIe 3.0 x16, bifurcated to x4 per drive
- I have redundant PSUs with 3000 VA UPSs and expanded battery packs

Performance specs are comparable, in some cases better, with the desktop hardware.

I don't mind spending the money, I just want to make sure I can justify it.
 
High peak speeds on consumer NVMe look great in benchmarks but don't matter in real workloads. Enterprise SSDs are built for consistency, with stable performance even under sustained load, while desktop drives quickly drop off once their cache is exhausted. They also deliver far higher endurance (DWPD), full power-loss protection, and better error handling. Enterprise firmware is validated for 24/7 use, and advanced monitoring makes failures predictable. In short: consumer NVMe drives are fast on paper; enterprise drives are fast all the time - and that's what counts in a server.

Take a Samsung 990 Pro 2TB (typical "high end" consumer TLC, no big difference to a FireCuda or similar drives) vs. an Intel P4610 1.6TB (enterprise NVMe):

Endurance: 990 Pro = ~1,200 TBW total (≈0.3 DWPD for 5 years). P4610 = ~12,000 TBW (≈3 DWPD for 5 years). → That's 10x higher endurance.

Sustained writes: 990 Pro starts at ~6,000 MB/s but can drop below 1,000 MB/s once the SLC cache is gone. P4610 delivers a steady ~3,000 MB/s 24/7 without falling off.

Power-loss protection: 990 Pro = none. P4610 = full PLP with capacitors, so in-flight data isn't lost.

Latency consistency: 990 Pro can spike to 10-50 ms under load. P4610 stays in the low 100 µs range even at saturation.
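If you want to see this on your own hardware before spending the money, a small fio sync-write test makes the gap obvious. A sketch (the test file path is a placeholder - delete the file afterwards):

Code:
# 4k random writes with an fsync after every IO - the pattern that PLP
# drives absorb in their capacitor-backed cache and consumer drives choke on:
fio --name=synctest --filename=/SSD-Z2/fio.test --size=1G \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --ioengine=psync --fsync=1 --runtime=60 --time_based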
 
@cwt

Thanks for taking the time to break it all down in such detail, I really appreciate it.
For reference: @UdoB explained why RAIDZ is a bad idea for VM storage here:

https://forum.proxmox.com/threads/fabu-can-i-use-zfs-raidz-for-my-vms.159923/

Thanks, I'll have a read while I wait for my enterprise drives to arrive.