Server disk I/O delay 100% during cloning and backup

After 2 days of testing I wasn't able to reproduce my IO load issue on the test systems.

I'm running the same hardware as production: HP DL360 Gen10 w/ P408i-a, HP DL360 Gen9 w/ P440ar and HP DL360p Gen8 w/ P420i (4 x 2TB 870 EVO in RAID10, Smart Path enabled).
And the exact same PVE versions: 8.3.0 (which I'm migrating from) and 9.0.11

So far I haven't seen the same "dd" process pop up at the end of the migration, use 100% of I/O, freeze the system, and crash some of the VMs, no matter what I tried: cloning, live migration, or migration of shut-down VMs

Also, my production PVE 9.0.11 HP DL360 Gen10 w/ P408i-a has no issues when running backup tasks and/or cloning, only when migrating VMs from PVE 8.3.0 or restoring them from the not-yet-updated PBS v3.4.6

It's weird... I'll keep trying to replicate and diagnose the issue and update here if I make any progress.

Thanks!
 
I’m still having this problem and I’m not really making any progress.
Does anyone have any ideas on what else I could try to get closer to a solution?
I’m completely stuck and can’t figure it out on my own.
 
Isn't this what happens when cloning with consumer NVMe?

kioxia kbg40zns256gb (consumer NVMe): IO Pressure Stall 60%

This doesn't happen with enterprise SAS.

husmm3280ass200 (enterprise SAS): IO Pressure Stall 1.4%

Did you purchase an enterprise SSD?

*The above shows the results of cloning the same virtual machine onto each SSD after reviewing this thread.
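
One way to sample these stall figures yourself while a clone or backup is running, assuming a kernel with PSI enabled (the exact tool doesn't matter):

# Show the kernel's I/O pressure stall information during the clone/backup
cat /proc/pressure/io
# "some" = share of time at least one task was stalled on I/O
# "full" = share of time all non-idle tasks were stalled on I/O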
 
I’m still having this problem and I’m not really making any progress.
Does anyone have any ideas on what else I could try to get closer to a solution?
I’m completely stuck and can’t figure it out on my own.
Were these SSDs used before? I had very similar symptoms with ZFS; it turned out that unless you explicitly set the org.debian:periodic-trim pool property to enable, the monthly trim cron job won't do anything with SATA and SAS disks. The default behavior, with the property set to auto, only trims NVMe disks. You can manually start a pool trim with zpool trim <poolname>, and you can check when the pools were last trimmed with zpool status -t.
I've seen that you mentioned ESXi on the same hardware; on VMFS, automatic space reclamation is enabled by default.
 
The exact SSD model has not been specified yet; please provide the exact model.

Furthermore, using volumes from the HP Smart Array P408i-a SR Gen10 with ZFS is not only not recommended, it is explicitly discouraged.

*Nobody deliberately builds an environment that is explicitly not recommended.
 
Were these SSDs used before? I had very similar symptoms with ZFS; it turned out that unless you explicitly set the org.debian:periodic-trim pool property to enable, the monthly trim cron job won't do anything with SATA and SAS disks. The default behavior, with the property set to auto, only trims NVMe disks. You can manually start a pool trim with zpool trim <poolname>, and you can check when the pools were last trimmed with zpool status -t.
I've seen that you mentioned ESXi on the same hardware; on VMFS, automatic space reclamation is enabled by default.
Yes, the SSDs have been in use the whole time
The problem does not occur only with ZFS; it also happens with hardware RAID (yes, I know hardware RAID is not recommended, but this is the hardware I have)

I have not explicitly enabled TRIM
Apart from the package repositories and the analysis tools mentioned in my previous posts here, PVE is still in its default (factory) state
 
The exact SSD model has not been specified yet; please provide the exact model.

Furthermore, using volumes from the HP Smart Array P408i-a SR Gen10 with ZFS is not only not recommended, it is explicitly discouraged.

*Nobody deliberately builds an environment that is explicitly not recommended.
I don't have the hardware with me right now, but these are definitely enterprise SSDs, not consumer-grade drives
As I already mentioned in the previous post, I’m using the hardware exactly as it is

If the RAID controller is in HBA mode (which it is in the ZFS setup), what impact is it supposed to have?
In HBA mode it should just pass the disks through 1:1 and not do anything on its own

I need the RAID controller in order to connect the drives to the server

How do you know that this RAID controller is discouraged?
Is there a list of discouraged/unsupported RAID controllers somewhere?
 
Please understand that hardware RAID is not recommended when building a ZFS system.

In a sense, it's close to common sense. You should look into it.

Edit: I'm correcting this because I just learned the controller is being used in HBA mode.
I don't know what results you'd get in HBA mode either; if the controller doesn't handle I/O itself, it should behave the same as a plain HBA.
 
Yes, the SSDs have been in use the whole time
The problem does not occur only with ZFS; it also happens with hardware RAID (yes, I know hardware RAID is not recommended, but this is the hardware I have)

I have not explicitly enabled TRIM
Apart from the package repositories and the analysis tools mentioned in my previous posts here, PVE is still in its default (factory) state
If at least as much data has been written to the SSDs as their total capacity without any trimming, this kind of slowdown is expected regardless of the storage subsystem in use. With LVM on a hardware RAID controller, I would create a separate logical volume for the PVE installation and one for the VM data, blkdiscard the VM data block device first, and only then create the LVM-thin storage on top of it. Setting issue_discards = 1 in lvm.conf could also help, as would enabling discard on the individual VM disks. The latter should be done with ZFS too, so that the trim/discard commands from the VM can reach the host's storage system.
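Roughly, as a sketch (the device path, VM ID, and storage name below are made-up examples, not your actual setup):

# Discard every block on the still-empty device you intend to use for VM data
blkdiscard /dev/sdb

# /etc/lvm/lvm.conf, "devices" section: pass discards down when LVs are removed or reduced
#   issue_discards = 1

# Enable discard on an existing VM disk (example: disk scsi0 of VM 100 on storage local-lvm)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,discard=on

# Inside the guest, trim the mounted filesystems so the discards actually reach the host
fstrim -av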
 
Please understand that hardware RAID is not recommended when building a ZFS system.

In a sense, it's close to common sense. You should look into it.
It may be true that a hardware RAID controller is generally “not recommended” in some setups
However, as I already explained, it is required in my case in order to connect the drives directly to the server

Also, I can reproduce the exact same problem on a desktop PC with two 2.5" SATA enterprise SSDs
There is no RAID controller installed there, and the result is identical

From my perspective, this means the issue cannot be caused by the RAID controller - regardless of whether it is recommended or not
 
If at least as much data has been written to the SSDs as their total capacity without any trimming, this kind of slowdown is expected regardless of the storage subsystem in use. With LVM on a hardware RAID controller, I would create a separate logical volume for the PVE installation and one for the VM data, blkdiscard the VM data block device first, and only then create the LVM-thin storage on top of it. Setting issue_discards = 1 in lvm.conf could also help, as would enabling discard on the individual VM disks. The latter should be done with ZFS too, so that the trim/discard commands from the VM can reach the host's storage system.
No, the drives are 4 TB each and there are 8 drives installed in total
The VM data itself is only around 500 GB

Also, discard is enabled on the virtual disks of my VMs
 
I am not familiar with enterprise-grade SSDs in capacities such as 256GB, 2TB, or 4TB, so I will refrain from commenting unless specific models are provided.
 
I have set a bwlimit
With that, I can prevent the utilization from reaching 100%
With the following configuration, it only goes up to about 60%:

--bwlimit clone=500000,migration=500000,move=500000,restore=500000
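
For completeness, a minimal sketch of where such a limit can live, assuming it is set cluster-wide (it can also be passed per operation or per storage); the values are in KiB/s:

# /etc/pve/datacenter.cfg
bwlimit: clone=500000,migration=500000,move=500000,restore=500000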
 
No, the drives are 4 TB each and there are 8 drives installed in total
The VM data itself is only around 500 GB

Also, discard is enabled on the virtual disks of my VMs
Sorry, I wasn't clear enough :) I meant the data written to the disks over their entire lifetime up to now.
If all of the blocks have been written to, without discarding the blocks that contain deleted data, then every new write first requires the SSD controller to erase a block before the actual write can start, which increases latency.
 
Sorry, I wasn't clear enough :) I meant the data written to the disks over their entire lifetime up to now.
If all of the blocks have been written to, without discarding the blocks that contain deleted data, then every new write first requires the SSD controller to erase a block before the actual write can start, which increases latency.
Okay, that makes sense to me

Can you tell me how I can enable that?
I’d like to test whether this fixes the problem
 
Okay, that makes sense to me

Can you tell me how I can enable that?
I’d like to test whether this fixes the problem
You can test it with the zpool trim <poolname> command. It shouldn't take much time, and you can check its status with zpool status <poolname> -t. If that was the problem, you can enable periodic trimming with zfs set org.debian:periodic-trim=enable <poolname>.
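Put together, assuming a pool named rpool (replace it with your pool's name):

# Start a manual trim and check its progress
zpool trim rpool
zpool status -t rpool

# If that fixes the stalls, enable the periodic trim via the property mentioned above
zfs set org.debian:periodic-trim=enable rpool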
 
You can test it with the zpool trim <poolname> command. It shouldn't take much time, and you can check its status with zpool status <poolname> -t. If that was the problem, you can enable periodic trimming with zfs set org.debian:periodic-trim=enable <poolname>.
Thank you very much!
I will test it and provide feedback