Server disk I/O delay at 100% during cloning and backup

Hello everyone,

I am experiencing a reproducible issue with Proxmox VE 9.0. After a fresh installation, I am using the default LVM-Thin datastore that PVE creates during setup.

Whenever I create a full clone from a VM template (disk format is raw), the I/O delay immediately spikes to 100%. As a result, other running VMs freeze completely and no longer respond. Even the QEMU guest agent does not respond anymore. The only way to recover is to issue a Stop command on the affected VMs.

Steps to reproduce:
  1. Fresh install of PVE 9.0
  2. Keep the default LVM-Thin datastore
  3. Create a VM template
  4. Perform a full clone from the template (disk = raw)
  5. I/O delay spikes to 100% → other VMs freeze → only “Stop” helps
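For reference, the same reproduction can also be driven from the CLI; this is only a sketch with example IDs (9000 = template, 101 = new VM, local-lvm = the default thin pool), and the kernel's I/O pressure figures can be watched alongside it:

  qm clone 9000 101 --full 1 --name clone-test --storage local-lvm
  # in a second shell: pressure-stall information for block I/O
  watch -n 1 cat /proc/pressure/io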
Additional notes:
  • The problem occurs on multiple reinstalled hosts, always the same result.
  • No special configuration changes were made to storage or VMs.
  • Using full clone, not linked clone.
  • Disk format = raw.
Has anyone else seen this behavior on PVE 9.0 with LVM-Thin?

Any suggestions or workarounds would be greatly appreciated.

Thanks in advance!


 
Please share lsblk -o+FSTYPE,MODEL. I want to see what kind of disks you're using.
 
NAME                   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda                      8:0    1    0B  0 disk
sdb                      8:16   0 22.4T  0 disk
├─sdb1                   8:17   0 1007K  0 part
├─sdb2                   8:18   0    1G  0 part /boot/efi
└─sdb3                   8:19   0 22.4T  0 part
  ├─pve-swap           252:0    0    8G  0 lvm  [SWAP]
  ├─pve-root           252:1    0  200G  0 lvm  /
  ├─pve-data_tmeta     252:2    0 15.9G  0 lvm
  │ └─pve-data-tpool   252:4    0 22.1T  0 lvm
  │   └─pve-data       252:5    0 22.1T  1 lvm
  └─pve-data_tdata     252:3    0 22.1T  0 lvm
    └─pve-data-tpool   252:4    0 22.1T  0 lvm
      └─pve-data       252:5    0 22.1T  1 lvm

 
Hmm, no proper model shown. Judging from the size and name I assume this is some kind of RAID or fake RAID? Can you check the model number of the drive(s) and what kind of RAID it is?
 
Yes, this is an HP ML350 Gen10.
It is equipped with an HP Smart Array P408i-a SR Gen10 RAID controller (2 GB cache, 8-port modular, RAID levels 0/1/10/5/50/6/60, HBA mode, with BBU).
There are 8 enterprise SSDs installed, configured as RAID6.
The RAID controller presents the RAID6 volume, which PVE uses as the datastore.
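If it helps, the array layout and cache settings can be dumped with HPE's ssacli tool; the slot number below is only an assumption, adjust it for your controller:

  ssacli ctrl all show status
  ssacli ctrl slot=0 logicaldrive all show detail
  ssacli ctrl slot=0 physicaldrive all show detail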
 
I have now removed my hard drives from the RAID controller and connected them directly
After that, I created a ZFS pool
I'm still able to reproduce the same behavior
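To see where the time goes while the clone runs, I watch the pool and per-device utilization with the standard tools, roughly like this (nothing PVE-specific; the pool name is whatever you created):

  zpool iostat -v 1
  # per-device latency/utilization (iostat is in the sysstat package)
  iostat -x 1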
Can anyone else reproduce this issue as well?

 
I'm experiencing the same issue: the I/O utilization goes up to 100% when restarting VMs, apparently whenever some load is generated.
Is anyone else having this problem?
Can anyone reproduce this behavior?

The problem occurs with both an LVM thin volume and ZFS.

 
I'm completely out of ideas. I'm seeing the same issue with the backup, and it occurs consistently.
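The only mitigation I can still think of trying is throttling the backup itself: vzdump accepts a bandwidth limit in /etc/vzdump.conf (value in KiB/s; the number below is only an example), although I have not yet confirmed whether that avoids the freezes:

  # /etc/vzdump.conf
  bwlimit: 100000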
Has anyone else experienced or can reproduce this behavior?

 
I'm running into the same problem again, and I'm close to not being able to properly use Proxmox because my VMs keep freezing.
Can anyone confirm this behavior?
Is anyone else experiencing this, or has someone maybe already found a solution?
 
Hello,

Running into the same issue here with PVE 9.0.11 on an HP DL360 Gen10 with a P408i-a SR controller and 4 consumer SSDs in RAID10 (used as the PVE datastore).

Whenever I migrate or restore a VM from backup, the IO usage spikes to 90% for minutes at a time (depending on how large the VM is) and all running VMs freeze (some recover, some do not and require a forced stop).

This issue seems to be worse when migrating/restoring larger VMs (over 32GB), because that's when the IO usage stays high for 3-10 minutes at a time.

Haven't been able to find a solution for this yet, but I can consistently replicate the issue.
Will try to troubleshoot with iotop-c, but at this time I do not have the delayacct kernel arg active at boot.
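In case anyone else wants to check with iotop-c: on recent kernels, delay accounting can apparently also be switched on at runtime instead of via the boot parameter (untested on my side):

  sysctl -w kernel.task_delayacct=1
  # then show only processes currently doing I/O
  iotop-c -o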

Edit: IO load during backup seems normal, around 5-6%

Edit 2: iotop-c shows dd using 100% of IO (for about 2 minutes after copying the data from the remote host) during a migration from PVE8 to PVE9 with the VM powered off
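If the dd stream itself is what saturates the array, a cluster-wide migration bandwidth limit might be worth a test; as far as I know this can be set in /etc/pve/datacenter.cfg (or Datacenter -> Options), value in KiB/s, and the number below is only an example:

  bwlimit: migration=100000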
 
I know and expect some issues using consumer SSDs (these are 2TB SSDs with DRAM cache).

What I can't explain is why the same 4 SSDs in RAID10 show no such issue, regardless of VM size, on an HP DL360 Gen9 with a Smart Array P840 running PVE 8, or on an HP DL360 Gen8 with a Smart Array P420i, also running PVE 8.
To me, at least right now, it looks like something specific to HP Gen10 with the P408i RAID controller and PVE 9...
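One difference that might matter between these controllers is the write-cache configuration; for anyone who wants to compare, something like this (the slot number is only an assumption) shows the cache ratio, battery status and drive write cache on the P408i:

  ssacli ctrl slot=0 show detail | grep -i -E 'cache|battery'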
 
No ZFS, the drives are not exposed to PVE.
Just the RAID controller running in Mixed mode, using a RAID10 array of 4 Samsung 870 2TB SSDs as datastore.

I have a couple of Gen10 machines with the same P408i controller I'm not using at the moment.
I'll spin up a PVE cluster with them to test different storage configurations and settings and report here in a couple of days.
Thanks
 
If these are Samsung QVO drives, they are not suitable at all.
QLC drives write very slowly once their SLC cache is exhausted (which is different from the DRAM cache).
There are plenty of topics on the forum about this, mainly with ZFS because it exposes the slowness sooner.
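A quick way to check whether a drive's sustained write speed collapses is a sequential fio run larger than any SLC cache; the path and size below are only examples, and the test writes real data, so point it at scratch space:

  fio --name=sustained --filename=/var/lib/vz/fio-test.bin --ioengine=libaio --direct=1 --rw=write --bs=1M --size=32G
  rm /var/lib/vz/fio-test.bin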
 
All drives are the EVO variant, as I know to stay away from the QVO ones.
I will install PVE 8 and 9 on the same Gen10 hardware and compare between them, and I will also install PVE 9 on Gen9 hardware.
Will post some conclusions here as soon as possible.
Thank you
 
I’m experiencing the same problem with my test host.
This host has the following hardware:

  • Intel Xeon CPU E5-2698 v3 – 2.30GHz – 2 sockets
  • 600 GB RAM
  • 6 × 2 TB enterprise SSDs
  • 1 × 256 GB enterprise SSD (for the PVE operating system)

The six drives are configured as a ZFS RAIDZ2 pool (comparable to RAID6).
I can reproduce the problem on this system as well.
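In case the pool layout or dataset settings matter for reproducing this, this is roughly what I check; the pool/dataset name (tank/vmdata) is only an example, mine differs:

  zpool status
  zfs get recordsize,compression,sync tank/vmdata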