Please help me solve high IO delay

dragon73 · New Member · Nov 18, 2024
Hello everyone,

I’ve been facing a persistent issue for some time now, and I would greatly appreciate advice from more experienced members on how to resolve it. I’ve tried Googling and experimenting with different solutions, but so far, nothing has worked.

The problem is the high I/O delay I experience, which completely freezes all the VMs running on my server. This typically happens during moderately intensive disk operations. For example, if I use `yt-dlp` inside one of the Linux VMs to download a slightly longer YouTube video and then copy it elsewhere (e.g., to a remote share or server), the I/O delay often jumps to 100% and stays like that for a couple of minutes, making all the VMs unusable. This delay sometimes occurs immediately after the download or during the file transfer, causing the entire system to freeze for several minutes.

Here’s my setup:
- Proxmox version: 8.1.4 (but it was the same with version 7)
- Hardware: Intel i7-2600 CPU, 32 GB RAM, a mix of SSDs and HDDs.
- Storage setup:
  - Proxmox is installed on a ZFS pool of two mirrored 128 GB consumer-grade SSDs.
  - VMs run on another ZFS pool (`zfs-storage`) with two mirrored 1 TB consumer-grade SSDs.
  - An additional HDD is formatted with ext4 and mounted inside one of the VMs.

At one point, I added an Intel SSDSC2KB48 SSD as a ZIL drive for the `zfs-storage` pool, hoping to alleviate the I/O delay bottleneck. Unfortunately, this did not resolve the issue.
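(From what I've read, a separate log device only helps synchronous writes, so an asynchronous bulk download/copy may bypass it entirely. This is roughly how I watch per-device activity during a transfer to see whether the log SSD is used at all; the pool name is from my setup above.)

```
# Per-vdev activity for the VM pool, refreshed every second;
# the "logs" section shows whether the SLOG SSD actually receives writes.
zpool iostat -v zfs-storage 1
```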

I understand that the CPU and memory aren’t particularly high-end. However, this is a home server currently running only:

1. A VM with Incredible PBX.
2. A VM with Ubuntu, which is used only occasionally.
3. A CT running a few docker containers.

I chose ZFS for its redundancy benefits, knowing that it typically requires more RAM than my setup offers. I love Proxmox, but these freezes are discouraging me from using it as much as I would like. Would it make sense to switch to a setup with software RAID and then manually install Proxmox on top of it, instead of using the automated installer and opting for ZFS?

Any advice on resolving these issues would be highly appreciated. Thank you!
 

Attachments

  • io_delay.JPG (67.6 KB)
There's a good chance you are running into the "don't use consumer-grade SSDs for ZFS" problem. If your SSDs run out of cache (which happens very quickly), data has to be written directly. Besides the faster wear-out, you'll experience delays and sluggish performance.

What SSD models do you use?
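If you're not sure what's in the box, something like this lists the models and gives a rough idea of the wear (the device name is just an example):

```
# Drive models, sizes and whether they are rotational (ROTA=1 means spinning HDD)
lsblk -d -o NAME,MODEL,SIZE,ROTA
# Full SMART output for one drive; look for wear / "percentage used" attributes
smartctl -a /dev/sda
```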
 
They are definitely ordinary consumer-grade drives: the smaller ones used for the system are Patriot Burst, and the bigger ones for the storage are Patriot P210.

Didn't think ZFS was so picky about it, especially because the Proxmox ISO didn't offer any other type of software RAID, so I thought it was only natural to choose ZFS. :oops:
 
You may want to look into your ARC size. (It may be better to increase its max size if it is capped.)

I use consumer-grade SSDs with ZFS on my server as well. (Since that is what my hoster put into the server.)

When I write large amounts of data and the disks cannot keep up, ZFS starts buffering in the ARC until it is able to write everything out to the SSDs.
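If you want to check where you currently stand, these read the live ARC size and its limits on the PVE host (standard OpenZFS paths, nothing Proxmox-specific):

```
# Current ARC size and its min/max limits, in bytes
awk '/^size|^c_min|^c_max/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
# The configured cap; 0 means "use the default" (half of the RAM)
cat /sys/module/zfs/parameters/zfs_arc_max
```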
 
if I use `yt-dlp` inside one of the Linux VMs to download a slightly longer YouTube video and then copy it elsewhere (e.g., to a remote share or server), the I/O delay often jumps to 100% and stays like that for a couple of minutes, making all the VMs unusable. This delay sometimes occurs immediately after the download or during the file transfer, causing the entire system to freeze for several minutes.
download = writing, AND THEN copying it elsewhere to a remote target = reading. So this is either the ZFS ARC problem on consumer-grade SSDs, or the write to the remote target ...
 
How much RAM does your host have?

The ARC needs to be big enough to buffer the files it cannot yet write to the SSD.
So if your total file size is 5 GB but the ARC is smaller than that, the data cannot be held in the ARC.
Also, when calculating the needed ARC size, allow for 2-3 GB of overhead, as the ARC will also store frequently used data to speed up access times. (There is a good reason why the default ARC size is 50% of the total RAM.)

The rule of thumb is:
[base of 2GB] + [1 GB/TB of storage] (This is minimum and maximum size)
https://forum.proxmox.com/threads/rule-of-thumb-how-much-arc-for-zfs.34307/

And my rule of thumb is:
[base of 2GB] + [1 GB/TB of storage] (With a minimum size of 8GB and a max size of 16GB)
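Just to make that arithmetic concrete for a host like the OP's (32 GB RAM, roughly 1 TB + 128 GB of mirrored pool space), and how a cap would be applied persistently, assuming the method described on the PVE ZFS wiki page:

```
# Forum rule of thumb: 2 GB + ~1.1 GB for the pools     -> ~3 GB
# My variant:          clamped to 8 GB min / 16 GB max  ->  8 GB
# Proxmox default:     50% of 32 GB RAM                 -> 16 GB
#
# Example: cap the ARC at 8 GiB (8589934592 bytes) persistently
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all   # takes effect after a reboot
```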
 
My rule of thumb also allows the ARC to be used as a file cache to reduce access times. Since I write 5 GB+ files somewhat frequently, I want it to be big enough that I don't notice when writes start spilling into the ARC, and big enough that it doesn't run out of space, but also not so big that I'm just wasting RAM.

The ARC is sort of a plan B for writing files. If ZFS cannot write data directly to the SSD (because it is already busy writing/reading), the data sits in the ARC until it can be written to the SSD.
 
Yes, but they also state in their own wiki that you need an ARC size of at least 8 GB.
For me it always sets the minimum at 8 GB and the maximum at 16 GB, since I run a server with 256 GB of RAM.
But I never found a reason to increase the ARC size, and reducing it can cause issues like @dragon73 described, as ZFS cannot function correctly with that little RAM/ARC.
https://pve.proxmox.com/wiki/ZFS_on_Linux
 
There's a good chance you are running into the "don't use consumer-grade SSDs for ZFS" problem. If your SSDs run out of cache (which happens very quickly), data has to be written directly. Besides the faster wear-out, you'll experience delays and sluggish performance.

What SSD models do you use?

Hi,

I searched for the technology used in these drives. It's not entirely clear, but it seems to be QLC (http://www.madshrimps.be/articles/article/1001277/Patriot-P210-2TB-2.5-SSD-Review/2#axzz8ry0ZAeNQ) WITHOUT any DRAM. If you look at the performance test on that site (http://www.madshrimps.be/articles/article/1001277/Patriot-P210-2TB-2.5-SSD-Review/8#axzz8ry0ZAeNQ), you will see that after a certain amount of data the write performance of the drive drops to 13 MB/s (I don't know how many IOPS that is, but it is very low). So what you see is expected with these drives. Either never fill up the SLC cache (which seems hard with these drives and ZFS) or buy better drives; even normal spinning HDDs will perform much better once the SLC cache is filled up...
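If you want to see that cliff yourself outside of ZFS, a long sequential write with direct I/O usually shows it once the SLC cache is exhausted. This is only a sketch; the file path and size are placeholders, and the size should be larger than the drive's expected SLC cache:

```
# Sequential 1 MiB writes, bypassing the page cache; watch the reported
# bandwidth collapse once the drive's SLC cache is full.
fio --name=slc-cache-test --filename=/path/to/testfile --size=50G \
    --rw=write --bs=1M --ioengine=libaio --iodepth=8 --direct=1 --end_fsync=1
```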

If you want to go forward with this hardware, I would install Debian, set up mdraid across the drives, and then install Proxmox on top of that with LVM-thin. This will perform better, but it will still be slow... Or use old HDDs, which are better than these SSDs once the SLC cache is filled up, and with HDDs you get that performance consistently...
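A very rough sketch of that layout, done from a Debian installer or live system before adding the Proxmox packages on top; the device and partition names are placeholders and need to be adapted to your disks:

```
# Mirror the two SSD data partitions with mdraid
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
# Put LVM on top of the mirror: a root LV plus a thin pool for the VM disks
pvcreate /dev/md0
vgcreate pve /dev/md0
lvcreate -L 32G -n root pve
lvcreate -l 80%FREE --thinpool data pve
```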

Whatever you do, make backups! These drives will die without warning.

So long
 
Thank you all for the suggestions and the comments.
I will reconsider changing some of the hardware. When starting with Proxmox, I apparently wasn't fully aware of all the details that make a big difference in how the system works.
 
Hi,
It's all about the drivers. The VirtIO SCSI drivers (vioscsi) are not stable under heavy disk load.
Switching to VirtIO Block (or VirtIO SCSI single) and applying the following settings to each disk (a sample `qm set` command is sketched below the link):
  • cache=writethrough (safe) [or writeback (but not safe)]
  • iothread=1
  • aio=threads
Other solutions:
  • Bus/Device: SATA (not ideal, low IO, but stable)
  • Use a CT instead of a VM

Read this as well:
https://forum.proxmox.com/threads/r...device-system-unresponsive.139160/post-691337
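For reference, this is roughly what those settings look like from the CLI for a single disk. The VM ID, storage name and volume are placeholders; note that after switching to VirtIO Block the disk appears as /dev/vdX inside the guest:

```
# Attach the disk as VirtIO Block with the suggested cache/io options
qm set 101 --virtio0 zfs-storage:vm-101-disk-0,cache=writethrough,iothread=1,aio=threads
# Alternative: keep SCSI disks but use one controller per disk (needed for iothread on SCSI)
qm set 101 --scsihw virtio-scsi-single
```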
 
Hi,
It's all about the drivers. The VirtIO SCSI drivers (vioscsi) are not stable under heavy disk load.
Switching to VirtIO Block (or VirtIO SCSI single) and applying the following settings to each disk:
  • cache=writethrough (safe) [or writeback (but not safe)]
  • iothread=1
  • aio=threads
Other solutions:
  • Bus/Device: SATA (not ideal, low IO, but stable)
  • Use a CT instead of a VM

Read this as well:
https://forum.proxmox.com/threads/r...device-system-unresponsive.139160/post-691337
No, that's not correct. As also stated in the PVE wiki, VirtIO is usually the best choice for performance. The TS's drives are not suitable for ZFS, which is what leads to the poor performance.
 
I agree with you that VirtIO is the best choice for performance.

I have several Proxmox servers running ZFS (RAID 10). I encountered issues with VMs running Windows + SQL Server, which crash and become unresponsive. I am forced to stop the VM, which can cause issues with the databases. However, I have no other choice, as neither network access via RDP nor access through the Proxmox console works.

Initially, I resolved this issue by changing the disk Bus/Device from SCSI to SATA, while keeping the SCSI controller set to VirtIO SCSI (or VirtIO SCSI single). (Not ideal, low IO, but stable.)

After researching on the forum, particularly by reading this discussion [https://forum.proxmox.com/threads/r...device-system-unresponsive.139160/post-691337], I found that the following combination fixed my issue:

  • The SCSI controller: VirtIO SCSI (or VirtIO SCSI Single)
  • The disk should be of type VirtIO Block, with the following settings applied to each disk:
    • cache=writethrough (safe for me, database server)
    • iothread=1
    • aio=threads
This issue can be reproduced using the CrystalDiskMark tool (increase the queue/thread settings in its Settings menu for a higher random load, up to Q32T16). A rough `fio` equivalent is sketched below.

Another solution is to use a CT instead of a VM for SQL Server (Linux-based operating system).
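For anyone testing from a Linux guest, a rough `fio` equivalent of that CrystalDiskMark run could look like this (path, size and runtime are just examples):

```
# 4k random writes at queue depth 32 with 16 jobs, similar to CrystalDiskMark's Q32T16 test
fio --name=q32t16 --filename=/tmp/fio-test --size=4G --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --numjobs=16 --direct=1 \
    --runtime=60 --time_based --group_reporting
```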
 