High IO Delay, Slow performance

MorpheusTrue

Apr 26, 2024
For about 2 weeks now I've been having trouble with my server. Every time I upload data or use a VM, the IO delay goes up to 30-80 % and takes a few minutes to go back down. In the meantime I reinstalled Proxmox for a few reasons; sadly, that changed nothing. So I think it could be a hardware problem.

Currently I have just one LXC container running, for Docker.

[Screenshot from 2024-04-26 10-40-08: IO delay graph]

At that peak I transferred 11.4 GB of music.
Before the new installation I put in four new 3 TB HDDs; only my 1 TB SSD is left over from the old setup.

[Screenshot from 2024-04-26 10-45-04]

While writing this, I uploaded data to my ZFS volume and there were no problems; 110 GB of data ran through flawlessly. So it's definitely my SSD. But is there a way to "repair" it?
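Before replacing anything, it may be worth checking the SSD's SMART data; a rough sketch (the device name /dev/sda is only a placeholder, adjust it to your SSD):

```
# smartmontools is available from the standard Debian/Proxmox repositories
apt install smartmontools

# identify the SSD by model and size
lsblk -o NAME,MODEL,SIZE,TYPE

# full SMART report: check the wear/percentage-used attributes, reallocated
# sectors and the error log at the bottom
smartctl -a /dev/sda
```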

Edit: Also, this shows up in my system logs every minute:
[Screenshot from 2024-04-26 11-14-35: system log excerpt]
 
My Proxmox server (8.1.10) ran fine for over 2 weeks; yesterday afternoon it started showing high IO delay of 50, 60 % and more. After shutting down all VMs and rebooting, everything was normal, but this morning the high IO delay came back. I have now shut down all VMs again, but the high IO delay persists. I am trying an upgrade.
There are quite a lot of posts with the same problem and no real solution, at least not for my problem, since I have no scheduled backups or snapshots etc.
And there is no obvious reason for the problem.
Proxmox should at least give some tips on how to avoid it.
 
The most common reason for sudden IO delay spikes is that consumer or prosumer SSDs are in use.

While they might offer decent performance in desktops, they are not suited for hypervisors, since the general workload differs drastically. Enterprise SSDs are more expensive but are made for this kind of workload, which is why these kinds of issues mostly appear on homelab instances.

Consumer SSDs are only really fast as long as you can use their (tiny) SLC cache. Once that cache is full, their performance tanks hard, and then you get the performance of their main cells, which are the way slower TLC or QLC cells. The performance of QLC is especially bad, usually dropping down to HDD levels.

Additionally, consumer SSDs don't offer power-loss protection (PLP). While this sounds like 'just' a safety feature, it also lets the disk safely acknowledge sync writes from its cache, which drastically increases performance in these kinds of environments.
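If you want to see this effect on your own disks, a small sync-write benchmark with fio makes it visible; this is only an illustration, the test file path and sizes are placeholders:

```
# 4k random sync writes, roughly the pattern a hypervisor produces.
# Consumer SSDs without PLP typically drop to a few hundred IOPS here,
# while enterprise SSDs stay far higher.
fio --name=synctest --filename=/path/to/testfile --size=1G \
    --rw=randwrite --bs=4k --ioengine=sync --fsync=1 \
    --runtime=60 --time_based --group_reporting
```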

The only real solution is to get some Enterprise SSDs. A common recommendation for Home-Labs is to look out for second-hand Enterprise-SSDs.
 
My server is a Fujitsu Primergy RX300 S6 with server-grade HDs (that's not a consumer environment), and it ran fine for the last three weeks in production, and for a month before that while testing Proxmox, without any abnormal IO delay, until yesterday. I didn't change anything. After the reboot yesterday evening it ran normally all night, until this morning it suddenly rose above 50 % again.
My filesystem is ZFS
 
Hello,

Could you please share the exact storage configuration? Is a hardware controller in use? Is it in HBA mode? How many disks? What kind of RAID setup, if any? And, more importantly, what is the exact model of the disks?

Could you please share the VM config of one of the affected VMs? They are at `/etc/pve/qemu-server/<VM-ID>.conf`.

Regarding the original post, I see four HDDs. Which ZFS raid mode are they using?
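For reference, the requested information can be gathered with something like the following (VM ID 100 is only an example):

```
# pool layout, RAID mode and any errors
zpool status

# storage definitions known to Proxmox VE
cat /etc/pve/storage.cfg

# disk models and how they are attached
lsblk -o NAME,MODEL,SIZE,TRAN,ROTA

# config of one affected VM (replace 100 with the real VM ID)
cat /etc/pve/qemu-server/100.conf
```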
 
Hello Maximilano,

thank you for your help, but as I wrote, I figured it out. I am relatively new to Proxmox and to ZFS. My first assumption was that an HD had crashed or was failing, but zpool status said the mirror was fine with no errors, so I looked at zpool scrub and found one was running; I stopped it and the IO delay was gone.
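For anyone else running into this, checking for and stopping a running scrub looks roughly like this (the pool name rpool is only an example):

```
# shows "scrub in progress" and an estimated time remaining, if one is running
zpool status rpool

# stop the running scrub
zpool scrub -s rpool
```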
My configuration is two 144 GB HDs mirrored with a RAID controller for the Proxmox system, and two separate 1 TB HDs as a ZFS mirror (RAID1).

A second server (PBS) is configured similarly.

Regards

 
Could you check if your HDDs are SMR or CMR? It is known that SMR disks perform badly with scrubs.
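There is no single flag that reliably reports SMR, but a sketch of how to narrow it down (the device name is a placeholder): read the exact model with smartctl and compare it against the manufacturer's CMR/SMR list.

```
# exact model and firmware; look the model number up in the vendor's
# published CMR/SMR lists (WD, Seagate and Toshiba all have them)
smartctl -i /dev/sdb

# newer kernels report a zoned model for host-aware/host-managed SMR disks;
# drive-managed SMR usually still shows "none" here, so this is only a hint
cat /sys/block/sdb/queue/zoned
```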

Regarding using a raid setup with a hardware controller, this is not recommended. See ZFS's documentation [2].

Regarding the Backup server, we recommend using SSDs [1] for performance. The reason is that HDDs perform poorly when doing random writes/reads.

[1] https://pbs.proxmox.com/docs/installation.html#recommended-server-system-requirements
[2] https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html#hardware-raid-controllers
 
The HDs are 1.0 TB WD RE3 SATA with 32 MB cache (WD1002FBYS).
The 1 TB HDs are not hardware-mirrored; they are two separate disks configured as RAID1 under ZFS.
I know my servers are not the youngest and my HDs are not the fastest, but I am an IT consultant and this is my home network, so performance is not really an issue. An IO delay of 50, 60, up to 90 %, however, is not acceptable. But now I know how to deal with it and can start the scrub overnight or over the weekend.
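If it helps, a scrub can simply be scheduled for a quiet time, for example via a cron entry like the sketch below (pool name and time are placeholders); note that Debian-based installs usually already run a monthly scrub from /etc/cron.d/zfsutils-linux.

```
# /etc/cron.d/zfs-weekend-scrub
# start a scrub every Saturday at 01:00; adjust pool name and time as needed
0 1 * * 6 root /usr/sbin/zpool scrub tank
```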

I know that for a real production network, enterprise-grade SSDs are recommended.

I migrated everything from ESXi last month. I am still learning, and last week I migrated a customer's network from ESXi to Proxmox, onto a current server with SSDs, and there is another, larger installation at another customer that should be migrated in the coming months.

Regards
 
Hello,

Sorry to bother you if this is a silly question. We have just set up our Proxmox environment with 4 Dell PowerEdge nodes.
We have HA running under ZFS; two nodes are primary (they hold all the VMs) and the other two are solely there for firing up machines when HA is triggered (you could say each node has one backup).

When scrubbing, does the IO delay just affect hard disk performance, or does it hurt more on the CPU side? We've got pairs (RAID1) of 1 TB SAS Dell enterprise-grade drives, and my IO delay at idle fluctuates between 2-10 %; under normal operations it can get up to 20 %, and when doing maintenance it goes up to a whopping 80 %.
We currently have a pair of new Xeons on the way to upgrade our setup; I just want to clarify whether this will mitigate the problem at least a little bit.

Thank you,
Regards.
 
First, it is not clear whether your RAID1 is hardware RAID or software RAID under ZFS. If it is hardware RAID, that is very bad and should under no circumstances be used together with ZFS.
In my experience, disk I/O slows down extremely on hard drives while scrubbing. It is strongly recommended to use SSDs, or ideally NVMe drives.
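You can watch this live while a scrub is running; the load shows up per disk while the CPU stays mostly idle (the 5-second interval is arbitrary):

```
# per-vdev bandwidth and IOPS, refreshed every 5 seconds
zpool iostat -v 5

# per-device utilisation and CPU usage for comparison (sysstat package)
iostat -x 5
```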

Rainer
 
We have the controller passing the disks through as they are, so ZFS can read and write to the disks directly.
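One quick sanity check is whether the host sees the real disks and their SMART data, not a virtual RAID volume (device name is a placeholder):

```
# the vendor model names and serials should appear here,
# not a controller's virtual volume name
lsblk -o NAME,MODEL,SERIAL,SIZE

# SMART should be readable directly; behind a RAID volume this usually fails
# or needs a controller-specific -d option
smartctl -a /dev/sda
```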
 
Just my experience so far.
The servers I used in May had hardware RAID controllers, but I configured each HD as a single channel and let ZFS do the rest, i.e. the RAID1.
In the meantime I have thrown away the old servers, as they consumed too much power, and now I use a NUC device with an NVMe disk. I know that's not for a professional environment and there is no RAID, but for me that is OK. Since then I have never had any problem with scrubbing and did not notice any significant decrease in disk I/O. I also have a USB 3 hard drive under ZFS for archive volumes, where performance is not the issue either, but there is a significant slowdown while scrubbing is running there too. For example, the 600 GB backup is normally done in seconds, or in less than 2 minutes if content has changed, but if a scrub is running the backup time goes up to 10 hours and the performance of the whole system is degraded. On the other hand, CPU usage does not increase much.
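One option for that situation is to pause the scrub during the backup window and resume it afterwards; a sketch assuming a reasonably recent OpenZFS version and a pool named tank:

```
# pause the running scrub
zpool scrub -p tank

# ...run the backup, then resume the scrub where it left off
zpool scrub tank
```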
My conclusion: a fast disk channel seems to be extremely important.
My config: a NUC with 64 GB RAM, a 256 GB M.2 disk for the system, a 1 TB NVMe disk for about 10 VMs, a 1 TB USB 3 HD for the archive volume, and a separate PBS server.

Rainer