High IO Delay and Unable to Login via Web Interface

colin1234

New Member
Jan 3, 2024
Okay, so here's a new one that I'm having trouble figuring out. For the last week or so, I sometimes can't log in to the web interface: I get the "Unable to Login" error message. I've Googled it and tried restarting services etc. via SSH, but nothing seems to make a difference. After some time it randomly starts working again and lets me log in. This morning I noticed that right before it let me log in, the IO delay was ridiculously high: it sat at 40%-50%, and as soon as it dropped back to around 5% I could log in (see image below). While the issue is happening, all of the VMs and containers seem to behave normally, except for my Frigate container, which writes to a separate 8TB spinning disk with ZFS. It actually misses recordings during the same periods when I can't log in to the web interface.
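To line up the IO delay spikes with the login failures and the missing Frigate recordings, it can help to log IO delay with timestamps. Below is a minimal sketch in Python, assuming the IO delay graph roughly corresponds to the CPU iowait share the kernel exposes in /proc/stat. The helper names (read_cpu_times, iowait_percent, monitor) are hypothetical, not part of any Proxmox tool.

```python
import time


def read_cpu_times(stat_text: str) -> list[int]:
    """Return the aggregate 'cpu' counters from /proc/stat content."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            # Fields: user nice system idle iowait irq softirq steal guest guest_nice
            return [int(v) for v in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before: list[int], after: list[int]) -> float:
    """Share of CPU time spent in iowait between two samples, in percent."""
    # guest/guest_nice are already included in user/nice, so only the
    # first 8 fields are summed to avoid double counting.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def sample(path: str = "/proc/stat") -> list[int]:
    with open(path) as f:
        return read_cpu_times(f.read())


def monitor(interval_s: float = 5.0, samples: int = 12, path: str = "/proc/stat") -> None:
    """Print one timestamped iowait line per interval, `samples` times."""
    prev = sample(path)
    for _ in range(samples):
        time.sleep(interval_s)
        cur = sample(path)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} iowait={iowait_percent(prev, cur):.1f}%", flush=True)
        prev = cur
```

Calling monitor(5, 17280) and redirecting the output to a file would cover roughly one day at 5 s resolution, which can then be compared against the times when logins fail or Frigate drops recordings.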

-I have my boot drives set up as a 2-SSD ZFS mirror, plus a separate 2-SSD ZFS mirror for my VMs/containers. The boot drives are consumer SSDs and the VM drives are enterprise SSDs. I've verified that none of the partitions are running low on space.

-Running Proxmox VE 8.3.0. I updated and rebooted about a week ago, and it seems like *maybe* the problems started around then.

-Single Node

Can anyone point me in a direction that might help me find the root cause of this?

[Attached image: IODelay.png]
 
Hi, first try to figure out what is causing the IO delay.

iostat or atop should at least tell you which devices are in trouble and which processes are affected (though… it's probably the VMs ;))
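If iostat or atop aren't installed, a rough per-device busy percentage can be computed straight from /proc/diskstats. The sketch below is an assumption-laden approximation of what iostat -x reports as %util: it uses the "time spent doing I/Os" counter (the 10th stat field after the device name) and compares two samples. The function names are hypothetical, and loop/ram devices are skipped by assumption.

```python
import time


def parse_diskstats(text: str) -> dict[str, int]:
    """Map device name -> io_ticks (ms during which the device had I/O in flight)."""
    ticks = {}
    for line in text.splitlines():
        parts = line.split()
        # major, minor, name, then at least 11 stat fields
        if len(parts) < 14:
            continue
        name = parts[2]
        if name.startswith(("loop", "ram")):
            continue
        ticks[name] = int(parts[12])  # field 10: ms spent doing I/Os
    return ticks


def utilization(before: dict[str, int], after: dict[str, int], interval_s: float) -> dict[str, float]:
    """Percent of the interval each device was busy, capped at 100."""
    interval_ms = interval_s * 1000.0
    result = {}
    for dev, t1 in after.items():
        t0 = before.get(dev)
        if t0 is None:
            continue  # device appeared between samples
        result[dev] = min(100.0, 100.0 * (t1 - t0) / interval_ms)
    return result


def busiest(interval_s: float = 5.0, top: int = 5, path: str = "/proc/diskstats") -> list[tuple[str, float]]:
    """Sample twice and return the `top` busiest devices, busiest first."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_diskstats(f.read())
    ranked = sorted(utilization(before, after, interval_s).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]
```

Running busiest() while the IO delay is high should show whether the spinning Frigate disk, the boot mirror, or the VM mirror is saturated (partitions and zvols are listed separately from their parent disks).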

If you have another monitoring tool running, check it for insight too… You can also use SMART to get some info about how "damaged" your SSDs are; ZFS (and Ceph) can hammer them quite a bit.
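For the SMART check, smartctl can emit JSON (smartctl -a -j /dev/sdX, smartmontools 7.0 or newer), which is easier to scan across several drives. The sketch below pulls out a few wear-related values. The JSON key names are what recent smartctl versions produce as far as I know, and the set of ATA wear attribute IDs is a vendor-specific assumption (for example 177 on many Samsung drives, 233 on many Intel drives), so treat missing keys as "not reported" rather than "healthy". The function names are hypothetical.

```python
import json
import subprocess

# Assumed, non-exhaustive set of vendor-specific ATA attributes that track
# SSD wear; their normalized VALUE usually counts down from 100.
WEAR_ATTRIBUTE_IDS = {177, 202, 231, 233}


def wear_summary(report: dict) -> dict:
    """Extract overall status and wear indicators from smartctl JSON output."""
    summary = {"passed": report.get("smart_status", {}).get("passed")}
    nvme = report.get("nvme_smart_health_information_log")
    if nvme is not None:
        # NVMe drives report wear directly as a percentage of rated endurance.
        summary["percentage_used"] = nvme.get("percentage_used")
        summary["media_errors"] = nvme.get("media_errors")
        return summary
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        if attr.get("id") in WEAR_ATTRIBUTE_IDS:
            summary[attr.get("name", str(attr.get("id")))] = attr.get("value")
        elif attr.get("id") == 5:  # Reallocated_Sector_Ct
            summary["reallocated_raw"] = attr.get("raw", {}).get("value")
    return summary


def read_smart(device: str) -> dict:
    """Run smartctl for one device and parse its JSON report."""
    # smartctl's exit status is a bitmask that can be non-zero even when a
    # full report was printed, so the return code is not treated as failure.
    proc = subprocess.run(["smartctl", "-a", "-j", device], capture_output=True, text=True)
    return json.loads(proc.stdout)
```

Running wear_summary(read_smart("/dev/sda")) as root for each boot and VM SSD gives a quick side-by-side view; the consumer boot SSDs are the ones worth checking first.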

Proxmox processes tend to behave really badly when storage starts acting up, so that could maybe be improved… But you probably have an actual hardware issue there.
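One way to check whether the Proxmox daemons are the ones stuck on storage is to list processes in uninterruptible sleep (state D) while the IO delay is high. This is a minimal sketch, assuming the login failures come from the API daemons (for example pvedaemon or pveproxy workers) blocking on I/O; the helper names are hypothetical. Kernel threads such as ZFS's txg_sync may also show up here, which would point at a pool.

```python
import os


def parse_stat(stat_line: str) -> tuple[str, str]:
    """Return (comm, state) from a /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 2]
    return comm, state


def blocked_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for every process currently in state D."""
    result = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if state == "D":
            result.append((int(entry), comm))
    return sorted(result)
```

Note that the kernel truncates comm to 15 characters, so a pvedaemon worker appears as "pvedaemon worke". Sampling blocked_processes() a few times during an episode is more telling than a single snapshot, since short D-state waits are normal.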