IODelay > 45% after updating to 8.4.5

hac3ru

Hello,

I'm facing something quite weird: I've updated my PVE hosts to 8.4.5 and one of them now shows a huge IO delay. To give you an idea, we used to see under 15% IO delay; now it's up to 45-50%, see below:
1753465703003.png
Does anyone have an idea how I can clear this? It's quite bad, especially when the backups start running, because the host just freezes in place.
This doesn't happen on all hosts, even though all of them are running the same version, 8.4.5, updated last night at midnight. That is exactly when the IO delay started to show up. There were no hardware changes on the server, but it was physically moved from one location to another (I don't think it matters, but...).

Thank you!
 
Hey Impact, thank you for the reply.

I don't know what to tell you about the hardware. It's an "old" (3-4 years old) SuperMicro server with 768 GB of memory and 2x AMD EPYC 7302 16-Core Processor. The disk is an iSCSI mount shared across multiple hosts; this host is the only one that has issues.
Using `iotop-c -cPo` I was able to see that a Jenkins controller VM (not a builder node) was reading from the disk at around 100 MB/s and was keeping it "locked". I moved the VM to another node and all is well.
As for Jenkins, nothing out of the ordinary was started today. I'd say we had 30-45% fewer builds than usual, and again, this is not a build node, so I don't think it matters.

P.S. I'm using LVM-Thin over the iSCSI LUN.
The LUN is a RAID 5 over 3 SSDs. I've seen thousands of IOPS going to this storage from this exact same host and nothing complained. That's what makes me think something happened to it....

I'll try to move Jenkins back and see if it does the same. I was wondering, can a bad cable/SFP cause this?
Later edit: moved Jenkins back, nothing too bad.... I'll trigger a backup and see how that goes :D

Thank you!
 
I'm not very familiar with iSCSI, so I can't help with specifics here, but as for the bad cable question: `grep -sR . /sys/class/net/*/speed` would show the speed of all NICs, and `grep -sRE . /sys/class/net/*/statistics/*(drop|err)*` would show packet drops/errors. `ip -s a` might be a bit more readable for the latter.
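The same check can be wrapped in a small function that prints each NIC's speed and only the counters that are not zero. This is a rough sketch, not a definitive tool: the function name `nic_report` and its optional base-directory argument are made up for illustration, and it assumes the usual sysfs layout under `/sys/class/net`.

```shell
#!/bin/sh
# Print link speed and any non-zero drop/error counters for each NIC.
# nic_report is a hypothetical helper name. The optional first argument
# is the sysfs base directory (default /sys/class/net); it is only a
# parameter so the function can be pointed at a copy of the tree.
nic_report() {
    base="${1:-/sys/class/net}"
    for dev in "$base"/*; do
        # Skip anything that does not look like a network interface.
        [ -d "$dev/statistics" ] || continue
        name="${dev##*/}"
        # Reading "speed" can fail on interfaces without a link.
        speed="$(cat "$dev/speed" 2>/dev/null || echo unknown)"
        echo "$name speed=$speed"
        for f in "$dev"/statistics/*drop* "$dev"/statistics/*err*; do
            [ -r "$f" ] || continue
            val="$(cat "$f")"
            if [ "$val" != "0" ]; then
                echo "  ${f##*/}=$val"
            fi
        done
    done
    return 0
}
```

Calling `nic_report` with no arguments reads the live counters. Running it before and after a backup and comparing the two outputs would show whether any error counter grows while the IO delay climbs.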
 
The speed is 10G, which seems fine, and there are no errors on the NIC itself.

The backup seems to make it crash. It works fine for some time with a really low IO delay (we're talking under 5%), and then all of a sudden, without any obvious reason, the IO delay works its way up to 30-45%, backup read and write speeds slow to a crawl, and nothing else works anymore.
I haven't seen this happen on one specific VM, which makes me think it's not caused by a corrupt VM or something similar.
Right now, I've backed up around 500 GB worth of backups before it jammed again....
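
To pin down exactly when the delay starts climbing during a backup, the iowait share can be logged with timestamps from `/proc/stat`. This is a minimal sketch under some assumptions: the IO delay shown in the GUI is assumed to track the kernel's iowait counter, and the names `iowait_pct` and `sample_iowait` are hypothetical helpers, not part of PVE.

```shell
#!/bin/sh
# Compute the iowait share (integer percent) between two aggregate
# "cpu ..." lines taken from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) before[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) total += $i - before[i]
            wait = $6 - before[6]
            if (total <= 0) print 0
            else printf "%d\n", 100 * wait / total
        }'
}

# Print a timestamped iowait percentage every N seconds (default 5).
# Runs until interrupted with Ctrl-C.
sample_iowait() {
    interval="${1:-5}"
    prev="$(grep '^cpu ' /proc/stat)"
    while sleep "$interval"; do
        cur="$(grep '^cpu ' /proc/stat)"
        echo "$(date '+%F %T') iowait=$(iowait_pct "$prev" "$cur")%"
        prev="$cur"
    done
}
```

Running `sample_iowait 5 >> iowait.log` in a second shell while the backup runs gives a timeline that can be lined up against the backup task log to see which VM or which phase was active when the delay jumped.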