IOWait Increases, Server eventually locks up and/or requires external intervention, please help.

Kursah

New Member
Apr 14, 2024
I had originally created this topic: https://forum.proxmox.com/threads/proxmox-server-webgui-and-ssh-no-longer-accessible.156006/, as I had corrupted my single-disk PVE deployment.

I now have a BTRFS RAID1 across two different SATA SSDs for my current PVE installation. I remounted my ZFS pool to it and moved my Proxmox backup USB HDD to a dedicated mini PC running PBS (it was a CT on my previous PVE deployment). I am able to restore VMs, but the original issue that caused me to corrupt my PVE deployment in the first place is back in full force, and I'm hoping to diagnose it successfully this time.

First, I realize I'm running very old hardware. Everything with PVE worked GREAT up until about a month ago. I originally repurposed this build from a Hyper-V homelab core server to a PVE server, added a GPU for passthrough to a Plex CT for accelerated transcoding, and was passing through the CPU's integrated GPU to a Win 10 VM for UI acceleration so I could run my applications (mostly NinjaTrader and other trading apps) effectively over RDP over VPN sessions.

System Specs:
CPU - i7 4770 (4C/8T, non-K) + Thermalright aftermarket cooler (keeps it below 65C).
MB - Supermicro X10SLQ (OOBM is Intel vPro; MeshCommander mostly used, outside of SSH directly to PVE).
RAM - Mushkin 32GB DDR3 (4x8GB DDR3-1600 @ 1600 CL11)
GPU - Asus Phoenix GTX 1660 Super OC 6GB in a PCIe x4 slot (NVENC encoding for Plex)
Storage:
- SSD 1 - SATA, 250GB
- SSD 2 - SATA, 240GB
- BTRFS RAID1 at 240GB for the PVE OS deployment and ISO storage.
HBA: Inspur LSI 9300-8i flashed to IT mode
- 6x 3TB HGST HDDs in ZFS RAID10, connected via the TS430 backplane arrangement.

I keep all of this in a modified Lenovo TS430 case with two 4-bay 3.5" backplanes.

I don't expect high performance, but whatever happened about a month ago now drives up the IO Delay shown under CPU Usage; it lines up with the IOWait metric in Netdata and with the PSI values (iosome and iofull) in atop, so they all appear to be related. I suspect it has more to do with my ZFS pool than anything else, and that's where I'm hoping for some guidance. I have done some research and, for example, installed atop. I ran a fresh scrub with no VMs running: no errors. All drives' SMART values are OK.
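For reference, these are roughly the commands I used for those checks (the pool name "tank" and the exact device names are stand-ins for my actual setup):

  zpool scrub tank                                   # fresh scrub with all VMs stopped
  zpool status -v tank                               # finished with 0 errors
  for d in /dev/sd{a..f}; do smartctl -H "$d"; done  # every drive reports PASSED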

It seems like the server comes to a halt in stages. The VMs stop responding first, then after a while PVE itself does the same, ultimately requiring my out-of-band restart process from MeshCommander or the vPro WebGUI. I am hoping to get a little more mileage out of this hardware until I can afford to upgrade to something truly better. At this point I can't even restore my full environment without running into issues... for something that was so surprisingly smooth and fast a few months ago, it is now only fast if I run nothing.

Even just running my UniFi CT (101) yesterday, IOWait eventually started climbing and climbing until the same result: I had to reboot the damn hardware.

I've run memtest, CPU tests, etc., and those all come back fine with no errors. Maybe ZFS is too much for this old hardware with recent PVE updates? Maybe there's a bug kicking my ass and I'm just too ignorant at the moment to clearly identify it?

Please help?


Here are some screenshots:

Proxmox dashboard, showing IO Delay staying up. This is after a restore of a 56GB Plex LXC suddenly stalled partway through and I ultimately stopped it. I did the same yesterday, but let it sit for over 2 hours with no progress.
[screenshot]

Here's the WebGUI ZFS storage results:
[screenshot]

Here's zpool iostat and zpool status results from SSH:
[screenshot]

Here's IOTop, mostly showing nothing happening:
[screenshot]

Here's what atop looks like when IOWait is getting higher; iosome and iofull under PSI continually stay in the mid-to-upper 90% range.
[screenshot]
 
In atop only sdg and sdh are shown. Could you send the output of "zpool iostat -lv 1" or "iostat -xm 1" for about 10 seconds while the I/O problem is occurring? With both tools the first output can be cut off, since it only shows averages since boot.
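Something along these lines should capture what I need (pool and device names as on your system; the count of 12 just gives a bit more than 10 seconds):

  zpool iostat -lv 1 12 | tee zpool_iostat.txt   # ~10s of per-vdev latency stats
  iostat -xm 1 12 | tee iostat.txt               # plain iostat comes from the sysstat package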
 
Thank you for the reply.

It took a surprising amount of time for this issue to crop back up, to the point I almost thought somehow, someway maybe I was out of the woods. Not the case.

I was restoring my 4TB file server VM from my PBS device and got about 17% through it. About 25 minutes ago, the transfer between PBS and PVE all but stopped, and IOWait went from 0-4% with small peaks here and there to above 35%, floating up toward 50%.

Here's Netdata:
[screenshot]

PVE Task Viewer:
[screenshot]

PVE Dashboard showing system load and IOWait going through the roof:
[screenshot]


PBS Task Viewer and stats; you can see the bandwidth on the left fall off suddenly about 25-30 minutes ago:
[screenshot]


And as you requested, here's the output from zpool iostat -lv 1 (see the attached txt file as well). I took a screenshot of the top since it looked like a mess in txt form.
[screenshot]


I was unable to get zpool iostat -xm 1 to work; it gave this error:
[screenshot]
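(My assumption is that the -x/-m flags belong to the standalone iostat tool from the sysstat package rather than to zpool iostat, so next time I'll try something like this instead; please correct me if that's wrong:

  apt install sysstat
  iostat -xm 1
)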

Lastly, here's atop -l; KVM seems to be consuming 100% CPU at the moment, out of nowhere.
[screenshot]



I hope this helps, thank you for your time on this matter.

At the moment the PVE UI seems to still be somewhat responsive.

Edit: Forgot to add the alerts posted in Netdata:
[screenshot]
 

Attachments

  • 1730315259768.png
  • SpartanProx1~# zpool iostat -l.txt
Adding another reply: in the WebGUI syslog for my PVE server, I see a lot of these: VM xxx qmp command 'query-proxmox-support' failed - unable to connect to VM xxx qmp socket. I noticed this during high-IOWait periods before I corrupted my original OS drive. I have attached a screenshot and have also copied the logs to text from approx. 8AM to 5PM.
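If it's easier for anyone, I believe the same lines can also be pulled straight from the journal over SSH with something like this (the times are just the window I mentioned):

  journalctl --since "08:00" --until "17:00" | grep -i 'qmp socket' > qmp_errors.txt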

I left the restore job running for the past 4.5 hours with no change whatsoever. IOWait is still high, and my two running Windows Server 2019 DC VMs aren't responding; I can't console or RDP in, and they don't respond to ping. At this point, if I try to shut them down or reboot my PVE host, I know I'll have to intervene via OOBM or the physical power switch. Sigh.

Anyway, here's a screenshot of this 'unable to connect to qmp socket' error. I'm not sure if it is relevant, but it starts up at the same time IOWait kicks up... so maybe it's just an effect of the actual root cause.

Thanks to everyone who takes a moment to look and might have some help to send my way. I wish I could go back to March-Sept, when this server ran totally fine. I was ignorantly hoping a new OS deployment might do the trick. Maybe something's wrong with my ZFS? The drives seem to test fine and the RAID card in IT mode seems OK, but clearly something is wrong. I am unwilling to migrate back to Hyper-V. I am going to figure this out, and I wish I had the budget to upgrade to newer, faster hardware, but I don't at the moment.

Please help if you can. Thanks!

[screenshot]
 

Attachments

  • Oct 30 000008 SpartanProx1 Syslog from WebGUI.txt
Morning update: I hardware-rebooted my PVE server after submitting the above logs. I let it run idle with no VMs or CTs running, and things looked fine, with small spikes in IOWait that barely breached 5-6%. I then performed a reboot from within the PVE environment to get a "clean" reboot, let that idle for about half an hour, and it seemed fine as before.

I then initiated the restoration of VM 105, my file server with >4TB of data. The job made it about 90 minutes, by which point I was in bed fast asleep. IOWait crept up to the 25-30% range and no further progress was made.

This morning I ultimately cancelled the restoration. I unlocked VM 105 and attempted to destroy it, which only sent IOWait even higher with no actual activity, so I cancelled that job too.

I did attempt a reboot from within PVE even with high IOWait, since it does at least start the PVE shutdown process. I gave it some time and, lo and behold, today it was able to reboot itself without OOBM intervention, so I at least have that much going for me. That was with no VMs running, though, since running them adds to the IOWait and slowness issue.

Didn't see anything that stood out to me in the webgui syslogs.

When IOWAIT is high, system load is high. When those are high, nothing happens on the system. No fans are spun up, no high temps, no hard crashes or total lockups at this point.
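For what it's worth, the box usually still answers over SSH for a while when this happens. My understanding is that the PSI file and a listing of D-state processes are the standard places to look in that situation, so this is the kind of quick check I can run (a sketch, not expert advice):

  cat /proc/pressure/io                              # the "some"/"full" pressure values atop shows as iosome/iofull
  ps -eo pid,state,wchan:32,comm | awk '$2=="D"'     # processes stuck in uninterruptible I/O wait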

Not sure what else to do at this point, can someone please point me in the right direction? Thanks in advance.
 
Good morning to anyone following.

I was able to run my UniFi controller and my Server 2019-based DC1 and DC2 VMs all night without issues or creeping IOWait. Granted, there shouldn't have been any major data transfers or reads, but I do suspect both DC1 and DC2 might have more updates to grab once those policies take effect again.

That being said, even though IOWait is low, there are consistent spikes, as seen in the screenshot below. That doesn't seem like normal behavior to me; what do you guys say?

[screenshot]

I will be shutting down all VMs and attempting to restore my file server and its 4TB+ data drive. I suspect I'll end up with high IOWait and an all-but-locked-up PVE server again. Right now, with those VMs running, this is the longest I've been able to run VMs on this server without it locking up in over a month.

Even though I did thoroughly test the RAM with memtest for about 12 hours in recent weeks without issue, I do have another 32GB of DDR3 to swap in over the weekend and test. I also have an identical 9300-8i card in IT mode in my other core server (still Hyper-V, powered off for months) that I may swap in as well.

Beyond that, I'm kind of at a loss here and hoping someone will have some pointers, directions, or diagnostic suggestions for me to try. Thanks!
 
Made it one hour into the restore, CPU is pegged at 100%, IOWAIT is pegged. Surely someone else at some point in PVE history has come across this issue before? What am I missing? What am I being ignorant of on this? I'm hoping an outside perspective can at least get me going in a direction.

Anyone have anything? Please?

[screenshots]
 
Hi,

I did not see which SSDs you are using, but I think this is the reason. The effects you are describing point to a consumer SSD that runs out of its pseudo-SLC cache and then gets slower than a hard disk. This behavior is expected, and it is even worse if the SSDs are QLC. Either get yourself a decent SSD with PLP (an enterprise SSD), or wait until the disk has written out the restore (with QLC SSDs you literally get only a few MB/s once their pseudo cache runs out) and adjust your workload in normal operation so the cache does not run out. My suggestion would be to get an enterprise SSD; even used ones are much better than consumer SSDs. The problem without PLP (power loss protection) is that direct I/O cannot be cached in the SSD's RAM and leads to an immediate write to the flash, which is slow and consumes the lifetime of the flash. With PLP, direct writes can be cached and written out once enough data is buffered to fill a whole block.
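To check which models you actually have, something like this is usually enough, and the datasheet then tells you whether the drive is QLC and whether it has PLP (adjust the device names to your system; sdX is a placeholder):

  lsblk -d -o NAME,MODEL,SIZE,ROTA   # ROTA=0 rows are the SSDs
  smartctl -i /dev/sdX               # repeat for each SSD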
 
Thank you for the response! You are 100% correct: I am using consumer SSDs for the OS, in BTRFS RAID1, and have been for the 6+ months I've been on Proxmox.

How can I verify those SSDs are the issue? I would love to confirm that's the case.

If this is truly the issue, I wonder why, with the exact same VMs, data load, I/O loads, etc., everything worked perfectly fine from April through September and only became an issue in October. Further, the issue persists with two different consumer SSDs in a different config, used for PVE OS duties only, on a fresh PVE install. It's just odd that everything worked great until it suddenly didn't this past month.

When I was performing the restore, what I didn't see was increased I/O on the BTRFS array; I'm restoring to the ZFS pool, which is where the I/O traffic shows up, but maybe I need to watch more closely. Is there some sort of default caching on the drives I have dedicated to OS duties that I can diagnose?

I have no VMs or CTs hosted on the SSD BTRFS array; they're all on the RAID10 ZFS array (all HGST 7.2k 3TB HDDs). I will definitely start looking at acquiring an enterprise SSD, which is on the list of future upgrades, but until I can budget for that, I'd like to verify that it is the culprit. If you have some pointers on how to verify this, I would love to do just that.

What you say makes perfect sense, and I have run across that explanation for issues like this in previous searches, but the diagnostics I did find turned up no clear evidence that this is the problem. My server is still pegged at the moment, with system load at 45 and IOWait at 100%. I am hoping I can use the current situation to diagnose and confirm what my next step should be.
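My rough plan to verify this, assuming I'm reading the tools right, is to watch per-device latency and utilization while a restore runs and see whether the BTRFS SSDs (sdg/sdh) or the ZFS HDDs (sda-sdf) are the ones that are saturated:

  iostat -xm 5          # compare %util, r_await and w_await across devices
  zpool iostat -lv 5    # per-vdev latency inside the pool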

Thanks again! :)
 
Hi again,

I read the thread again. Sorry, I did not read it that closely the first time and overlooked that your ZFS is on HDDs. I am not a ZFS expert, but if this only happens under a lot of writes (a VM restore, for example), you may be filling up the ZFS RAM write cache; it then has to be flushed to the ZFS drives, which can only do so many IOPS, and further writes are held back until a percentage of the cache is free again. If you are using 3 mirrors, each mirror capable of about 200 IO/s, you get roughly 600 IOPS from that ZFS configuration, and the restore of the 4TB VM is eating it all up. What you can try is to stop all VMs/CTs and do the restore with a bandwidth limit set (for example 20 MB/s, if your current writes from the screenshot were 28 MB/s) so ZFS can keep up with writing data to the disks. Afterwards, start your VMs again and you should be fine.

The high load is not really CPU usage; it is that high because every process that is waiting for I/O counts toward the load average as if it were fully occupying a core, resulting in that insanely high load value.
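From the CLI that would look roughly like this, if I remember the syntax right; the limit is given in KiB/s, so 20 MB/s is about 20480, and the archive/storage names are placeholders:

  qmrestore <pbs-backup-volume> 105 --storage <zfs-storage> --bwlimit 20480

I believe you can also set a datacenter-wide default in /etc/pve/datacenter.cfg with a line like "bwlimit: restore=20480", but double-check the reference documentation for the exact format.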
 
That makes sense! Thank you very much!!

I was actually contemplating trying to limit the bandwidth as well. The restore I attempted today was with all VMs off, and after an hour of things going well, everything suddenly halted, IOWait shot up, and the restore stopped making progress.

I have cancelled the restore process now and will reboot to clear the IOWait (unless there's something I can do without rebooting? If so, I've yet to find it).

I'll give this a test with throttled bandwidth and see what comes of it. Usually it floated around 350-500Mbps over the LAN, with 100-600MB/s spikes for the ZFS array when watching Netdata. I'll aim for 100MB/s to start and see if that is doable. I still worry about what may happen after I get everything restored, since I haven't really fixed anything. But if I can get this last VM restored, I can move on to the next.
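In bwlimit terms 100 MB/s should be roughly 102400 KiB/s, if I have the units right. While it runs, I plan to keep an eye on whether the pool keeps up, roughly like this (with "tank" standing in for my pool name):

  zpool iostat -lvy tank 5   # -y skips the since-boot sample
  cat /proc/pressure/io      # PSI should stay low if the pool is keeping up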

Is there any logical reason to think I should destroy and rebuild the ZFS pool? That has been rolling around in the back of my mind, but I've yet to find anything suggesting it's worth the effort.

Thanks again, Retro! Much appreciated.
 
No problem,

If the ZFS array did not run full (always keep it below roughly 80 percent usage), I don't see any reason to recreate it, but as I said, I am not a ZFS expert.
 
Once all my data is restored, I'll be at approx. 50% utilization of available ZFS storage space, maybe even slightly under 50%.

Started the restore with a 100MB/s limit; I'll see how that goes and adjust from there if it fails. I would hope it can at least keep up with that much, lol. If not, then whatever is wrong, I have little faith that I'll be able to operate as I did before the issues began, even if I do get all my data restored.

Fingers crossed the restore succeeds and I'm able to get by until I can get my upgrade budget sorted.

Thanks!
 
I hope the restore goes well, but in case the problem occurs again: did you check the SMART values of the hard drives? Sometimes SMART reports the drive as OK, but the bad block count (Reallocated_Sector_Ct) keeps going up, which indicates a failing drive, so I would check those values specifically. If an HDD is trying to handle bad blocks, its write throughput can drop to a few KB/s and the I/O delay goes up like crazy.
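Something like this prints the interesting counters for all six HDDs in one go (device names assumed from your earlier post, adjust as needed):

  for d in /dev/sd{a..f}; do
    echo "== $d =="
    smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
  done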
 
Yep, they checked out okay from both the WebGUI and SSH; I have probably checked the SMART values a dozen times or more this week thinking along those lines as well. I ran short self-tests on each drive yesterday too. I thought it could potentially be a failing drive, but so far all 6 of my 3TB drives are showing healthy SMART results. sda-sdf are the HDDs.

Please see the screenshots below. I just captured the WebGUI results; glad to do the same for the SSH output as well, LMK.

[screenshots]
 
No, they look good, all perfect; it was just a thought. I am betting that with the right bandwidth limit it will all work out.
 

Ironically, I killed this PVE instance on the reboot after my last message: IOWait shot up, the WebGUI became unresponsive, and I pushed a reboot via the CLI. I clearly didn't wait long enough before forcing a reboot via OOBM, and that corrupted the Proxmox install.

I verified that I need a VGA monitor or dummy plug hooked up to see the Linux boot process through the vPro OOBM; I moved one over from my other core server (my older Hyper-V machine), which isn't in use at the moment.

So here's what I've done so far:
- Reinstalled Proxmox on the 2x 240GB SSDs in ZFS RAID1 instead of BTRFS RAID1.
- The installer set the ARC cache limit to 3GB; I commented that out of the config file for now (see the sketch below) so it can default to using up to 16GB of my 32GB RAM when balancing the bootable ZFS and the data ZFS.
- I could not import my data ZFS pool for the life of me: the server would eventually lock up while processing, with no increased CPU usage and no increased IOWait; it would just sit there, the disk status would become unknown, and it would never import. Since I was going to be restoring from backups anyway, I wiped all the disks in the data pool and rebuilt it as ZFS RAID10 as before.
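For anyone curious, the installer's ARC limit lives in /etc/modprobe.d/zfs.conf; the line I commented out looked roughly like this (the value is in bytes, so 3 GiB here), and my understanding is the initramfs has to be refreshed and the host rebooted for a change there to take effect:

  # /etc/modprobe.d/zfs.conf
  #options zfs zfs_arc_max=3221225472   # 3 GiB cap set by the installer; commented out to fall back to the default

  update-initramfs -u -k all            # then reboot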

So far, everything is humming along great. I restored all my smaller VMs and CTs up to where I was on Friday, and I'm about 9.5 hours into restoring my 4TB file server VM without issue, around 41% done with the big virtual data volume. I don't have anything else running at the moment, but loads are good.

So maybe the data ZFS pool was the issue all along, even though everything reported healthy? I must not have validated everything I could have or should have. I would have liked to have properly diagnosed it, which would have added value to this topic and to my Proxmox and ZFS experience. Thank goodness I have good backups, and PBS is so easy to work with and mount to a new PVE instance. I have to say, if anything, I needed the experience this topic has provided so far: loading a new PVE instance, learning why you want to back up your /etc/pve directory at minimum, messing around with different file systems on ancient budget homelab hardware, etc.

Thanks again for the help and dialog!

Screenshots so far. It is odd that the data transfers spike rather than stay steady; maybe that's just something I'm not familiar with, or a limitation of my homelab, but everything seems to be okay at this point:

[screenshots]
 
That sounds like you somehow had a *****-up ZFS pool … glad to hear it's working now after recreating the array!

So Long
 
