Extreme high CPU Load and IO Delay issues with latest version of Proxmox VE

radensun

New Member
Jul 23, 2023
Dear All Experts,

I'm seeing high CPU load and IO delay on our 8 Cisco UCS C-Series servers.
I've been using Proxmox VE since 2012, in production on several types of servers, and I have never seen IO delay and CPU load problems like the ones on these UCS servers.

Previously, these servers ran Proxmox VE 5 and 6, and the PVE dashboard showed no sign of an IO delay problem on those versions. After we upgraded the first 3 servers, the IO delay problem appeared. We have tried many things but have not found a solution.

As a result of the IO delay and CPU load, memory utilization is not optimal and read/write throughput is only around 10-30 MB/s. Even data transfers within the same server's disks, or between servers on the LAN, reach only 10-30 MB/s.

I have tried the following, and confirmed the results, without finding a solution:
1. Reconfiguring the RAID and virtual drives
2. Resetting the BIOS and CIMC (Cisco server controller) configuration
3. Upgrading the BIOS and CIMC firmware
4. Replacing the HDDs with new ones, and also upgrading/downgrading the HDD firmware
5. Using JBOD mode without RAID
6. Manually installing Debian 11/12 from the APT installer
7. Tuning the system for IO delay issues
8. Asking the Cisco community for support on the UCS C-Series (a managed service was not available for our servers)

I have also attached atop output showing high disk IO activity, which appears to be the cause of the issue above, even though the SAS HDDs are new.

I hope the Proxmox VE experts can suggest a solution to this issue. Thank you.

best regards,

Sunardi
 

Attachments

  • 2023-07-09_171337.jpg
  • 2023-07-09_174603.jpg
  • 2023-07-09_174752.jpg

To be honest, several factors can cause this, so it's hard to pin down the exact cause. Since no one has answered in the last 20 hours or so, here are some basic suggestions I can think of for troubleshooting the high CPU load and IO delay on your UCS C-Series servers running Proxmox VE:
  • Check firmware and driver versions for storage controllers, HBAs, and RAID cards.
  • Make sure any hardware RAID is configured correctly, including the RAID chunk size.
  • Consider switching to HBA pass-through mode instead of hardware RAID, if supported.
  • Check for misconfigured or faulty drives causing excessive retries and slowness.
  • Rule out other resource-intensive processes on the Proxmox host, such as backups.
  • Monitor resource use with 'iostat' and 'iotop' to identify disk, path, or process bottlenecks.
Start with firmware, drivers, and storage connectivity, then tune from there, monitoring as you go to isolate the bottleneck.
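
If 'iostat' is not installed, the same IO wait figure can be derived from the kernel's counters directly. The following is only a minimal Python sketch, assuming a Linux host with the standard /proc/stat layout; the function names (`parse_cpu_line`, `iowait_percent`, `sample_iowait`) are made up for illustration and are not part of any Proxmox tool.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a list of ints."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(value) for value in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two counter samples.

    Field order: user nice system idle iowait irq softirq steal [guest guest_nice].
    Guest time is already included in user time, so only the first 8 fields
    are summed for the total.
    """
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart, and return iowait in percent."""
    with open(path) as handle:
        before = parse_cpu_line(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = parse_cpu_line(handle.read())
    return iowait_percent(before, after)
```

Calling `sample_iowait(5.0)` on the host during a slow transfer gives a figure that should be roughly comparable to the IO delay graph on the dashboard, although the dashboard may calculate its value differently.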
 

Dear Rason,

Thanks for the reply and the suggestions.
I have already done, and repeated, some of these things, as I described earlier, but have not found the cause of the problem or a solution.

I'll try your suggestions in points 2, 3 and 4 again, though. Thanks.
 
I have encountered this exact same problem on multiple hosts, on hardware ranging from small mini computers to large enterprise-level Dell servers with RAID controllers. I have not been able to pinpoint the exact cause, but I do not remember this ever being an issue with versions of Proxmox prior to 7 or 8. I can't remember whether Proxmox 7 exhibited the issue or whether it started with Proxmox 8.

A simple test to see which Proxmox hosts exhibit the issue (some do not) is to create a CT or VM with a sizable disk, but one small enough to fit on `local` storage. There is no need to actually start it. Then use "Move Storage" to move the disk from where you created it to another storage device: if it was created on `local-lvm`, move it to `local`, and then back again. I have even done this to NFS-mounted storage on a NAS and noticed the same issue because of the limited bandwidth of the 1 gig network connection.
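
To put numbers on that test, per-disk throughput can be sampled from /proc/diskstats while the move is running. This is a rough Python sketch, assuming the standard Linux diskstats field order; the helper names `parse_diskstats` and `throughput_mb_s` are hypothetical and only for illustration.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of the device


def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 10:
            continue  # skip blank or truncated lines
        # parts: major minor name reads merged sectors_read ms writes merged sectors_written ...
        stats[parts[2]] = (int(parts[5]), int(parts[9]))
    return stats


def throughput_mb_s(before, after, seconds):
    """Return device name -> (read MB/s, write MB/s) between two samples."""
    rates = {}
    for name, (read_after, written_after) in after.items():
        if name not in before:
            continue  # device appeared between samples
        read_before, written_before = before[name]
        read_rate = (read_after - read_before) * SECTOR_BYTES / 1e6 / seconds
        write_rate = (written_after - written_before) * SECTOR_BYTES / 1e6 / seconds
        rates[name] = (read_rate, write_rate)
    return rates
```

Read /proc/diskstats once, wait a few seconds, read it again, and pass both parsed samples together with the elapsed time to `throughput_mb_s`. Comparing the source and target disks on an affected host and on a healthy one should show whether the move stalls at the 10-30 MB/s range reported above.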

Why does this happen on some systems and not on others running the exact same version of Proxmox? That would point to a hardware issue, but a number of the affected systems appear to have hardware identical to systems that do not have the problem, which suggests something else: maybe a driver, firmware, or kernel setting, or something like that.

This has become more than frustrating as there appears to be no resolution to the problem.

The only workaround I have figured out is to set "Bandwidth Limits" under the Datacenter Options, experimenting to find a reasonable value. The alternative is to just let the system become unresponsive whenever one or more processes generate heavy IO.
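
For reference, the Datacenter bandwidth limits are stored as a `bwlimit` line in /etc/pve/datacenter.cfg, with values in KiB/s. The Python sketch below only builds such a line from MiB/s values so it can be compared with what the GUI writes; the helper name `bwlimit_line` and the example values are made up, and the key list reflects the documented `bwlimit` options as I understand them, so check it against your own version.

```python
# Operation keys accepted by the datacenter.cfg 'bwlimit' option (assumed from the docs).
VALID_KEYS = ("clone", "default", "migration", "move", "restore")


def bwlimit_line(**limits_mib_s):
    """Build a 'bwlimit: key=value,...' line; inputs are MiB/s, output values are KiB/s."""
    unknown = sorted(set(limits_mib_s) - set(VALID_KEYS))
    if unknown:
        raise ValueError(f"unknown bwlimit keys: {unknown}")
    pairs = [
        f"{key}={int(limits_mib_s[key] * 1024)}"  # 1 MiB/s = 1024 KiB/s
        for key in VALID_KEYS
        if key in limits_mib_s
    ]
    return "bwlimit: " + ",".join(pairs)
```

For example, `bwlimit_line(move=50, migration=100)` produces a line limiting "Move Storage" to 50 MiB/s and migrations to 100 MiB/s. These numbers are placeholders, not recommendations; the right values still have to be found by experimenting on each host.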
 
I spent a day with an old NUC computer: I installed Proxmox 6, then Proxmox 7, then upgraded the kernel to 6.2 on Proxmox 7, then installed Proxmox 8, testing the IO delay on each. As it turns out, they all behave the same, so this was NOT something new in Proxmox 8 as I had thought.
 
