[SOLVED] High IO delay and temps

Airw0lf

Member
Apr 11, 2021
Team,

Since a few days ago there have been some changes in the behavior of the Proxmox server in my home lab.
On boot, there is a high IO delay (10-15%) for up to 30 minutes (versus far less than 1% before).
In addition, the monitoring tool is reporting a "temp1" rise from 55-60 °C to 70+ °C - probably the VRMs, but I cannot find any conclusive answer on that.

Does this ring any bells?
Any suggestions where to start analyzing this change?
 
Prophets, AFAIK, do not frequent this forum. So I'm perplexed - as I am by the numerous similar posts on this forum along the lines of "This morning my server is failing, what is wrong with it?" - how can you expect help when you provide zero info? I know you probably think, as those other posters do, that many PVE systems got booted this morning & are experiencing similar behavior to yours, but in real life that is usually not the case.

You need to provide at least basic HW (incl. storage), NW & Guest usage (LXC & VM) so we can even have a picture of what you are facing.

Now let's try some prophecy:

Since a few days
Analyze what within your setup has changed. Updates? Infrastructure? NW? etc.

On boot, there is a high IO delay (10-15%) for up to 30 minutes
Check the logs for that period?

a "temp1" rise from 55-60 °C to 70+ °C
After that 30-min boot period - do the temps settle? During the initial 30-min period what is accessible/inaccessible? Is something lagging then etc.?

I'd check the following - in this order:

  • Check logs for more info.
  • Physical internal inspection - focusing on the ventilation system, heat sink/s, PSU, cabling & board connections.
  • Disk/Storage checks. SMART data etc.
  • RAM check.
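
For the log, SMART & RAM checks, a rough starting point from the shell - assuming a standard PVE install with smartmontools present, and /dev/sda only as an example disk (adjust to your devices):

  # warnings and errors from the current boot
  journalctl -b -p warning

  # SMART health and error log for one disk (repeat per disk)
  smartctl -a /dev/sda

  # RAM is easiest tested offline, e.g. memtest86+ from the boot menu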

Good luck.
 

I know that it was rather an open question with nothing to go on.

Everything is working as expected - it's just different behavior.
There are no problems in the logs - comparing with before was not possible as the old logs were already purged.

It happened somewhere during the following series of changes over the last 7-10 days:
* paired the on-board adapter with one port of a 4-port network adapter in a mode 5 bond
* replaced the RAID-0 config of two 4-TB disks with a RAID-1 config of two 8-TB disks - both software RAID
* added an internal drive as a kind of intermediate storage
* added an external (USB) drive for backups
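
For reference, each of those changes can be sanity-checked from the shell. A minimal sketch, assuming the software RAID is Linux md with /dev/md0 as the array and bond0 as the bond name (if the mirror is ZFS instead, zpool status is the equivalent):

  # is the new RAID-1 still doing its initial resync? with 8-TB disks that can take many hours
  cat /proc/mdstat
  mdadm --detail /dev/md0

  # confirm the bond is in mode 5 (balance-tlb) and both slaves are up
  cat /proc/net/bonding/bond0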

The system is based on an Asus TUF gaming B550-Plus.
The CPU is an AMD Ryzen 7 5700X 8-core processor, with 64 GB of RAM (i.e. 2 modules of 32 GB).
One of the video card slots is equipped with a 2-port 10-Gbps network card, and the other with a 4-port 1-Gbps card.
Both of these network cards are PCIe x8 models.
The video card is a PCIe x1 model installed in the last PCIe x1 slot (i.e. the one closest to the PSU).

Cooling is provided by 2 fans running at maximum speed - one on the CPU and one above the cards.
The build is based on a Q300L case and is open on all sides.

Normally I have 2 VMs and 3 LXC containers running. One of these VMs generates around 10% CPU load.
This load comes from analyzing packets coming in via one of the 10-Gbps ports.
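
For completeness, a quick way to watch the IO delay and the "temp1" sensor while this is happening - assuming the sysstat and lm-sensors packages are installed:

  # per-disk utilization and wait times, refreshed every 5 seconds
  iostat -x 5

  # motherboard and CPU sensor readings - "temp1" is typically one of the motherboard labels here
  sensors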
 
@gfngfn256 and others

It looks like I was able to fix both issues.

The high temps seem to be the result of a BIOS update - it did a (factory?) reset of the fan settings (among a few other things). Once the original settings were restored, the temps went back to the low 50s.

The fans will be replaced with better-cooling ones, meaning that even if this were to happen again, the temps would not rise to that extent. This is because the maximum fan speed goes from 1800 rpm to 3300 rpm.

The higher I/O delay on boot (and in other areas) seems to be the result of adding more disk space: the original RAID-0 volume was replaced with a single disk, assuming the performance hit would be limited - which turned out to be a wrong assumption.

After switching back to the same RAID-0 setup, the boot time was back to normal - of course this time with 2 bigger disks compared to the original ones.
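
For anyone comparing disk layouts like this: Proxmox ships pveperf, which gives quick fsync and buffered-read numbers for a given path - e.g. against the default local storage (adjust the path to wherever the RAID volume is mounted):

  # quick storage benchmark of the directory holding the guests
  pveperf /var/lib/vz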

@gfngfn256 - thank you for your response and patience
 
Happy you got the issue sorted out.

Maybe mark this thread as Solved. At the top of the thread, choose the Edit thread button, then from the (no prefix) dropdown choose Solved.