[SOLVED] Proxmox cluster high IO delay

Ahmet Bas

Well-Known Member
Aug 3, 2018
Hi,


We have a Proxmox cluster with 5 hypervisors. We are using Ceph:

- 3 Ceph monitors
- 47 OSDs
- 2 pools (rbd & ceph hdd)
- 3 replicas

Each HV has 4x 10Gbit:
- 2x 10Gbit bond for network
- 2x 10Gbit bond for storage
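Since both client and storage traffic ride on bonded links, it is worth confirming the bonds themselves are healthy before digging into Ceph. A minimal sketch using the kernel's bonding status files (the interface names bond0/bond1 are assumptions, adjust to your setup):

```shell
# Sketch: check bond mode, link status and member interfaces on a node.
# /proc/net/bonding/<name> is provided by the Linux bonding driver.
for b in bond0 bond1; do
    if [ -r "/proc/net/bonding/$b" ]; then
        echo "== $b =="
        grep -E 'Bonding Mode|MII Status|Slave Interface' "/proc/net/bonding/$b"
    else
        echo "$b: not present on this host"
    fi
done
```

A bond silently running on one surviving member halves your storage bandwidth and can show up as IO delay under load.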

For some reason our hypervisors are under heavy IO Delay:

HV01
CPU usage 11% of 72 CPU(s)
Load average 10.60, 10.95, 10.85
IO delay 1.35%
RAM usage 72% of 377.76 GiB
SWAP usage 81.75% of 8GiB


HV02
CPU usage 21% of 72 CPU(s)
Load average 10.65, 10.80, 16.85
IO delay 2.10%
RAM usage 67% of 377.76 GiB
SWAP usage 56.90% of 8GiB

HV03
CPU usage 24% of 72 CPU(s)
Load average 20.33, 19.00, 18.25
IO delay 1.70%
RAM usage 67% of 377.76 GiB
SWAP usage 78.86% of 8GiB

HV04
CPU usage 29% of 40 CPU(s)
Load average 14.03, 13.33, 13.54
IO delay 0.49%
RAM usage 35.77% of 566.82 GiB
SWAP usage 0.00% of 8GiB

HV05
CPU usage 39% of 40 CPU(s)
Load average 14.03, 14.33, 14.08
IO delay 0.09%
RAM usage 45.77% of 377.82 GiB
SWAP usage N/A

The CPU usage is low, but for some reason we have a high IO delay. If we perform a backup/clone, it goes up by 5-10%. What is causing this high load? Any suggestions?
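One pattern worth noting in the numbers above: HV01-HV03 sit at roughly 57-82% swap used and show the highest IO delay, while HV04 (0% swap) has almost none. Heavy swapping turns memory pressure into disk IO. A minimal sketch to read swap usage on a node (plain /proc/meminfo, nothing Proxmox-specific):

```shell
# Sketch: compute swap-in-use percentage from /proc/meminfo (values in kB).
total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
free=$(awk '/^SwapFree:/  {print $2}' /proc/meminfo)
if [ "${total:-0}" -gt 0 ]; then
    used=$(( (total - free) * 100 / total ))
    echo "swap used: ${used}%"
else
    echo "no swap configured"
fi
```

If the swapped hosts match the high-IO-delay hosts, lowering vm.swappiness or adding RAM on those nodes would be worth testing.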
 
Yes, Proxmox is showing OK in the UI.
But PVE is not the only way to get the needed information. What can you find on the shell?
Please check the Ceph state in the shell to see whether there are currently more IOPS than normal, or any other visible problem.
 
Could you be more specific about which commands I should run and what we need to check?
 
The basic commands: iotop, top, htop, ceph -w, ceph -s, dmesg -T, vnstat, iftop.
You can also check your disks with smartctl, if you do not have such checks in your monitoring.
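The commands listed above can be wrapped into a small, read-only sweep so the output from each node lands in one log for comparison (a sketch; the exact command set and log path are illustrative):

```shell
# Sketch: collect the suggested non-interactive checks into one log per node.
# Commands that are not installed are noted and skipped.
log="/tmp/io-debug-$(hostname)-$(date +%s).log"
for c in "ceph -s" "ceph health detail" "ceph osd perf" "smartctl --scan"; do
    echo "== $c ==" >>"$log"
    if command -v "${c%% *}" >/dev/null 2>&1; then
        sh -c "$c" >>"$log" 2>&1 || echo "(command failed)" >>"$log"
    else
        echo "(${c%% *} not installed)" >>"$log"
    fi
done
echo "wrote $log"
```

`ceph osd perf` in particular reports per-OSD commit/apply latency; a single slow OSD is a common cause of cluster-wide IO delay.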

What do your long-term metrics from e.g. Check_MK or Grafana show? Is anything abnormal visible?
When was the high IO wait first recorded? Which actions were performed at that time, or shortly before? Do you run the newest PVE version and firmware on all of your hardware?
 
Everything seems to be normal. There are no running tasks at the moment, just the running VMs, which are spread over the 5 hypervisors. We are using 5.3.6, and we have already updated the firmware on our hypervisors. The strange part is that HV04 has more CPU usage than the other three hypervisors, yet it has the least IO delay.

HV01 - HV03 also have more CPU cores and better specs than HV04 and HV05.
 
Did you check the VMs themselves? Take a look at the metrics and then investigate the VMs which have more load than normal.

What about the IO delay in the past?
 
I have no record of the old values, but the UI was not laggy like it is now, nor did we have VM delays.
 
