[SOLVED] Proxmox cluster high IO delay

Ahmet Bas

Well-Known Member
Aug 3, 2018
Hi,


We have a Proxmox cluster with 5 hypervisors. We are using Ceph:

- 3 Ceph monitors
- 47 OSDs
- 2 pools (rbd & ceph hdd)
- 3 replicas

Each HV has 4x 10Gbit:
- 2x 10Gbit bond for network
- 2x 10Gbit bond for storage
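Since both client and storage traffic ride on bonded links, it is worth confirming the bonds themselves are healthy before digging into Ceph. A minimal sketch using the kernel's bonding status files (the interface names bond0/bond1 are assumptions, adjust to your setup):

```shell
# Sketch: check bond mode, link status and member interfaces on a node.
# /proc/net/bonding/<name> is provided by the Linux bonding driver.
for b in bond0 bond1; do
    if [ -r "/proc/net/bonding/$b" ]; then
        echo "== $b =="
        grep -E 'Bonding Mode|MII Status|Slave Interface' "/proc/net/bonding/$b"
    else
        echo "$b: not present on this host"
    fi
done
```

A bond silently running on one surviving member halves your storage bandwidth and can show up as IO delay under load.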

For some reason our hypervisors are under heavy IO Delay:

HV01
CPU usage 11% of 72 CPU(s)
Load average 10.60, 10.95, 10.85
IO delay 1.35%
RAM usage 72% of 377.76 GiB
SWAP usage 81.75% of 8GiB


HV02
CPU usage 21% of 72 CPU(s)
Load average 10.65, 10.80, 16.85
IO delay 2.10%
RAM usage 67% of 377.76 GiB
SWAP usage 56.90% of 8GiB

HV03
CPU usage 24% of 72 CPU(s)
Load average 20.33, 19.00, 18.25
IO delay 1.70%
RAM usage 67% of 377.76 GiB
SWAP usage 78.86% of 8GiB

HV04
CPU usage 29% of 40 CPU(s)
Load average 14.03, 13.33, 13.54
IO delay 0.49%
RAM usage 35.77% of 566.82 GiB
SWAP usage 0.00% of 8GiB

HV05
CPU usage 39% of 40 CPU(s)
Load average 14.03, 14.33, 14.08
IO delay 0.09%
RAM usage 45.77% of 377.82 GiB
SWAP usage N/A

The CPU usage is low, but for some reason we have a high IO delay. If we perform a backup/clone, it goes up by 5-10%. What is causing this high load? Any suggestions?
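One pattern worth noting in the numbers above: HV01-HV03 sit at roughly 57-82% swap used and show the highest IO delay, while HV04 (0% swap) has almost none. Heavy swapping turns memory pressure into disk IO. A minimal sketch to read swap usage on a node (plain /proc/meminfo, nothing Proxmox-specific):

```shell
# Sketch: compute swap-in-use percentage from /proc/meminfo (values in kB).
total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
free=$(awk '/^SwapFree:/  {print $2}' /proc/meminfo)
if [ "${total:-0}" -gt 0 ]; then
    used=$(( (total - free) * 100 / total ))
    echo "swap used: ${used}%"
else
    echo "no swap configured"
fi
```

If the swapped hosts match the high-IO-delay hosts, lowering vm.swappiness or adding RAM on those nodes would be worth testing.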
 
Yes, Proxmox is showing OK in the UI.
But PVE is not the only way to get the needed information. What can you find on the shell?
Please check the Ceph state in the shell to see whether there are currently more IOPS than normal, or any other visible problem.
 
Could you be more specific about which commands I should run and what we need to check?
 
The basic commands: iotop, top, htop, ceph -w, ceph -s, dmesg -T, vnstat, iftop.
You can also check your disks with smartctl, if you do not have such checks in your monitoring.
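The commands listed above can be wrapped into a small, read-only sweep so the output from each node lands in one log for comparison (a sketch; the exact command set and log path are illustrative):

```shell
# Sketch: collect the suggested non-interactive checks into one log per node.
# Commands that are not installed are noted and skipped.
log="/tmp/io-debug-$(hostname)-$(date +%s).log"
for c in "ceph -s" "ceph health detail" "ceph osd perf" "smartctl --scan"; do
    echo "== $c ==" >>"$log"
    if command -v "${c%% *}" >/dev/null 2>&1; then
        sh -c "$c" >>"$log" 2>&1 || echo "(command failed)" >>"$log"
    else
        echo "(${c%% *} not installed)" >>"$log"
    fi
done
echo "wrote $log"
```

`ceph osd perf` in particular reports per-OSD commit/apply latency; a single slow OSD is a common cause of cluster-wide IO delay.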

What do your long-term metrics from e.g. Check_MK or Grafana show? Is anything abnormal visible?
When was the high IO wait first recorded? Which actions were performed at that time, or shortly before? Do you run the newest PVE version and firmware on all of your hardware?
 
Everything seems to be normal. There are no running tasks at the moment, just the running VMs, which are spread over the 5 hypervisors. We are using 5.3.6, and we have already updated the firmware on our hypervisors. The strange part is that HV04 has more CPU usage than the other three hypervisors, yet it has the least IO delay.

HV01 - HV03 also have more CPU cores and better specs than HV04 and HV05.
 
Did you check the VMs themselves? Take a look at the metrics and then investigate the VMs which have more load than normal.

What about the IO delay in the past?
 
I have no record of the old values, but the UI was not laggy like it is now, nor did we have VM delays.
 
