Dedicated Ceph servers/cluster I/O delay

brucexx

I am running a dedicated 4-node Ceph cluster with 10 Gbps networks for the Ceph cluster and Ceph public networks over bonded interfaces:

proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve4

Each server has 12 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (2 sockets, 6 cores per socket) and 48 GB RAM, of which 18-20 GB is currently in use. I have 24 OSDs in total: 8 SSDs in an SSD pool and 16 x 1000K spinning drives in an HDD pool. Currently I have 39 VMs on the spinners and 18 VMs on the SSDs. What worries me is that on average I have twice as much I/O delay as CPU usage. So usually, at least for now, I have 2.5% CPU usage and 5% I/O delay. When I move drives to Ceph, the CPU usage jumps to 5% and the I/O delay to 10%.


[Attachment: 2020-02-05_22-31-48.jpg]

The I/O delay spike in the middle of the graph is due to the disk move to Ceph storage, but on average, as I said, I observe that the I/O delay is twice the CPU usage.

Should I be concerned, or is this normal?

Thank you
 
Hi,

in general, IO delay is nothing bad by itself.

Code:
- iowait: In a word, iowait stands for waiting for I/O to complete. But there
  are several problems:
  1. Cpu will not wait for I/O to complete, iowait is the time that a task is
     waiting for I/O to complete. When cpu goes into idle state for
     outstanding task io, another task will be scheduled on this CPU.
  2. In a multi-core CPU, the task waiting for I/O to complete is not running
     on any CPU, so the iowait of each CPU is difficult to calculate.
  3. The value of iowait field in /proc/stat will decrease in certain
     conditions.
  So, the iowait is not reliable by reading from /proc/stat.
Source: kernel Documentation/filesystems/proc.txt
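
As far as I know, the IO delay shown in the PVE summary is exactly this iowait share of total CPU time taken from /proc/stat. For reference, here is a minimal Python sketch (my own illustration, not the PVE code) that samples /proc/stat twice and reports how the total CPU time over the interval splits into busy, iowait and idle; the field order (user, nice, system, idle, iowait, irq, softirq, steal) is the standard one from proc(5).

Code:
#!/usr/bin/env python3
# Minimal sketch: sample /proc/stat twice and report how the total CPU time
# over the interval splits into busy, iowait and idle.
# Field order per proc(5): user nice system idle iowait irq softirq steal ...
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()  # aggregate "cpu" line
    return [int(x) for x in fields[1:9]]

def cpu_shares(interval=5.0):
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta) or 1
    idle, iowait = delta[3], delta[4]
    busy = total - idle - iowait
    return (100.0 * busy / total, 100.0 * iowait / total, 100.0 * idle / total)

if __name__ == "__main__":
    busy, iowait, idle = cpu_shares()
    print(f"busy {busy:.1f}%  iowait {iowait:.1f}%  idle {idle:.1f}%")

As the quoted documentation says, iowait is really just idle time while some task still has I/O outstanding: on a node with plenty of spare CPU, a few percent of iowait mostly means the cores have nothing else to do while Ceph waits on disks and the network, not that time is being taken away from your VMs.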

Do you have any problems with the system?
 
Thank you for that, Wolfgang. No problems so far, it is just strange. We updated both our clusters (PVE and Ceph on PVE) from 5.x, and before that I saw the opposite: the I/O wait was half the CPU usage; now it is the CPU usage that is half the I/O wait.

Is anybody else experiencing this?

Thank you
 
Hi,
I'm experiencing the same problem... I can see I/O wait spikes in Zabbix, also for various VMs on the cluster, after the Proxmox 6 and Ceph upgrade.

Here is a sample graph; we made the update at the point where the spikes become higher:

[Attachment: Schermata da 2021-08-03 18-11-51.png]


Any idea why? Is there a solution?

Thanks
 
Same here,
experiencing an increase in I/O waits and lower bandwidth since the newest updates.