Dedicated Ceph servers/cluster I/O delay

brucexx

I am running a dedicated 4-node Ceph cluster with 10 Gbps networks for the Ceph cluster and Ceph public networks over bonded interfaces:

proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve4

Each server has 12 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (2 sockets, 6 cores per socket) and 48 GB RAM, of which 18-20 GB is currently in use. I have 24 OSDs in total: 8 SSDs in an SSD pool and 16 x 1000K spinning drives in an HDD pool. Currently I have 39 VMs on the spinners and 18 VMs on the SSDs. What worries me is that on average I have twice as much I/O delay as CPU usage. So usually, at least for now, I have 2.5% CPU usage and 5% I/O delay. When I move drives to Ceph, the CPU usage jumps to 5% and the I/O delay to 10%.


[Attachment: 2020-02-05_22-31-48.jpg]

The I/O delay spike in the middle of the graph is due to the disk move to Ceph storage, but on average, as I said, I observe that the I/O delay is twice the CPU usage.

Should I be concerned, or is this normal?

Thank you
 
Hi,

in general, IO delay is nothing bad by itself.

Code:
- iowait: In a word, iowait stands for waiting for I/O to complete. But there
  are several problems:
  1. Cpu will not wait for I/O to complete, iowait is the time that a task is
     waiting for I/O to complete. When cpu goes into idle state for
     outstanding task io, another task will be scheduled on this CPU.
  2. In a multi-core CPU, the task waiting for I/O to complete is not running
     on any CPU, so the iowait of each CPU is difficult to calculate.
  3. The value of iowait field in /proc/stat will decrease in certain
     conditions.
  So, the iowait is not reliable by reading from /proc/stat.
Source: kernel Documentation/filesystems/proc.txt
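
As far as I know, the IO delay shown in the PVE summary is exactly this iowait share of total CPU time taken from /proc/stat. For reference, here is a minimal Python sketch (my own illustration, not the PVE code) that samples /proc/stat twice and reports how the total CPU time over the interval splits into busy, iowait and idle; the field order (user, nice, system, idle, iowait, irq, softirq, steal) is the standard one from proc(5).

Code:
#!/usr/bin/env python3
# Minimal sketch: sample /proc/stat twice and report how the total CPU time
# over the interval splits into busy, iowait and idle.
# Field order per proc(5): user nice system idle iowait irq softirq steal ...
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()  # aggregate "cpu" line
    return [int(x) for x in fields[1:9]]

def cpu_shares(interval=5.0):
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta) or 1
    idle, iowait = delta[3], delta[4]
    busy = total - idle - iowait
    return (100.0 * busy / total, 100.0 * iowait / total, 100.0 * idle / total)

if __name__ == "__main__":
    busy, iowait, idle = cpu_shares()
    print(f"busy {busy:.1f}%  iowait {iowait:.1f}%  idle {idle:.1f}%")

As the quoted documentation says, iowait is really just idle time while some task still has I/O outstanding: on a node with plenty of spare CPU, a few percent of iowait mostly means the cores have nothing else to do while Ceph waits on disks and the network, not that time is being taken away from your VMs.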

Do you have any problems with the system?
 
Thank you for that, Wolfgang. No problems so far, it is just strange. We updated both our clusters (PVE and Ceph on PVE) from 5.x, and before that I saw the opposite: the I/O wait was half the CPU usage; now it is the CPU usage that is half the I/O wait.

Is anybody else experiencing this?

Thank you
 
Hi,
I'm experiencing the same problem... I can see I/O wait spikes in Zabbix, also for various VMs on the cluster, after the Proxmox 6 and Ceph upgrade.

Here is a sample graph; we made the update at the point where the spikes become higher:

[Attachment: Schermata da 2021-08-03 18-11-51.png]


Any idea why? Is there a solution?

Thanks
 
Same here,
experiencing an increase in I/O waits and lower bandwidth since the newest updates.