Dedicated Ceph servers/cluster I/O delay

brucexx

Renowned Member
Mar 19, 2015
I am running a dedicated 4-node Ceph cluster, with 10Gbps networks for the Ceph cluster and Ceph public networks over bonded interfaces:

proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve4

Each server has 12 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (2 sockets, 6 cores per socket) and 48 GB RAM, of which 18-20 GB is currently in use. I have 24 OSDs in total: 8 SSDs in an SSD pool and 16 x 1000K spinner drives in an HDD pool. Currently I have 39 VMs on the spinners and 18 VMs on the SSDs. What worries me is that on average my I/O delay is twice my CPU usage. So for now I typically see 2.5% CPU usage and 5% I/O delay. When I move disks to Ceph, CPU usage jumps to 5% and I/O delay to 10%.


2020-02-05_22-31-48.jpg

The I/O delay spike here in the middle is due to the disk move to Ceph storage, but on average, as I said, the I/O delay is twice the CPU usage.

Should I be concerned, or is this normal?

Thank you
 
Hi,

IO delay is nothing bad in general.

Code:
- iowait: In a word, iowait stands for waiting for I/O to complete. But there
  are several problems:
  1. Cpu will not wait for I/O to complete, iowait is the time that a task is
     waiting for I/O to complete. When cpu goes into idle state for
     outstanding task io, another task will be scheduled on this CPU.
  2. In a multi-core CPU, the task waiting for I/O to complete is not running
     on any CPU, so the iowait of each CPU is difficult to calculate.
  3. The value of iowait field in /proc/stat will decrease in certain
     conditions.
  So, the iowait is not reliable by reading from /proc/stat.
Source: kernel Documentation/filesystems/proc.txt
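
To illustrate how those counters turn into the iowait percentage a monitoring tool reports, here is a minimal Python sketch that computes the iowait share of the delta between two "cpu" lines from /proc/stat. The sample counter values are made up for illustration; real tools sample the live file at an interval.

```python
# Minimal sketch: iowait share of CPU time between two /proc/stat
# "cpu" samples. Field layout (see man 5 proc):
# user nice system idle iowait irq softirq steal guest guest_nice
def iowait_percent(sample_before, sample_after):
    before = [int(x) for x in sample_before.split()[1:]]
    after = [int(x) for x in sample_after.split()[1:]]
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    # iowait is the 5th field (index 4); as the kernel docs quoted
    # above note, this counter is only an approximation.
    return 100.0 * deltas[4] / total if total else 0.0

# Two hypothetical samples taken some interval apart:
t0 = "cpu 100000 200 50000 900000 40000 0 1000 0 0 0"
t1 = "cpu 100300 200 50100 901000 40100 0 1010 0 0 0"
print(round(iowait_percent(t0, t1), 1))  # → 6.6
```

This is also why the same workload can show high iowait with low CPU usage: the iowait field grows whenever cores sit idle with I/O outstanding, regardless of how much computation is happening.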

Have you any problems with the system?
 
Thank you for that, Wolfgang. No problems so far; it is just strange. We updated both our clusters (PVE, and Ceph on PVE) from 5.x, and before that I saw the opposite: the I/O wait was half the CPU usage, and now the CPU usage is half the I/O wait.

Is anybody else experiencing this ?

Thank you
 
Hi,
I'm experiencing the same problem... I can see I/O wait spikes in Zabbix for various VMs on the cluster after the Proxmox 6 and Ceph upgrade.

Here is a sample graph; the spikes became higher when we made the update:

Schermata da 2021-08-03 18-11-51.png


Any idea why? Is there a solution?

Thanks
 
Same here,
experiencing increased I/O waits and lower bandwidth since the newest updates.
 
