RAID drive causing Ceph slowness

CTCcloud
We had a drive go bad in a local RAID array on a Proxmox node that runs most of its VMs on Ceph, with just a handful on that RAID. We replaced the drive and it began to rebuild. The problem is that the rebuild badly slowed down all of the Ceph-backed VMs as well, with IO delay through the roof at around 15-20%.

The server is a Dell R730 with an H730P RAID controller and 256GB of RAM. The RAID volume is RAID 50 made up of ten 2TB SATA drives.

Anyway, why would local storage cause problems for Ceph? What can be done to nix or mitigate the problem?

Thanks in advance
 
Hi,

do I understand this correctly: your local RAID has nothing to do with Ceph (no OSD is on this RAID)? And are the VMs on Ceph or on the RAID?
 
Hi,
do you have OSDs connected to the same controller, as RAID-0? Or does this node run only a Ceph monitor (no OSDs)?

Udo
 
There are 9 nodes in the cluster. 4 nodes are Ceph-only nodes. The node I am speaking of only runs VMs as a Ceph client. There are no OSDs nor any Ceph configuration on this node.
 
Do you have a Ceph monitor running on this node?
 
The default scheduler, "deadline", is in use.

I attempted to switch to cfq and it seemed to help a little, but it didn't fix the issue. The issue didn't go away until the node was rebooted.
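For anyone wanting to check or switch the scheduler the same way, it goes through sysfs; here's a minimal sketch (device names such as sda are only examples, root is required, and the change does not persist across reboots):

#!/usr/bin/env python3
# Minimal sketch: show the I/O scheduler for each local SCSI/SATA disk
# and optionally switch one. Mapped RBD devices offer no scheduler to tune.
import glob
import sys

def show_schedulers():
    for path in sorted(glob.glob("/sys/block/sd*/queue/scheduler")):
        dev = path.split("/")[3]
        with open(path) as f:
            # the active scheduler is shown in brackets, e.g. "noop [deadline] cfq"
            print(dev + ": " + f.read().strip())

def set_scheduler(dev, name):
    # takes effect immediately, lost on reboot
    with open("/sys/block/%s/queue/scheduler" % dev, "w") as f:
        f.write(name)

if __name__ == "__main__":
    show_schedulers()
    if len(sys.argv) == 3:  # e.g.: python3 sched.py sda cfq
        set_scheduler(sys.argv[1], sys.argv[2])
        show_schedulers()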

What we'd like to find out is whether there is a recommended scheduler for this kind of setup, whether we should look somewhere else for the answer, or whether this is a kernel bug/regression. We don't want this to happen again, as it affected a number of customers in the middle of the business day.

Again, there are a few VMs on local storage on that server, but the great majority use Ceph as their backend storage, not the local RAID. RBD devices don't accept scheduler adjustments, so there is nothing to do there.

What I'm still not understanding is how a drive rebuild on the local storage could affect remote Ceph activity. I can understand the rebuild affecting the few VMs that are on the local storage, but this was affecting EVERYTHING on that node.
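One way to confirm whether it really is the whole node or just the local volume is to sample per-device busy time from /proc/diskstats while the rebuild runs; a rough sketch (the sd/rbd name prefixes are just examples):

#!/usr/bin/env python3
# Rough sketch: approximate per-device %util (like iostat) by sampling
# field 13 of /proc/diskstats (milliseconds spent doing I/O).
import time

def io_ms():
    busy = {}
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            name = parts[2]
            if name.startswith(("sd", "rbd")):  # local disks and mapped RBD images
                busy[name] = int(parts[12])
    return busy

INTERVAL = 5.0
before = io_ms()
time.sleep(INTERVAL)
after = io_ms()

for dev in sorted(after):
    delta = after[dev] - before.get(dev, after[dev])
    print("%s: %5.1f%% busy" % (dev, 100.0 * delta / (INTERVAL * 1000.0)))

If only the local sdX volume behind the controller sits near 100% while the rbd devices stay low, that would point back at the host/controller rather than at Ceph itself.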
 
Yes, we use krbd.
Our environment is 95%+ KVM, but we do have a couple of containers on Ceph, which means krbd is required.

Note that you can create two storage definitions in Proxmox on the same RBD pool: one with krbd for your containers and one without krbd for your VMs.
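In /etc/pve/storage.cfg that could look roughly like this (the storage IDs, pool name and monitor addresses below are only placeholders):

rbd: ceph-vm
        pool rbd
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        content images
        username admin

rbd: ceph-ct
        pool rbd
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        content rootdir
        username admin
        krbd 1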

For your problem I really don't know; maybe the RAID rebuild has an impact on the I/O scheduler on the host.
 
Yes, I'm familiar with being able to create more than one storage config per pool, but thanks for mentioning it anyway.

Yeah, I would still like to know if there's anything that can be done to mitigate this issue in the future. If the cfq scheduler is better for the local disks, I've got no problem adjusting the scheduler... would cfq be worthwhile on the Ceph nodes themselves? What measures can be taken to prevent this from happening again? We got a lot of "egg on our faces" with the customers over this, and we need to prevent it if possible.
 
