RAID 1 and CEPH

jgibbons

New Member
Feb 4, 2021
Hi guys!

I read that CEPH does not support hardware RAID. Does this mean that I cannot create a virtual disk in DELL PERC, consisting of two disks in RAID 1, and pass it to CEPH? I like the idea of the CEPH drives having extra redundancy.
 
Hi!

Will it somehow work? Probably, at least in a broad sense.

Will it cause problems, result in poorer performance, and maybe even lead to subtle errors where no one can really help? Highly probable.

I like the idea of the CEPH drives having extra redundancy.
Ceph is the redundancy! At least it should be.

Ceph really wants to manage disks directly and alone, i.e., nothing more than a plain dumb HBA, SATA controller, or PCIe bus in between, and no RAID of any kind.

Just add more OSDs spread over different hosts (and, if you want, different racks/rooms) and let Ceph handle redundancy alone.
It can do so much better (at object-level granularity, with a view of the full cluster and all its disks) than any plain RAID, especially proprietary HW RAID controllers, which in my experience aren't really a joy to work with...
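As a minimal sketch (the device name is a placeholder for your setup), adding a raw disk as an OSD on a Proxmox VE node and checking the resulting layout looks roughly like this:

  # add a whole, unpartitioned disk as an OSD, with no RAID volume in between
  pveceph osd create /dev/sdX

  # show the CRUSH hierarchy (root -> host -> OSD) that Ceph uses for placement
  ceph osd tree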
 
One more note: for Ceph to work correctly you need a cluster of at least 3 nodes!
Ahh yes, I did note that. I come from a VMware vSAN background. We are planning on launching a 4-node cluster for extra redundancy. Does this mean that we could configure 2 physical disks in each server with the ability to lose a single disk per node?
 
Does this mean that we could configure 2 physical disks in each server with the ability to lose a single disk per node?

It can even be more than that, depending on space usage and the time between (disk) outages.

At its core Ceph is an object store; each object is stored multiple times in the cluster in a redundant way. Normally the 3/2 config is chosen, meaning three copies per object, with two of those copies required to be written out before Ceph gives the writer the OK (avoiding split brain and such).
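As a rough illustration (the pool name here is just a placeholder, not a recommendation), the 3/2 setting corresponds to the size/min_size knobs of a replicated pool:

  # Proxmox VE wrapper: create a replicated pool with 3 copies, 2 required before writes are acked
  pveceph pool create mypool --size 3 --min_size 2

  # the same knobs on an existing pool, via plain Ceph
  ceph osd pool set mypool size 3
  ceph osd pool set mypool min_size 2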
Now you'll say: yeah well any raid does that, big deal. But there's more:
  1. Ceph does the placement in a smart way. For example, it won't save an object twice on the same OSD (= a disk in Ceph terminology), and it avoids placing copies on another OSD in the same host if possible.
  2. Once an OSD (disk) or host goes down, Ceph notices and tries to restore the lost object copies by using the other copies and re-doing its smart placement until enough copies are available again. This means that as long as there's enough space for the copies and enough different OSDs to spread them across, all is well (mostly; if OSDs start to fail by the minute you may have a more general problem ;)).
  3. This is all handled fully automatically; no stressful, manually triggered resilver process is required. If an OSD fails, just plug in a new disk, set it up as an OSD (can be done over the web interface; a CLI sketch follows below), and Ceph does the rest. Even if a whole host burns down, set up Proxmox VE on a new one, add it to the cluster, install Ceph, and you will be good again.
That, combined with cheap snapshots, thin provisioning, division of the total space capacity into pools and namespaces for different projects/users/..., and scalability in all directions, makes it a really good universal solution for shared storage.
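As mentioned in point 3, the disk-replacement dance can also be done on the command line; a minimal sketch (OSD ID and device name are placeholders, not a full procedure):

  # remove the dead OSD from the cluster after the failed disk has been identified
  ceph osd out 7
  pveceph osd destroy 7 --cleanup

  # create a new OSD on the replacement disk; Ceph rebalances on its own afterwards
  pveceph osd create /dev/sdX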

You may want to check out the Ceph chapter in our reference documentation: https://pve.proxmox.com/pve-docs/chapter-pveceph.html
 
I have seen that if the SSD that I use to hold the Ceph DB breaks, it completely takes down all the OSDs that are linked to that DB.
In that case, can the DB disk I am using be part of a mirror at the HW RAID level?
If that is not possible due to performance issues, what would be the procedure to bring the OSDs back online?
 
I have seen that if the SSD that I use to hold the Ceph DB breaks, it completely takes down all the OSDs that are linked to that DB.
The short answer is: don't share a DB device across multiple OSDs unless you have a sufficiently large deployment. If the deployment is large enough, multiple OSDs out on a single node do not pose a significant risk.

But to answer your question specifically: yes, you can put your DB device on a RAID controller. I've never done it, so I can't speak to its performance, but a SAS 12 or 24 Gb/s controller would probably provide reasonable results.
what would be the procedure to bring the OSDs back online?
If the DB device is truly dead, you'll need to wipe all affected OSDs and recreate them. If it's alive and present, just rescan the LVMs and bring them back online (a reboot is the simplest way to accomplish this).
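For the "alive and present" case there is also a reboot-free variant; a minimal sketch, assuming the standard ceph-volume tooling that Proxmox VE uses underneath:

  # scan the LVM metadata of all OSDs on this node and start the matching OSD services
  ceph-volume lvm activate --all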
 
With Ceph it looks quite hard to locate a failed physical disk.
It has been practically impossible to send a non-technical onsite person to just replace a failed Proxmox Ceph hard disk/SSD.
This onsite person needs information about which disk to replace. Typical enterprise-level servers with HW RAID blink the failed disk on the front panel,
but with Ceph, HDs behind a plain HBA don't blink anything, especially if the HD is totally dead :(
So we are looking for a solution using HW RAID as an easy way to locate a failed HD.

To put it another way: an environment where it is "not possible to locate a failed disk in the onsite server room" is a show stopper for using/selling Proxmox or anything else.
 
There is nothing specific to hardware RAID that makes your drive blink; it is the disk controller that can do this, either through software in the OS or through a management mode. Dell iDRAC systems, at least, will do this whenever there is a SMART issue, but if you have a failed drive without a SMART error, or the drive is completely dead, you still have to manually indicate which drive it is. (You may be able to make a light flash, but that is not always reliable; my old SuperMicro wouldn't flash if it didn't have a functional drive plugged in, which is useless when the drive is dead.)

How to flash a light depends on the controller, the BMC, and the enclosure. If you are using an LSI controller there is a utility for it; Dell can do it through iDRAC, HP through iLO, or you can go through IPMI, Redfish, or whatever management plane you have available.

Knowing which drive it is comes down to tracing back what the backing drive is: use smartctl to get its serial number and then use the appropriate command to flash the light of the bay that disk is located in, which means your enclosure must also be able to "know" which drive is in which bay.
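A hedged example of that trace-back, assuming an LSI SAS3 HBA; the device name, controller number, and enclosure:slot are all placeholders for your own setup:

  # 1) see which block device sits behind which OSD on this node
  ceph-volume lvm list

  # 2) read the serial number so remote hands can verify the label on the drive carrier
  smartctl -i /dev/sdX | grep -i serial

  # 3) blink the locate LED for that bay via the HBA tool (controller 0, enclosure 2, slot 5)
  sas3ircu 0 locate 2:5 ON
  # sas3ircu 0 locate 2:5 OFF   # turn it off again after the swap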

There is not a single way of doing this across every backplane. This isn't an issue with Proxmox or Linux; it is a matter of whether your server has a proper management interface.

Here is a script that attempts to do this provided you have a proper SES enclosure. https://gitlab.com/ole.tange/tangetools/-/blob/master/blink/blink
 
It has been practically impossible to send a non-technical onsite person to just replace a failed Proxmox Ceph hard disk/SSD.
Remote hands are quite adept at following instructions, but that can only be as successful as your hardware is at being identifiable. Most enterprise-level servers have ways for you to identify drives and turn on ID lights.

But with Ceph, HDs behind a plain HBA don't blink anything, especially if the HD is totally dead
Unless you're using a PC with onboard SATA, your SAS HBA has a means to blink slots using the HBA control tool. If you have an LSI SAS HBA, that is sas2ircu or sas3ircu. You're not the first or the last with this requirement ;)

So we are looking for a solution using HW RAID as an easy way to locate a failed HD.
Same idea: a RAID volume with a failed member will blink the defective slot. You can also use the controller's RAID tool to do so directly; for LSI-based controllers that's MegaCli. As for easy... if you find administering PVE with a CLI difficult, you may need a different environment. A CLI isn't difficult, but it does require an operator who can read and understand.
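For reference, the MegaCli incantation looks roughly like this (enclosure and slot IDs are placeholders; check the PDList output for your own values):

  # list physical drives with their enclosure/slot IDs and firmware state
  megacli -PDList -aALL

  # start/stop the locate LED for enclosure 252, slot 3 on adapter 0
  megacli -PdLocate -start -physdrv[252:3] -a0
  megacli -PdLocate -stop -physdrv[252:3] -a0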
To put it another way: an environment where it is "not possible to locate a failed disk in the onsite server room" is a show stopper for using/selling Proxmox or anything else.
Agreed. Make sure you don't. In practice, that means no consumer-grade computers.
You may be able to make a light flash, but that is not always reliable; my old SuperMicro wouldn't flash if it didn't have a functional drive plugged in, which is useless when the drive is dead.
Yeah, that is true. In such a condition, you can usually use sas2/3ircu to find the slot that is missing and tell remote hands which row/column to pull. Their midplanes always use the same order (bottom left to top right).

When you deploy a server, take note of the serial numbers of each drive and where it is located, so you can instruct the operator which drive to replace.
You could, but that seems like a LOT of bookkeeping for something that doesn't really give you any utility. Worse, it just invites disaster, since there is no positive ID before you pull the drive. Just blink the failed drive when it fails.
 
You could, but that seems like a LOT of bookkeeping for something that doesn't really give you any utility. Worse, it just invites disaster, since there is no positive ID before you pull the drive. Just blink the failed drive when it fails.
Agreed that blinking is infinitely useful, but we do that most of the time anyway, to keep an inventory of what is where and why, keep track of warranty/leasing periods, etc. It's just once before deploying the server and one change if you need to replace a drive.
 
