> I like the idea of the Ceph drives having extra redundancy.
Ceph is the redundancy! At least, it should be.
> One more note: for Ceph to work correctly you need a cluster of at least 3 nodes!
Ah yes, I did note that. I come from a VMware vSAN background. We are planning on launching a 4-node cluster for extra redundancy. Does this mean that we could configure 2 physical disks in each server with the ability to lose a single disk per node?
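One way to see how your pools are actually protected is the standard Ceph CLI; a minimal sketch, where the pool name rbd is only a placeholder:
```
# How many copies does the pool keep, and how many must survive for I/O to continue?
ceph osd pool get rbd size        # default: 3
ceph osd pool get rbd min_size    # default: 2
# Which failure domain is used when placing those copies (default: host)?
ceph osd crush rule dump
```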
> Does this mean that we could configure 2 physical disks in each server with the ability to lose a single disk per node?
> I have seen that if the SSD disk that I use to store the Ceph DB breaks, it completely drops all the OSDs that are linked to that Ceph DB.
The short answer is: don't share a DB device across multiple OSDs unless you have a sufficiently large deployment. If the deployment is large enough, having multiple OSDs out on one node does not pose a significant risk.
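If you are not sure which OSDs would be taken down by a given DB device, the standard tooling can show the mapping; a quick sketch (the OSD id 12 is just an example, and the metadata field names vary a little between releases):
```
# On the node itself: every OSD plus the block and db devices backing it.
ceph-volume lvm list
# From anywhere in the cluster, for a single OSD:
ceph osd metadata 12 | grep -i 'db'
```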
> What would be the procedure that must be carried out so that the OSDs are back online?
If the DB device is truly dead, you'll need to wipe all the affected OSDs and recreate them. If it's alive and present, just rescan the LVMs and bring them back online (a reboot is the simplest way to accomplish this).
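A sketch of that rescan for the case where the DB SSD is present again, run on the affected node (the OSD id 12 is only an example):
```
# Re-discover the OSD LVM volumes and start the matching ceph-osd services.
ceph-volume lvm activate --all
# Confirm the OSDs are back up and in.
ceph osd tree
# If a single OSD still refuses to start, inspect its service unit.
systemctl status ceph-osd@12
```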
> It has been quite impossible to send a non-technical on-site person to just replace a failed Proxmox Ceph hard disk/SSD.
Remote hands are quite adept at following instructions, but that can only be as successful as your hardware is at being identifiable. Most enterprise-level servers have ways for you to identify drives and turn on ID lights.
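Besides the vendor tools, the ledmon package on Linux can often drive slot LEDs through SES/enclosure management; whether it works depends on the backplane, so treat this as a sketch (/dev/sdX is a placeholder):
```
apt install ledmon              # on Proxmox/Debian
ledctl locate=/dev/sdX          # blink the locate LED for the slot holding that drive
ledctl locate_off=/dev/sdX      # turn the LED off again after the swap
```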
> But with Ceph, HDs behind a plain HBA do not blink anything, especially if the HD is totally dead.
Unless you're using a PC with onboard SATA, your SAS HBA has a means to blink slots using the HBA control tool. If you have an LSI SAS HBA, that is sas2ircu or sas3ircu. You're not the first or the last with this requirement.
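A sketch with sas2ircu (sas3ircu takes the same arguments); the controller index 0 and the 2:5 enclosure:slot pair below are example values you would read off the DISPLAY output on your own box:
```
sas2ircu LIST              # find the controller index (usually 0)
sas2ircu 0 DISPLAY         # enclosure/slot, model and serial of every attached drive
sas2ircu 0 LOCATE 2:5 ON   # blink the LED of enclosure 2, slot 5
sas2ircu 0 LOCATE 2:5 OFF  # stop blinking once the drive has been swapped
```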
> So we are looking for a solution using HW RAID as an easy way to locate a failed HD.
Same idea: a RAID volume with a failed member will blink the defective slot. You can also use the controller's RAID tool to do so directly; for LSI-based controllers that's MegaCli. As for "easy": if you find administering PVE with a CLI difficult, you may need a different environment. A CLI isn't difficult, but it does require an operator who can read and understand.
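For LSI/Avago MegaRAID controllers, the MegaCli incantation looks roughly like this (enclosure 252 and slot 3 are example values; newer cards ship storcli instead):
```
MegaCli -PDList -aALL                            # enclosure/slot and firmware state of every disk
MegaCli -PdLocate -start -PhysDrv[252:3] -aALL   # blink the LED of enclosure 252, slot 3
MegaCli -PdLocate -stop -PhysDrv[252:3] -aALL    # stop blinking
```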
> Let me put it another way: it is a showstopper to use or sell Proxmox, or any environment where it is not possible to locate a failed disk in the on-site server room.
Agreed. Make sure you don't end up in that situation; in practice, that means no consumer-grade computers.
> You may be able to make a light flash, but that is not always reliable; my old SuperMicro wouldn't flash if it didn't have a functional drive plugged in, which is useless when the drive is dead.
Yeah, that is true. In such a condition, you can usually use sas2ircu/sas3ircu to find the slot that is missing and tell remote hands which row/column to pull. Their midplanes always use the same order (bottom left to top right).
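A sketch of that hunt: dump what the HBA still sees and compare it against the bays that are physically populated; the bay that no longer shows up holds the dead drive (field labels can differ slightly between sas2ircu and sas3ircu versions):
```
# Enclosure, slot and serial for every drive the controller can still talk to.
sas2ircu 0 DISPLAY | grep -E 'Enclosure #|Slot #|Serial No'
```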
> When you deploy a server, take note of the serial numbers of each drive and where it is located, so you can instruct the operator which drive to replace.
You could, but that seems like a LOT of bookkeeping for something that doesn't really give you any utility. Worse, it just invites disaster, since there is no positive ID before you pull the drive. Just blink the failed drive when it fails.
> You could, but that seems like a LOT of bookkeeping for something that doesn't really give you any utility. Worse, it just invites disaster, since there is no positive ID before you pull the drive. Just blink the failed drive when it fails.
Agreed that blinking is infinitely useful, but we do that bookkeeping most of the time anyway, to keep an inventory of what is where and why, and to keep track of warranty/leasing periods, etc. It's just done once before deploying the server, plus one change whenever you replace a drive.
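If you do keep such an inventory, the one-time capture is cheap; a sketch with standard tools (/dev/sdX is a placeholder, and ceph device ls needs a reasonably recent Ceph release):
```
# Serial and model of every whole disk as seen by the kernel.
lsblk -d -o NAME,MODEL,SERIAL,SIZE
# The same mapping cluster-wide, as Ceph records it per host and daemon.
ceph device ls
# Full identity of one disk, e.g. to double-check before pulling it.
smartctl -i /dev/sdX
```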