Ceph: using a SW-RAID as journal

MasterTH

Jun 12, 2009
Hi,

I was wondering whether it is possible to use an SSD software-RAID device as the journal for Ceph.
Adding it manually with pveceph works, but then the OSD does not show up in the OSD list in the web interface and the device is marked as down.

I know it is not recommended, but the alternative, just a single SSD for the journal, is not the optimal setup either. SW-RAID adds a bit more safety for me.

kind regards
 
I assume you can configure this manually, but it does not make any sense to me.

I suggest you follow the recommendations from the Ceph team.
 
While you can do it manually, you should not.

I know it is not recommended, but the alternative, just a single SSD for the journal, is not the optimal setup either. SW-RAID adds a bit more safety for me.

It should not add more safety (Ceph is designed so that you can lose any component of a cluster without encountering data loss).

Can you post your hardware details, so we can make more educated guesses about why you might think a "software RAID journal" is safer?
 
I've got a three-node cluster with 4x3TB, 128GB RAM, 2x128GB SSDs, and no RAID controller.
Proxmox is set up on top of a Debian SW-RAID setup.

The reason why I do it this way is quite simple: SW-RAID is, in my opinion, the cheapest way to get data protection.

This setup should give me the possibility to make some VMs HA and others not.
 
Hi,
do I understand you correctly that you want the Ceph journal running on the same SSDs as the OS?
And are the SSDs enterprise class or consumer?

Anyway, using an md device for Ceph is a bad idea. Ceph does a dsync on every write (AFAIK; the kind of sync depends on whether the journal is block-based or file-based). Do you know at which point the md device acks the dsync? Only once the data has been written to both SSDs and synced? If so, it's safe, but you have one more layer and lose performance.
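
To make the "dsync on every write" point concrete, here is a minimal sketch of what a synchronous data write looks like from an application's point of view. It is not Ceph's journal code; the function name `dsync_write` and the scratch file are made up for illustration. The `write()` only returns once the layer below reports the data as durable, which for an md device is exactly the open question raised above.

```python
import os
import tempfile

# O_DSYNC asks the kernel to return from write() only after the data has
# reached stable storage. Fall back to O_SYNC (or no flag) where the
# platform does not expose O_DSYNC.
DSYNC_FLAG = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))


def dsync_write(path: str, payload: bytes) -> int:
    """Write payload with synchronous data semantics; return bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | DSYNC_FLAG, 0o600)
    try:
        return os.write(fd, payload)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical scratch file standing in for a journal device.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "journal-test.bin")
        written = dsync_write(target, b"x" * 4096)
        print(f"wrote {written} bytes synchronously")
```

Timing many such small writes against a single SSD and against an md mirror would be one way to see the cost of the extra layer.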

I suggest a good journal SSD like the Intel DC S3700; they are quite stable.

Udo
 
But what happens when this one single SSD fails? The journal is gone and all of the data has to be rewritten, right?

Let's break down what Ceph is designed for:
  • the ability to take the loss of one OSD (or multiple)
  • the ability to take the loss of one node (or multiple)
  • the ability to take the loss of one rack (or multiple)
  • ... all the way up to one campus/region/datacenter ...
All you have to do is plan for it and configure it right.

Let's say you have 3 nodes. That lands you in "<=node failure country". Let's say one node completely dies (which is what losing the single SSD that houses the journal for your 4 OSDs will do for you), and let's say you are running your pool in replicated 3/1 mode.
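
As a rough sketch of what "replicated 3/1" implies, reading it as size=3 and min_size=1: the pool keeps serving I/O as long as at least `min_size` replicas of the data survive. The function below is a toy model of that rule, not Ceph code, and its name is invented for illustration.

```python
def pool_serves_io(size: int, min_size: int, lost_replicas: int) -> bool:
    """Return True while enough replicas survive to keep serving I/O."""
    if not 0 <= lost_replicas <= size:
        raise ValueError("lost_replicas must be between 0 and size")
    surviving = size - lost_replicas
    return surviving >= min_size


if __name__ == "__main__":
    # One replica per node on a 3-node cluster, pool in 3/1 mode.
    for lost in range(4):
        print(f"3/1 pool, {lost} replica(s) lost -> I/O possible: "
              f"{pool_serves_io(3, 1, lost)}")
```

With one node gone, two replicas remain, so under this model the pool stays available.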

What happens is the following:
  • you lose all the data on said OSDs
    • you have 2 more copies of said data, so why would you care?
  • you get only 2/3 of your write and read performance (because you lost 1/3 of your OSDs)
    • better than total downtime (am I right?)
    • Unless you have 10G links you won't be maxing out your OSD read/write performance anyway, because 1G = 125 MB/s and 2x1G = 250 MB/s. Your SSD (even with the journals on it) will likely be faster.
      • don't even get me started on the bandwidth 4 OSDs can push.
  • Once you are rebuilding your OSDs, your network becomes congested with the backfilling. That leads to you not even getting your 2/3 of the performance, since you are using bandwidth for the backfilling.
    • There are settings for Ceph that make this quite painless.
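
The link-speed arithmetic in the list above can be checked with a few lines. This is only a back-of-the-envelope sketch: it uses decimal units, ignores protocol overhead, and assumes every node carries the same number of OSDs. The helper names are made up.

```python
def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gbit/s to MB/s (decimal units, no overhead)."""
    return gbit_per_s * 1000 / 8


def remaining_osd_fraction(nodes_total: int, nodes_failed: int) -> float:
    """Fraction of OSDs left, assuming equal OSD counts per node."""
    if not 0 <= nodes_failed <= nodes_total or nodes_total == 0:
        raise ValueError("invalid node counts")
    return (nodes_total - nodes_failed) / nodes_total


if __name__ == "__main__":
    for links in (1, 2, 10):
        print(f"{links}G link(s): {gbit_to_mbyte_per_s(links):.0f} MB/s")
    print(f"OSDs left after losing 1 of 3 nodes: "
          f"{remaining_osd_fraction(3, 1):.2f}")
```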
The reason why I do it this way is quite simple: SW-RAID is, in my opinion, the cheapest way to get data protection.

RAID only increases redundancy. It is not a backup solution. The only thing it saves you from is downtime on that node and the process of backfilling, at the giant expense of performance and complexity (and dual TBW usage). I'm not even talking about PSU, node or connectivity failure (which this does not guard against). With 3 nodes you already have 3x redundancy; is there a business need for it to be 6x?

For the cost of 3 additional (and proper) journal SSDs you can almost buy a 4th node that is designed to only run Proxmox Ceph OSDs and not host any mons or VMs.
It would definitely give you more redundancy; the upshot is that it might also give you a performance boost once you have >2G connectivity per node.

Again: SW-RAIDing a journal is a terrible idea, but there are worse ones out there (e.g. RAID-0 for a journal), so if you understand the risks and pitfalls and are adamant about using it, the worst that can happen is that you do a reset of your whole config and follow the recommendations that time around :p

Edit: Udo's comment on enterprise SSDs for journaling is spot on. Get one that has a TBW rating suitable for the amount of data your daily business writes.
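
A rough way to sanity-check a TBW rating against your workload is to divide the rated endurance by the daily write volume. The sketch below does just that; the function name and the example numbers are hypothetical and not taken from any datasheet, and write amplification is ignored.

```python
def ssd_lifetime_years(tbw_rating_tb: float, daily_writes_gb: float) -> float:
    """Estimate the years until the rated TBW is used up.

    tbw_rating_tb: endurance rating of the SSD in terabytes written.
    daily_writes_gb: data written to that SSD per day, in gigabytes.
    """
    if tbw_rating_tb <= 0 or daily_writes_gb <= 0:
        raise ValueError("both values must be positive")
    daily_writes_tb = daily_writes_gb / 1000
    return tbw_rating_tb / daily_writes_tb / 365


if __name__ == "__main__":
    # Hypothetical numbers: compare a low and a high endurance rating
    # for the same 200 GB/day journal workload.
    for rating in (70, 7000):
        years = ssd_lifetime_years(rating, 200)
        print(f"{rating} TBW at 200 GB/day: about {years:.1f} years")
```

In a mirror, both SSDs receive the full write volume, so each one wears at this rate.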
 
...
What happens is the following:
  • you lose all the data on said OSDs
    • you have 2 more copies of said data, so why would you care?
  • you get only 2/3 of your write and read performance (because you lost 1/3 of your OSDs).
Hi Q-wulf,
I don't agree 100% with this statement. In my eyes he will get faster access in "node-failed mode", because the chance of reading the data from the local node (without the network) rises from 33.3% to 50%, and only two replicas must be written (with ack), which should also speed up writes slightly.

Nevertheless, it should only be used for a short time until the failed node is back.
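
The 33.3% versus 50% figures follow from a simple model, sketched here under stated assumptions: reads are served by the primary copy, primaries are spread evenly over the nodes that are up, and the VM runs on one of those nodes. The function name is made up for illustration.

```python
from fractions import Fraction


def local_primary_chance(nodes_up: int) -> Fraction:
    """Chance that the primary copy of an object sits on the local node."""
    if nodes_up < 1:
        raise ValueError("at least one node must be up")
    return Fraction(1, nodes_up)


if __name__ == "__main__":
    for nodes_up in (3, 2):
        chance = local_primary_chance(nodes_up)
        print(f"{nodes_up} nodes up: {float(chance) * 100:.1f}% local reads")
```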

Udo
 
Yeah, you are right. I was looking at it from my POV, where we have higher link speeds and way more nodes and OSDs per node. In that situation it evens out before it turns into a disadvantage.

With 3 nodes and only 4 OSDs on a suspected 1G line, a cluster running on 2 of 3 nodes is faster, since his 1G link would be the bottleneck. Hence getting 50% instead of 33% locally gives you an advantage in both latency and bandwidth (should the requested data exceed the link speed).
 
hi,

thanks a lot for your explanation. So I was right: losing the journal will end in total corruption of the node (I hoped that I was not right ;) ).
Since my wiring is the bottleneck for my little cluster, I'll leave the journal out of the system and run without it. There should not be that many reads/writes on the VMs I'm running on those nodes.


Again - thanks a lot!