But what happens when this single SSD fails? The journal is gone and all of the data has to be rewritten, right?
Let's break down what Ceph is designed for:
- the ability to take the loss of one OSD (or multiple)
- the ability to take the loss of one node (or multiple)
- the ability to take the loss of one rack (or multiple)
- ... all the way up to one Campus/Region/Datacenter ...
All you have to do is plan for it and configure it right.
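As a rough sketch of what "configure it right" means (the pool name `mypool` and the PG count are just example values, not recommendations):

```shell
# Create a replicated pool with 128 placement groups
ceph osd pool create mypool 128 128 replicated

# Keep 3 copies of every object, keep serving I/O as long as 1 copy survives
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 1

# The default CRUSH rule spreads replicas across distinct hosts; to tolerate
# the loss of a whole rack instead, you'd create a rule with a rack-level
# failure domain, e.g.:
# ceph osd crush rule create-replicated rack-rule default rack
```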
Let's say you have 3 nodes. That lands you in "<= one node failure" country. Let's say one node completely dies (which is effectively what losing the single SSD that houses the journal for your 4 OSDs will do for you), and let's say you are running your pool in replicated 3/1 mode (size 3, min_size 1).
What happens is the following:
- you lose all the data on said OSDs
- you have 2 more copies of said data, so why would you care?
- you get only 2/3 of your write and read performance (because you lost 1/3 of your OSDs)
- better than total downtime (am I right?)
- Unless you have 10G links you won't be maxing out your OSD read/write performance anyway, because 1G = 125 MB/s and 2x1G = 250 MB/s. Your SSD (even with the journals on it) will likely be faster.
- Don't even get me started on the bandwidth 4 OSDs can push.
- Once you are rebuilding your OSDs, your network becomes congested with the backfilling. That means you won't even get your 2/3 of the performance, since you are using bandwidth for the backfilling.
- There are settings in Ceph that make this quite painless.
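For reference, these are the kind of knobs I mean. The values below are just a conservative example in the `[osd]` section of ceph.conf (tune to your own workload); they throttle recovery so client I/O keeps priority:

```
[osd]
; allow only one concurrent backfill operation per OSD
osd max backfills = 1
; limit the number of parallel recovery ops per OSD
osd recovery max active = 1
; lower recovery priority further below client I/O (client default is 63)
osd recovery op priority = 1
```

Slower backfill means a longer rebuild window, but your VMs stay responsive while it runs.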
"The reason why I do it this way is quite simple: SW-RAID is, in my opinion, the cheapest way to get data protection."
RAID only increases redundancy. It is not a backup solution. The only thing it saves you from is downtime on that node and the backfilling process, at the giant expense of performance and complexity (and double the TBW usage). I'm not even talking about PSU, node, or connectivity failure (which it does not guard against). With 3 nodes you already have 3x redundancy; is there a business need for 6x?
For the cost of 3 additional (and proper) journal SSDs you can almost buy a 4th node that is designed to only run proxmox-ceph and not host any mons or VMs on it. OSDs excluded.
It definitely would give you more redundancy, and it might even give you a performance boost once you have >2G connectivity per node.
Again: SW-RAIDing a journal is a terrible idea, but there are worse ones out there (e.g. RAID-0 for a journal). So if you understand the risks and pitfalls and are adamant about using it, the worst that can happen is that you do a reset of your whole config and follow the recommendations that time around.
Edit: Udo's comments on enterprise SSDs for journaling are quite spot on. Get one with a TBW rating suitable for the amount of data your daily business writes.
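A quick back-of-the-envelope sketch for sizing that TBW rating (the inputs and the amplification factor here are made-up illustration values, not recommendations):

```python
def required_tbw(daily_writes_gb: float, years: float,
                 write_amplification: float = 2.0) -> float:
    """Rough TBW (terabytes written) an SSD must survive over its lifetime.

    write_amplification is an assumed factor covering journal double-writes
    and replica traffic landing on this node's OSDs; measure your own.
    """
    return daily_writes_gb * write_amplification * 365 * years / 1000

# e.g. 100 GB of writes per day, 5-year lifetime:
required_tbw(100, 5)  # 365.0 TBW
```

Compare that number against the TBW figure on the SSD's datasheet, with headroom to spare.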