[TUTORIAL] FabU: can I use Ceph in a _very_ small cluster?

UdoB

Ceph is great, but it needs some resources above the theoretical minimum to work reliably. My assumptions for the following text:
  • you want to use Ceph because... why not?
  • you want to use High Availability - which requires Shared Storage (note that a complete solution needs more things like a redundant network stack and power supplies)
  • you want to start as small (and cheap) as possible, because this is... “only” a Homelab
You plan for three Nodes. Each node has a single dedicated disk for use as an “OSD”. This is the documented minimum, so it should be fine, shouldn’t it? Well..., it is the absolute minimum for a cluster. With this minimal approach you use Ceph with the default replication settings "size=3/min_size=2". (Never go below that!)

This construct allows one device to fail without data loss. This is great! No...? There are a couple of problem areas with this minimized approach:

Problem #1: there are zero redundant nodes

When (not:if) one OSD (or a whole node, at this point there is no difference!) fails, Ceph is immediately degraded. There is no room for Ceph to heal itself, so being "degraded" is permanent. For a stable situation you really want to have nodes that can jump in and return to a stable condition - automatically. For Ceph in this picture this means to have at least four nodes. (In this specific aspect; in other regards you really want to have five or more of them...)

Essentially: we want one more node than Ceph has “size=N”.

Problem #2: data usage per OSD during normal operation

This detail is often forgotten: let's say you have those three nodes with two OSDs each. When one OSD fails, its direct neighbor will need to take over the data from the dead disk. That lost data can not be given to another node - the only two other nodes already have a copy! This means you can fill all OSDs in this approach only up to 45 percent: the original 45% plus the "other" 45% gets you 90% on this surviving OSD. To reduce this problem you want several OSDs per node or - better! - avoid it (nearly) completely by having more than three nodes.

Essentially: we may need more “spare/unused space” than expected.
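To make the 45% figure tangible, here is a minimal back-of-the-envelope sketch in Python (the 1 TB OSD size and the 90% ceiling are made-up example numbers, not recommendations):

# Spare-space arithmetic for 3 nodes with 2 OSDs each (failure domain = host,
# replicated pool with size=3): if one OSD dies, its node-local neighbour must
# absorb its data, because the other two nodes already hold the other replicas.
osd_size_tb = 1.0        # capacity of every OSD (example value)
safety_ceiling = 0.90    # never let an OSD run fuller than this (example value)

max_fill = safety_ceiling / 2   # the survivor ends up with its own data plus the dead disk's
print(f"Each OSD may only be filled to about {max_fill:.0%} "
      f"({max_fill * osd_size_tb:.2f} TB of {osd_size_tb} TB)")
# -> Each OSD may only be filled to about 45% (0.45 TB of 1.0 TB)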

Problem #3: realize that Ceph is much more critical for a cluster...

...than a local SSD is for one of the three nodes: when Ceph goes read-only, all VMs in the whole cluster will probably stop within seconds or a very few minutes. They cannot write any data (including log messages, which practically every VM writes continuously) and will stall.

Essentially: when we are utilizing Ceph it quickly becomes crucial!

Problem #4: the network needs to be fast...

... as the data-to-be-written will go over the wire multiple times before it is considered "written". So a fast network is a must; 10 GBit/s should be considered the minimum. Technically it works with slower speeds - at least at the beginning, with low load. But when high usage leads to congestion, latency increases and you will encounter "strange" errors which may be hard to debug. (Not mentioned here: zero network redundancy is bad.)

Essentially: we want 10 GBit/s or faster.
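A rough sketch of why the wire sees the data several times (the numbers below are illustrative assumptions, not measurements):

# With a replicated pool of size=3, every byte a VM writes travels from the
# client to the primary OSD and from there to two replicas, so the network
# carries roughly 3x the client write rate in total.
size = 3                   # pool replication factor
client_write_mb_s = 300    # assumed aggregate VM write load in MB/s
wire_traffic_mb_s = client_write_mb_s * size
print(f"{client_write_mb_s} MB/s of VM writes -> roughly {wire_traffic_mb_s} MB/s "
      f"({wire_traffic_mb_s * 8 / 1000:.1f} Gbit/s) of traffic on the wire")
# 1 GbE (~110 MB/s usable) is hopeless here; 10 GbE copes, but with little
# headroom left once recovery traffic or backups join in.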

Problem #5: regarding SSDs/NVMe, you probably already know the recommendation to use Enterprise-class devices.

There are multiple reasons for that recommendation; please consider them. If you go for - let's say - seven nodes with five OSDs each, the required quality of the OSDs (in a homelab) may be lower, but with a bare-minimum number of disks they really need to be high quality.

Essentially: adequate hardware may be more expensive than the budget allows.

Problem #6: RAM

Ceph does not come for free. For a small cluster we need, per daemon: MON = 1-2 GB; MGR = 1-2 GB; MDS = 1 GB; OSD = 3-5 GB. The small example cluster with three nodes (3 MONs + 1 MGR + 0 MDS + 2 OSDs per node) needs between 3*1 + 1*1 + 0 + 3*2*3 = 22 GB and 3*2 + 1*2 + 3*2*5 = 38 GB of RAM (Ref: https://docs.ceph.com/en/mimic/start/hardware-recommendations/#ram )

Essentially: you never can have enough RAM, be it for Ceph or for ZFS (ARC)! (Do not over-commit massive amounts of RAM, it does not work very well.)
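For reference, the same estimate as a tiny script (the per-daemon figures are the ranges from the Ceph documentation linked above):

# RAM estimate for the example cluster: three nodes, 3 MONs, 1 MGR, no MDS,
# 2 OSDs per node. Lower bound uses the small end of each range, upper bound the large end.
nodes, mons, mgrs, mds, osds_per_node = 3, 3, 1, 0, 2

low  = mons * 1 + mgrs * 1 + mds * 1 + nodes * osds_per_node * 3
high = mons * 2 + mgrs * 2 + mds * 1 + nodes * osds_per_node * 5
print(f"Ceph alone needs roughly {low}-{high} GB of RAM across the cluster")
# -> roughly 22-38 GB, on top of what the VMs themselves need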


Todo, not discussed yet

  • Erasure Coding = better space efficiency with lower performance - needs a higher number of OSDs

Disclaimer, as usual: YMMV! While I do experiment with Ceph, I am definitely not an expert.



Now after all the “bad news”-details above, what do I do if I am a little bit paranoid but want to use Ceph nevertheless?
  • use five nodes (or more) with five MONs! This is a requirement to actually allow two nodes to fail - only with five MONs can the three survivors keep quorum and act as wanted
  • increase "size=3/min_size=2"
  • use several OSDs per node
  • use SSD/NVMe only - or learn to add separate WAL/DB from data disks, which again increases the need for more independent devices...
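For the second point, a minimal sketch of what raising the replication could look like (assumptions: the "ceph" CLI is available on a node and the pool is called "vm-pool" - substitute your own pool name; size=4/min_size=2 is just one possible target for a five-node cluster, not a general recommendation):

import subprocess

pool = "vm-pool"   # example pool name, not a default
for key, value in (("size", "4"), ("min_size", "2")):
    # equivalent to: ceph osd pool set vm-pool size 4 / min_size 2
    subprocess.run(["ceph", "osd", "pool", "set", pool, key, value], check=True)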

Final word: do not mix up Ceph MON majority with PVE Quorum. Technically they are completely independent and have their own requirements and possible pitfalls.

 
Great post @UdoB

*Not an expert on Ceph, but just sharing my experience on the steps to take when a disk has failed in our situation

When we detect an error on a disk, say from dmesg on the SSH console:

[15528340.545531] sd 6:0:6:0: [sdg] tag#203 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[15528340.545701] sd 6:0:6:0: [sdg] tag#203 Sense Key : Medium Error [current]
[15528340.545857] sd 6:0:6:0: [sdg] tag#203 Add. Sense: Unrecovered read error
[15528340.546017] sd 6:0:6:0: [sdg] tag#203 CDB: Read(16) 88 00 00 00 00 00 80 bb 9e 68 00 00 00 80 00 00
[15528340.546187] critical medium error, dev sdg, sector 2159779432 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0

I would go to the node in question, click on Ceph -> OSD and identify the OSD. Once identified, I would click Out, Stop and Destroy for that OSD, followed by a physical replacement of the failed drive.
 
Once identified, I would click Out, Stop and Destroy for that OSD, followed by a physical replacement of the failed drive.
I would like to add a small detail:
Once identified, I would click Out... yes, but then wait for the rebuild / for the data to drain off that OSD. Don't do "Stop and Destroy" too fast. In a small cluster even the broken disk could still deliver missing pieces/PGs if something unexpected happens.
Same for ZFS... "keep" the broken disk around as it is; it could be the life saver if the rebuild/recovery won't complete or the next disk dies at exactly that moment.

Destroy/delete it only after the full rebuild has completed.
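If you prefer to let Ceph itself confirm that the rebuild is done before destroying anything, a small sketch along these lines may help (assumptions: the "ceph" CLI is available on the node and "7" is a made-up OSD id used as an example):

import subprocess, sys, time

osd_id = "7"   # example id - replace with the failed OSD's id

while True:
    # "ceph osd safe-to-destroy" exits with 0 only when removing the OSD
    # would no longer reduce data availability or durability.
    result = subprocess.run(["ceph", "osd", "safe-to-destroy", osd_id])
    if result.returncode == 0:
        print(f"osd.{osd_id} is safe to destroy - proceed with Stop/Destroy")
        sys.exit(0)
    print("Recovery still in progress, checking again in 60s ...")
    time.sleep(60)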
 
Unfortunately it cannot simulate my "problem scenario" with three nodes / two OSDs each and one OSD failing.
Yes, because the assumed failure zone is the host. If just an OSD fails it should be replaced.
In small clusters the time to replace a failed disk is more crucial than in larger clusters where the data is more easily replicated to the other OSDs in the nodes.
 
Great writeup @UdoB. I do want to make a few observations:

When (not:if) one OSD (or a whole node, at this point there is no difference!) fails, Ceph is immediately degraded. There is no room for Ceph to heal itself, so being "degraded" is permanent.
That's not a bug, that's a feature. The whole point of Ceph is to be shared-nothing. When an OSD fails, the impact is just a reduction in usable space. When a node fails, replace it; the filesystem is still coherent and usable in the meantime. Nothing here is permanent.

For Ceph in this picture this means to have at least four nodes. (In this specific aspect; in other regards you really want to have five or more of them...)
See above. You can run a perfectly usable cluster with 3 nodes. While it is true that 3 nodes cannot rebalance, 4 nodes is actually neither here nor there, especially in a high-utilization scenario where you will need 33% ADDITIONAL free capacity across the surviving OSD nodes. Cluster planning involves more than just the number of nodes.
let's say you have those three nodes with two OSDs each. When one OSD fails, its direct neighbor will need to take over the data from the dead disk. That lost data can not be given to another node - the only two other nodes already have a copy! This means you can fill all OSDs in this approach only up to 45 percent: the original 45% plus the "other" 45% gets you 90% on this surviving OSD. To reduce this problem you want several OSDs per node or - better! - avoid it (nearly) completely by having more than three nodes.
This is true regardless of node count. Ceph doesn't really make sense with only 2 OSDs/node, so plan for that.
Essentially: we may need more “spare/unused space” than expected.
This is, again, part of cluster planning. OSDs are considered full at 85%, and impact write performance at 80%. That means you should already plan for a maximum fill per node of (number of nodes - 1) * (OSD capacity per node * 0.8) / (number of nodes).
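A worked example of that rule with made-up numbers (4 nodes, 8 TB of raw OSD capacity per node, 80% fill limit):

nodes = 4
capacity_per_node_tb = 8.0
fill_limit = 0.80

# Each node may only carry this much data if everything must still fit below
# the 80% mark on the survivors after one full node has failed:
max_per_node_tb = (nodes - 1) * (capacity_per_node_tb * fill_limit) / nodes
print(f"Plan for at most {max_per_node_tb:.1f} TB per node "
      f"({max_per_node_tb / capacity_per_node_tb:.0%} of its raw capacity)")
# -> 4.8 TB per node, i.e. 60% of the raw capacity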

Essentially: we want 10 GBit/s or faster.
That's half of it. The other half is separation of interfaces, to prevent the Ceph, Corosync, and service networks from contending for bandwidth. Contention will quickly kill your cluster.
Erasure Coding = better space efficiency with lower performance - needs a higher number of OSDs
No. Just no.

EC can be a solution for filer or object-store workloads, but not for virtualization - and CERTAINLY NOT in the context of this guide ("home lab")
 
That's not a bug, that's a feature. The whole point of Ceph is to be shared-nothing. When an OSD fails, the impact is just a reduction in usable space. When a node fails, replace it; the filesystem is still coherent and usable in the meantime. Nothing here is permanent.
I describe a three-node setup with one node failing. The re-creation of all three configured replicas on at least three nodes can only happen after a third node is available again. Until then it is permanently degraded (but working, of course) as there are only two nodes left. Did I use "permanent" in an unusual way in this context? (English is not my first language...)

10 GBit/s:
That's half of it. The other half is separation of interfaces, to prevent the Ceph, Corosync, and service networks from contending for bandwidth. Contention will quickly kill your cluster.
Yes, I should have mentioned that. The few topics I listed are not meant to be an all-inclusive list; I just wanted to point out the pitfalls of a smallest-possible Ceph setup.

No. Just no.
EC can be a solution for filer or object-store workloads, but not for virtualization - and CERTAINLY NOT in the context of this guide ("home lab")
That's why I didn't talk about it :cool:

Thanks for the feedback!
 
Ceph doesn't really make sense with only 2 OSDs/node, so plan for that.
Yes, I am with you.

A "very small cluster" in a Homelab possibly means to use Mini-PCs with really limited space for additional disks. I would not be surprised if several users run such a scenario - me included ;-)

That's why knowing the implications is important...
 
Not really, this is HA for VM/CT not networking.
Not trying to offend you, but people like you are exactly why I would add it :)

HA means HA and not "some parts are redundant" or "yes, our server room is HA, but please don't pull this cable"-redundant

If you use only one switch and NIC, what happens when this switch goes down and the 4 nodes can’t talk to each other?
I don’t know how CEPH handles this; MS clustering could not always handle it when I last used it 10+ years ago.

But even if CEPH can handle it, the service is still down and you basically don’t have HA.
 
Yes, you are right, but the topic is very_small_cluster, not build-your-own-datacenter. In such small clusters you don't have any redundancy, except maybe a dedicated Ceph switch (besides the regular switch for all other traffic).
 
You could add that HA also requires redundant networks (multiple NICs and switches).
Yeah, it never gets "complete" while staying compact.

Yes, you are right, but the topic is very_small_cluster, not build-your-own-datacenter.
:)

I've edited it to read "* you want to use High Availability - which requires Shared Storage (note that a complete solution needs more things like a redundant network stack and power supplies)"

Thank you for the feedback!
 
but the topic is very_small_cluster, not build-your-own-datacenter
IMHO, the size of your setup does not matter.

Either you want HA, then you probably want real HA.

Or you are like 99% of the homelab people that don't need HA, but then why bother with CEPH and all its downsides?

The only reason for CEPH without HA would be to learn something for work, in my opinion.
 
Either you want HA, then you probably want real HA.
Or you are like 99% of the homelab people that don't need HA, but then why bother with CEPH and all its downsides?
A homelab with some Mini-PCs with very limited connectivity, like "only one NVMe, one SATA and one NIC", makes you try solutions which are not suitable for a "real use" commercial cluster. ;-)

Probably my recommendation "five nodes, multiple OSDs each" does not fit a small homelab at all. My goal was to show some of the pitfalls, so everybody knows them... beforehand.
 
anyone running ceph in production SHOULD HAVE a redundant switch setup otherwise he/she should be fired from the job.
It really depends. If the cluster is large enough to spread over several racks, with the rack being the failure zone, you could argue that you do not need redundant top-of-rack switches. The loss of a whole rack can then be easily mitigated.

But yes, in small clusters network redundancy is a must.
 
@gurubert yes, that is possible, but I would not risk a whole rack going down due to a single-switch design. Switches, cables and interfaces are cheap nowadays. In our network and setup we eliminated all single points of failure - I sleep better at night :)
 