[TUTORIAL] FabU: can I use Ceph in a _very_ small cluster?

UdoB

Ceph is great, but it needs some resources above the theoretical minimum to work reliably. My assumptions for the following text:
  • you want to use Ceph because... why not?
  • you want to use High Availability - which requires Shared Storage (note that a complete solution needs more things like a redundant network stack and power supplies)
  • you want to start as small (and cheap) as possible, because this is... “only” a Homelab
You plan for three Nodes. Each node has a single dedicated disk for use as an “OSD”. This is the documented minimum, so it should be fine, shouldn’t it? Well..., it is the absolute minimum for a cluster. With this minimal approach you use Ceph with the default replication settings "size=3/min_size=2". (Never go below that!)

This construct allows one device to fail without data loss. This is great! No...? There are a couple of problem areas with this minimized approach:

Problem #1: there are zero redundant nodes

When (not:if) one OSD (or a whole node, at this point there is no difference!) fails, Ceph is immediately degraded. There is no room for Ceph to heal itself, so being "degraded" is permanent. For a stable situation you really want to have nodes that can jump in and return to a stable condition - automatically. For Ceph in this picture this means to have at least four nodes. (In this specific aspect; in other regards you really want to have five or more of them...)

Essentially: we want one more node than Ceph has “size=N”.

Problem #2: data usage per OSD during normal operation

This detail is often forgotten: let's say you have those three nodes with two OSDs each. When one OSD fails, its direct neighbor will need to take over the data from the dead disk. That lost data can not be given to another node - the only two other nodes already have a copy! This means you can fill all OSDs in this approach only up to 45 percent: the original 45% plus the "other" 45% gets you 90% on this surviving OSD. To reduce this problem you want several OSDs per node or - better! - avoid it (nearly) completely by having more than three nodes.

Essentially: we may need more “spare/unused space” than expected.
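To make the 45% figure tangible, here is a minimal back-of-the-envelope sketch in Python (the 1 TB OSD size and the 90% ceiling are made-up example numbers, not recommendations):

# Spare-space arithmetic for 3 nodes with 2 OSDs each (failure domain = host,
# replicated pool with size=3): if one OSD dies, its node-local neighbour must
# absorb its data, because the other two nodes already hold the other replicas.
osd_size_tb = 1.0        # capacity of every OSD (example value)
safety_ceiling = 0.90    # never let an OSD run fuller than this (example value)

max_fill = safety_ceiling / 2   # the survivor ends up with its own data plus the dead disk's
print(f"Each OSD may only be filled to about {max_fill:.0%} "
      f"({max_fill * osd_size_tb:.2f} TB of {osd_size_tb} TB)")
# -> Each OSD may only be filled to about 45% (0.45 TB of 1.0 TB)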

Problem #3: realize that Ceph is much more critical for a cluster...

...than a local SSD is for one of the three nodes: when Ceph goes read-only, all VMs in the whole cluster will probably stop within seconds or a very few minutes. They cannot write any data (including log messages, which practically every VM writes continuously) and will stall.

Essentially: when we are utilizing Ceph it quickly becomes crucial!

Problem #4: the network needs to be fast...

... as the data-to-be-written will go over the wire multiple times before it is considered "written". So a fast network is a must; 10 GBit/s should be considered the minimum. Technically it works with slower speeds - at least at the beginning, with low load. But when high usage leads to congestion, latency increases and you will encounter "strange" errors which may be hard to debug. (Not mentioned here: zero network redundancy is bad.)

Essentially: we want 10 GBit/s or faster.
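A rough sketch of why the wire sees the data several times (the numbers below are illustrative assumptions, not measurements):

# With a replicated pool of size=3, every byte a VM writes travels from the
# client to the primary OSD and from there to two replicas, so the network
# carries roughly 3x the client write rate in total.
size = 3                   # pool replication factor
client_write_mb_s = 300    # assumed aggregate VM write load in MB/s
wire_traffic_mb_s = client_write_mb_s * size
print(f"{client_write_mb_s} MB/s of VM writes -> roughly {wire_traffic_mb_s} MB/s "
      f"({wire_traffic_mb_s * 8 / 1000:.1f} Gbit/s) of traffic on the wire")
# 1 GbE (~110 MB/s usable) is hopeless here; 10 GbE copes, but with little
# headroom left once recovery traffic or backups join in.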

Problem #5: regarding SSDs/NVMe, you probably already know the recommendation to use Enterprise-class devices.

There are multiple reasons for that recommendation; please consider them. If you go for - let's say - seven nodes with five OSDs each, the required quality of the OSDs (in a homelab) may be lower, but with a bare-minimum number of disks they really need to be high quality.

Essentially: adequate hardware may be more expensive than the budget allows.

Problem #6: RAM

Ceph does not come for free. For a small cluster we need, per daemon: MON = 1-2 GB; MGR = 1-2 GB; MDS = 1 GB; OSD = 3-5 GB. The small example cluster with three nodes (3 MONs + 1 MGR + 0 MDS + 2 OSDs per node) needs between 3*1 + 1*1 + 0 + 3*2*3 = 22 GB and 3*2 + 1*2 + 3*2*5 = 38 GB of RAM (Ref: https://docs.ceph.com/en/mimic/start/hardware-recommendations/#ram )

Essentially: you never can have enough RAM, be it for Ceph or for ZFS (ARC)! (Do not over-commit massive amounts of RAM, it does not work very well.)
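For reference, the same estimate as a tiny script (the per-daemon figures are the ranges from the Ceph documentation linked above):

# RAM estimate for the example cluster: three nodes, 3 MONs, 1 MGR, no MDS,
# 2 OSDs per node. Lower bound uses the small end of each range, upper bound the large end.
nodes, mons, mgrs, mds, osds_per_node = 3, 3, 1, 0, 2

low  = mons * 1 + mgrs * 1 + mds * 1 + nodes * osds_per_node * 3
high = mons * 2 + mgrs * 2 + mds * 1 + nodes * osds_per_node * 5
print(f"Ceph alone needs roughly {low}-{high} GB of RAM across the cluster")
# -> roughly 22-38 GB, on top of what the VMs themselves need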


Todo, not discussed yet

  • Erasure Coding = better space efficiency with lower performance - needs a higher number of OSDs

Disclaimer, as usual: YMMV! While I do experiment with Ceph, I am definitely not an expert.



Now after all the “bad news”-details above, what do I do if I am a little bit paranoid but want to use Ceph nevertheless?
  • use five nodes (or more) with five MONs! This is a requirement to actually allow two nodes to fail - only with five MONs can the three survivors keep quorum and act as wanted
  • increase "size=3/min_size=2"
  • use several OSDs per node
  • use SSD/NVMe only - or learn to add separate WAL/DB from data disks, which again increases the need for more independent devices...
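For the second point, a minimal sketch of what raising the replication could look like (assumptions: the "ceph" CLI is available on a node and the pool is called "vm-pool" - substitute your own pool name; size=4/min_size=2 is just one possible target for a five-node cluster, not a general recommendation):

import subprocess

pool = "vm-pool"   # example pool name, not a default
for key, value in (("size", "4"), ("min_size", "2")):
    # equivalent to: ceph osd pool set vm-pool size 4 / min_size 2
    subprocess.run(["ceph", "osd", "pool", "set", pool, key, value], check=True)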

Final word: do not mix up Ceph MON majority with PVE Quorum. Technically they are completely independent and have their own requirements and possible pitfalls.

 
Great post @UdoB

*Not an expert on Ceph, but just sharing my experience on the steps to take when a disk has failed in our situation

When we detect an error on a disk, say from dmesg on the SSH console:

[15528340.545531] sd 6:0:6:0: [sdg] tag#203 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[15528340.545701] sd 6:0:6:0: [sdg] tag#203 Sense Key : Medium Error [current]
[15528340.545857] sd 6:0:6:0: [sdg] tag#203 Add. Sense: Unrecovered read error
[15528340.546017] sd 6:0:6:0: [sdg] tag#203 CDB: Read(16) 88 00 00 00 00 00 80 bb 9e 68 00 00 00 80 00 00
[15528340.546187] critical medium error, dev sdg, sector 2159779432 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0

I would go to the node in question, click on Ceph -> OSD and identify the OSD. Once identified, I would click Out, Stop and Destroy for that OSD, followed by a physical replacement of the failed drive.
 
Once identified, I would click Out, Stop and Destroy for that OSD, followed by a physical replacement of the failed drive.
I would like to add a small detail:
Once identified, I would click Out... yes, but then wait for the rebuild / for the data to drain off that OSD. Don't do "Stop and Destroy" too fast. In a small cluster even the broken disk could still deliver missing pieces/PGs if something unexpected happens.
Same for ZFS... "keep" the broken disk around as it is; it could be the life saver if the rebuild/recovery won't complete or the next disk dies at exactly that moment.

Destroy/delete it only after the full rebuild has completed.
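If you prefer to let Ceph itself confirm that the rebuild is done before destroying anything, a small sketch along these lines may help (assumptions: the "ceph" CLI is available on the node and "7" is a made-up OSD id used as an example):

import subprocess, sys, time

osd_id = "7"   # example id - replace with the failed OSD's id

while True:
    # "ceph osd safe-to-destroy" exits with 0 only when removing the OSD
    # would no longer reduce data availability or durability.
    result = subprocess.run(["ceph", "osd", "safe-to-destroy", osd_id])
    if result.returncode == 0:
        print(f"osd.{osd_id} is safe to destroy - proceed with Stop/Destroy")
        sys.exit(0)
    print("Recovery still in progress, checking again in 60s ...")
    time.sleep(60)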
 
Unfortunately it cannot simulate my "problem scenario" with three nodes / two OSDs each and one OSD failing.
Yes, because the assumed failure zone is the host. If just an OSD fails it should be replaced.
In small clusters the time to replace a failed disk is more crucial than in larger clusters where the data is more easily replicated to the other OSDs in the nodes.
 
Great writeup @UdoB. I do want to make a few observations:

When (not:if) one OSD (or a whole node, at this point there is no difference!) fails, Ceph is immediately degraded. There is no room for Ceph to heal itself, so being "degraded" is permanent.
That's not a bug, that's a feature. The whole point of Ceph is to be shared-nothing. When an OSD fails, the impact is just a reduction in usable space. When a node fails, replace it; the filesystem is still coherent and usable in the meantime. Nothing here is permanent.

For Ceph in this picture this means to have at least four nodes. (In this specific aspect; in other regards you really want to have five or more of them...)
See above. You can run a perfectly usable cluster with 3 nodes. While it is true that 3 nodes cannot rebalance, 4 nodes is actually neither here nor there, especially in a high-utilization scenario where you will need 33% ADDITIONAL free capacity across the surviving OSD nodes. Cluster planning involves more than just the number of nodes.
let's say you have those three nodes with two OSDs each. When one OSD fails, its direct neighbor will need to take over the data from the dead disk. That lost data can not be given to another node - the only two other nodes already have a copy! This means you can fill all OSDs in this approach only up to 45 percent: the original 45% plus the "other" 45% gets you 90% on this surviving OSD. To reduce this problem you want several OSDs per node or - better! - avoid it (nearly) completely by having more than three nodes.
This is true regardless of node count. Ceph doesn't really make sense with only 2 OSDs/node, so plan for that.
Essentially: we may need more “spare/unused space” than expected.
This is, again, part of cluster planning. OSDs are considered full at 85%, and impact write performance at 80%. That means you should already plan for a maximum fill per node of (number of nodes - 1) * (OSD capacity per node * 0.8) / (number of nodes).
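A worked example of that rule with made-up numbers (4 nodes, 8 TB of raw OSD capacity per node, 80% fill limit):

nodes = 4
capacity_per_node_tb = 8.0
fill_limit = 0.80

# Each node may only carry this much data if everything must still fit below
# the 80% mark on the survivors after one full node has failed:
max_per_node_tb = (nodes - 1) * (capacity_per_node_tb * fill_limit) / nodes
print(f"Plan for at most {max_per_node_tb:.1f} TB per node "
      f"({max_per_node_tb / capacity_per_node_tb:.0%} of its raw capacity)")
# -> 4.8 TB per node, i.e. 60% of the raw capacity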

Essentially: we want 10 GBit/s or faster.
That's half of it. The other half is separation of interfaces, to prevent the Ceph, Corosync, and service networks from contending for bandwidth. Contention will quickly kill your cluster.
Erasure Coding = better space efficiency with lower performance - needs a higher number of OSDs
No. Just no.

EC can be a solution for filer or object-store workloads, but not for virtualization - and CERTAINLY NOT in the context of this guide ("home lab")
 
That's not a bug, that's a feature. The whole point of Ceph is to be shared-nothing. When an OSD fails, the impact is just a reduction in usable space. When a node fails, replace it; the filesystem is still coherent and usable in the meantime. Nothing here is permanent.
I describe a three-node setup with one node failing. The re-creation of all three configured replicas on at least three nodes can only happen after a third node is available again. Until then it is permanently degraded (but working, of course) as there are only two nodes left. Did I use "permanent" in an unusual way in this context? (English is not my first language...)

10 GBit/s:
That's half of it. The other half is separation of interfaces, to prevent the Ceph, Corosync, and service networks from contending for bandwidth. Contention will quickly kill your cluster.
Yes, I should have mentioned that. The few topics I listed are not meant to be an all-inclusive list; I just wanted to point out the pitfalls of a smallest-possible Ceph setup.

No. Just no.
EC can be a solution for filer or object-store workloads, but not for virtualization - and CERTAINLY NOT in the context of this guide ("home lab")
That's why I didn't talk about it :cool:

Thanks for the feedback!
 
Ceph doesn't really make sense with only 2 OSDs/node, so plan for that.
Yes, I am with you.

A "very small cluster" in a Homelab possibly means to use Mini-PCs with really limited space for additional disks. I would not be surprised if several users run such a scenario - me included ;-)

That's why knowing the implications is important...
 
Not really, this is HA for VM/CT not networking.
Not trying to offend you, but people like you are exactly why I would add it :)

HA means HA and not "some parts are redundant" or "yes, our server room is HA, but please don't pull this cable"-redundant

If you use only one switch and NIC, what happens when this switch goes down and the 4 nodes can’t talk to each other?
I don’t know how CEPH handles this; MS clustering could not always handle it when I last used it 10+ years ago.

But even if CEPH can handle it, the service is still down and you basically don’t have HA.
 
Yes, you are right, but the topic is very_small_cluster, not build-your-own-datacenter. In such small clusters you don't have any redundancy, except maybe a dedicated Ceph switch (besides the regular switch for all other traffic).
 
You could add that HA also requires redundant networks (multiple NICs and switches).
Yeah, it never gets "complete" while staying compact.

Yes, you are right, but the topic is very_small_cluster, not build-your-own-datacenter.
:)

I've edited it to read "* you want to use High Availability - which requires Shared Storage (note that a complete solution needs more things like a redundant network stack and power supplies)"

Thank you for the feedback!
 
but the topic is very_small_cluster, not build-your-own-datacenter
IMHO, the size of your setup does not matter.

Either you want HA, then you probably want real HA.

Or you are like 99% of the homelab people that don't need HA, but then why bother with CEPH and all its downsides?

The only reason for CEPH without HA would be to learn something for work, in my opinion.
 
Either you want HA, then you probably want real HA.
Or you are like 99% of the homelab people that don't need HA, but then why bother with CEPH and all its downsides?
A homelab with some Mini-PCs with very limited connectivity, like "only one NVMe, one SATA and one NIC", makes you try solutions which are not suitable for a "real use" commercial cluster. ;-)

Probably my recommendation "five nodes, multiple OSDs each" does not fit a small homelab at all. My goal was to show some of the pitfalls, so everybody knows them... beforehand.
 
anyone running ceph in production SHOULD HAVE a redundant switch setup otherwise he/she should be fired from the job.
It really depends. If the cluster is large enough to spread over several racks, with the rack being the failure zone, you could argue that you do not need redundant top-of-rack switches. The loss of a whole rack can then be easily mitigated.

But yes, in small clusters network redundancy is a must.
 
@gurubert yes, that is possible, but I would not risk a whole rack going down due to a single-switch design. Switches, cables and interfaces are cheap nowadays. In our network and setup we eliminated all single points of failure - I sleep better at night :)
 