[TUTORIAL] FabU: can I use Ceph in a _very_ small cluster?

UdoB

Ceph is great, but it needs some resources above the theoretical minimum to work reliably. My assumptions for the following text:
  • you want to use Ceph because... why not?
  • you want to use High Availability - which requires Shared Storage
  • you want to start as small (and cheap) as possible, because this is... “only” a Homelab
You plan for three nodes. Each node has a single dedicated disk for use as an “OSD”. This is the documented minimum, so it should be fine, shouldn’t it? Well... it is the absolute minimum for a cluster. With this minimal approach you use Ceph with the default replication settings "size=3/min_size=2". (Never go below that!)
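A quick way to check what a given pool actually uses (a sketch; "vm-pool" is just a placeholder for your pool name):

# show size/min_size (and more) for all pools at once
ceph osd pool ls detail

# or query a single pool
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size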

This construct allows one device to fail without data loss. This is great! No...? There are a couple of problem areas with this minimized approach:

Problem #1: there are zero redundant nodes​

When (not if) one OSD fails - or a whole node, at this point there is no difference! - Ceph is immediately degraded. There is no room for Ceph to heal itself, so being "degraded" is permanent. For a stable situation you really want nodes that can jump in and bring the cluster back to a stable condition - automatically. For Ceph in this picture this means having at least four nodes. (In this specific aspect; in other regards you really want to have five or more of them...)

Essentially: we want one more node than Ceph has “size=N”.
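How you would notice this: after a failure the health output keeps showing "undersized"/"degraded" PGs that never fully recover. A sketch of the commands to watch this:

# overall health; in the minimal setup "degraded" will not go away by itself
ceph status
ceph health detail

# follow recovery live
ceph -w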

Problem #2: data usage per OSD during normal operation

This detail is often forgotten: let's say you have those three nodes with two OSDs each. When one OSD fails, its direct neighbor on the same node has to take over the data from the dead disk. That data cannot be given to another node - the only two other nodes already hold a copy! This means you can fill the OSDs in this approach only up to roughly 45 percent: the original 45% plus the "other" 45% gets you to 90% on the surviving OSD. To reduce this problem you want several OSDs per node or - better! - avoid it (nearly) completely by having more than three nodes.

Essentially: we may need more “spare/unused space” than expected.
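To keep an eye on this, check the per-OSD utilization and the warning thresholds (the ratios shown in the comment are the usual defaults, verify them on your cluster):

# utilization per OSD and per node
ceph osd df tree

# the thresholds; typically nearfull_ratio 0.85, full_ratio 0.95
ceph osd dump | grep ratio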

Problem #3: realize that Ceph is much more critical for a cluster...​

...than a local SSD is for one of the three nodes: when Ceph goes read-only, all VMs in the whole cluster will probably stop within seconds or a very few minutes. They cannot write any data (including log messages, which practically every workload does continuously) and will stall.

Essentially: when we are utilizing Ceph it quickly becomes crucial!

Problem #4: the network needs to be fast...​

... as the data-to-be-written goes over the wire multiple times before it is considered "written". So a fast network is a must; 10 GBit/s should be considered the minimum. But yes, technically it works with slower speeds - at least in the beginning, with low load. When high usage leads to congestion, latency increases and you will encounter "strange" errors which may be hard to debug. (Not mentioned: zero network redundancy is bad.)

Essentially: we want 10 GBit/s or faster.
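To verify the raw bandwidth of the Ceph network between two nodes, a simple iperf3 run is often enough (a sketch; "pve2" is a placeholder for the other node's address on the Ceph network):

# on the first node
iperf3 -s

# on the second node
iperf3 -c pve2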

Problem #5: for SSDs/NVMe you probably already know the recommendation to use enterprise-class devices

That recommendation has multiple reasons (power-loss protection, sustained write performance, endurance); please consider them. If you went for - let's say - seven nodes with five OSDs each, the required quality of the OSDs (in a homelab) may be lower, but with a bare minimum number of disks they really need to be high quality.

Essentially: adequate hardware may be more expensive than the budget allows.
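One of those reasons: Ceph issues many small synchronous writes, and consumer devices without power-loss protection can be dramatically slower at those than their datasheet suggests. A hedged fio sketch to compare candidate disks - it writes to the device, so only run it on an empty disk; /dev/sdX is a placeholder:

# 4k synchronous single-threaded writes, roughly the pattern of the OSD DB/WAL
# WARNING: destructive - only on a disk without data!
fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based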

Problem #6: RAM​

Ceph does not come for free. For a small cluster we need, per daemon: MON = 1-2 GB; MGR = 1-2 GB; MDS = 1 GB; OSD = 3-5 GB. The small example cluster with three nodes (3 MONs + 1 MGR + 0 MDS + 2 OSDs per node) needs between 3×1 + 1×1 + 0 + (3×2)×3 = 22 GB and 3×2 + 1×2 + 0 + (3×2)×5 = 38 GB of RAM. (Ref: https://docs.ceph.com/en/mimic/start/hardware-recommendations/#ram )

Essentially: you can never have enough RAM, be it for Ceph or for ZFS (ARC)! (Do not over-commit massive amounts of RAM, it does not work very well.)
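The same estimate as a tiny shell calculation, so you can plug in your own daemon counts (numbers are the rough per-daemon figures from above):

# 3 MONs, 1 MGR, 0 MDS, 6 OSDs
echo "low:  $(( 3*1 + 1*1 + 0*1 + 6*3 )) GB"   # 22 GB
echo "high: $(( 3*2 + 1*2 + 0*1 + 6*5 )) GB"   # 38 GB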


To do, not discussed yet

  • Erasure Coding = better space efficiency with lower performance - needs a higher number of OSDs

Disclaimer, as usual: YMMV! While I am experimenting with Ceph I am definitely not an expert.



Now, after all the “bad news” details above, what do I do if I am a little bit paranoid but want to use Ceph nevertheless?
  • use five nodes (or more) with five MONs! This is a requirement to actually allow two nodes to fail - only with five MONs can three survivors still form a majority and act as wanted
  • increase the replication beyond "size=3/min_size=2", e.g. to "size=4/min_size=2" (see the sketch after this list)
  • use several OSDs per node
  • use SSD/NVMe only - or learn to put WAL/DB on devices separate from the data disks, which again increases the need for more independent devices...
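As referenced in the list above, raising the replication could look like this (a sketch; "vm-pool" is a placeholder, and size=4 only makes sense with enough nodes to place the extra copy):

# one more replica than the default; min_size stays at 2
ceph osd pool set vm-pool size 4
ceph osd pool set vm-pool min_size 2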

Final word: do not mix up Ceph MON majority with PVE Quorum. Technically they are completely independent and have their own requirements and possible pitfalls.
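You can check both independently on any node (a sketch):

# Proxmox VE (corosync) quorum
pvecm status

# Ceph monitor quorum
ceph quorum_status --format json-pretty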

PS:
 
Great post @UdoB

*Not an expert on Ceph, but just sharing my experience of which steps to take when a disk has failed in our situation.

When we detect an error on a disk, say from dmesg on the SSH console:

[15528340.545531] sd 6:0:6:0: [sdg] tag#203 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[15528340.545701] sd 6:0:6:0: [sdg] tag#203 Sense Key : Medium Error [current]
[15528340.545857] sd 6:0:6:0: [sdg] tag#203 Add. Sense: Unrecovered read error
[15528340.546017] sd 6:0:6:0: [sdg] tag#203 CDB: Read(16) 88 00 00 00 00 00 80 bb 9e 68 00 00 00 80 00 00
[15528340.546187] critical medium error, dev sdg, sector 2159779432 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0

I would go to the node in question, click on Ceph -> OSD and identify the OSD. Once identified, I would click Out, Stop and Destroy for that OSD, followed by a physical replacement of the failed drive.
 
Once identified, I would click Out, Stop and Destroy for that OSD, followed by a physical replacement of the failed drive.
I would like to add a small detail:
"Once identified, I would click Out..." Yes, but wait for the rebalance to finish at this point. Don't do "Stop and Destroy" too fast. In a small cluster even the broken disk could deliver missing pieces/PGs if something goes unexpectedly wrong.
Same for ZFS: "keep" the broken disk around as it is; it could be the life saver if the rebuild/recovery won't complete or the next disk dies at exactly that moment.

Destroy/delete it only after the full rebuild has completed.
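For completeness, the same careful sequence on the command line (a sketch; OSD id 12 is a placeholder, the GUI buttons Out/Stop/Destroy do the equivalent):

# 1) mark the failing OSD out so Ceph rebalances, but keep the daemon running
ceph osd out 12

# 2) wait until all PGs are active+clean again
ceph status

# 3) only then stop and destroy it, then replace the physical disk
systemctl stop ceph-osd@12
pveceph osd destroy 12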
 
