[TUTORIAL] FabU: can I use Ceph in a _very_ small cluster?

UdoB

Ceph is great, but it needs some resources above the theoretical minimum to work reliably. My assumptions for the following text:
  • you want to use Ceph because... why not?
  • you want to use High Availability - which requires Shared Storage
  • you want to start as small (and cheap) as possible, because this is... “only” a Homelab
You plan for three nodes. Each node has a single dedicated disk for use as an “OSD”. This is the documented minimum, so it should be fine, shouldn’t it? Well... it is the absolute minimum for a cluster. With this minimal approach you use Ceph with the default replication settings "size=3/min_size=2". (Never go below that!)
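A quick way to check what a given pool actually uses (a sketch; "vm-pool" is just a placeholder for your pool name):

# show size/min_size (and more) for all pools at once
ceph osd pool ls detail

# or query a single pool
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size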

This construct allows one device to fail without data loss. This is great! No...? There are a couple of problem areas with this minimized approach:

Problem #1: there are zero redundant nodes​

When (not if) one OSD fails - or a whole node, at this point there is no difference! - Ceph is immediately degraded. There is no room for Ceph to heal itself, so being "degraded" is permanent. For a stable situation you really want nodes that can jump in and bring the cluster back to a stable condition - automatically. For Ceph in this picture this means having at least four nodes. (In this specific aspect; in other regards you really want to have five or more of them...)

Essentially: we want one more node than Ceph has “size=N”.
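How you would notice this: after a failure the health output keeps showing "undersized"/"degraded" PGs that never fully recover. A sketch of the commands to watch this:

# overall health; in the minimal setup "degraded" will not go away by itself
ceph status
ceph health detail

# follow recovery live
ceph -w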

Problem #2: data usage per OSD during normal operation

This detail is often forgotten: let's say you have those three nodes with two OSDs each. When one OSD fails, its direct neighbor on the same node has to take over the data from the dead disk. That data cannot be given to another node - the only two other nodes already hold a copy! This means you can fill the OSDs in this approach only up to roughly 45 percent: the original 45% plus the "other" 45% gets you to 90% on the surviving OSD. To reduce this problem you want several OSDs per node or - better! - avoid it (nearly) completely by having more than three nodes.

Essentially: we may need more “spare/unused space” than expected.
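To keep an eye on this, check the per-OSD utilization and the warning thresholds (the ratios shown in the comment are the usual defaults, verify them on your cluster):

# utilization per OSD and per node
ceph osd df tree

# the thresholds; typically nearfull_ratio 0.85, full_ratio 0.95
ceph osd dump | grep ratio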

Problem #3: realize that Ceph is much more critical for a cluster...​

...than a local SSD is for one of the three nodes: when Ceph goes read-only, all VMs in the whole cluster will probably stop within seconds or a very few minutes. They cannot write any data (including log messages, which practically every workload does continuously) and will stall.

Essentially: when we are utilizing Ceph it quickly becomes crucial!

Problem #4: the network needs to be fast...​

... as the data-to-be-written goes over the wire multiple times before it is considered "written". So a fast network is a must; 10 GBit/s should be considered the minimum. But yes, technically it works with slower speeds - at least in the beginning, with low load. When high usage leads to congestion, latency increases and you will encounter "strange" errors which may be hard to debug. (Not mentioned: zero network redundancy is bad.)

Essentially: we want 10 GBit/s or faster.
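To verify the raw bandwidth of the Ceph network between two nodes, a simple iperf3 run is often enough (a sketch; "pve2" is a placeholder for the other node's address on the Ceph network):

# on the first node
iperf3 -s

# on the second node
iperf3 -c pve2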

Problem #5: for SSDs/NVMe you probably already know the recommendation to use enterprise-class devices

That recommendation has multiple reasons (power-loss protection, sustained write performance, endurance); please consider them. If you went for - let's say - seven nodes with five OSDs each, the required quality of the OSDs (in a homelab) may be lower, but with a bare minimum number of disks they really need to be high quality.

Essentially: adequate hardware may be more expensive than the budget allows.
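One of those reasons: Ceph issues many small synchronous writes, and consumer devices without power-loss protection can be dramatically slower at those than their datasheet suggests. A hedged fio sketch to compare candidate disks - it writes to the device, so only run it on an empty disk; /dev/sdX is a placeholder:

# 4k synchronous single-threaded writes, roughly the pattern of the OSD DB/WAL
# WARNING: destructive - only on a disk without data!
fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based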

Problem #6: RAM​

Ceph does not come for free. For a small cluster we need, per daemon: MON = 1-2 GB; MGR = 1-2 GB; MDS = 1 GB; OSD = 3-5 GB. The small example cluster with three nodes (3 MONs + 1 MGR + 0 MDS + 2 OSDs per node) needs between 3×1 + 1×1 + 0 + (3×2)×3 = 22 GB and 3×2 + 1×2 + 0 + (3×2)×5 = 38 GB of RAM. (Ref: https://docs.ceph.com/en/mimic/start/hardware-recommendations/#ram )

Essentially: you can never have enough RAM, be it for Ceph or for ZFS (ARC)! (Do not over-commit massive amounts of RAM, it does not work very well.)
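The same estimate as a tiny shell calculation, so you can plug in your own daemon counts (numbers are the rough per-daemon figures from above):

# 3 MONs, 1 MGR, 0 MDS, 6 OSDs
echo "low:  $(( 3*1 + 1*1 + 0*1 + 6*3 )) GB"   # 22 GB
echo "high: $(( 3*2 + 1*2 + 0*1 + 6*5 )) GB"   # 38 GB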


To do, not discussed yet

  • Erasure Coding = better space efficiency with lower performance - needs a higher number of OSDs

Disclaimer, as usual: YMMV! While I am experimenting with Ceph I am definitely not an expert.



Now, after all the “bad news” details above, what do I do if I am a little bit paranoid but want to use Ceph nevertheless?
  • use five nodes (or more) with five MONs! This is a requirement to actually allow two nodes to fail - only with five MONs can three survivors still form a majority and act as wanted
  • increase the replication beyond "size=3/min_size=2", e.g. to "size=4/min_size=2" (see the sketch after this list)
  • use several OSDs per node
  • use SSD/NVMe only - or learn to put WAL/DB on devices separate from the data disks, which again increases the need for more independent devices...
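As referenced in the list above, raising the replication could look like this (a sketch; "vm-pool" is a placeholder, and size=4 only makes sense with enough nodes to place the extra copy):

# one more replica than the default; min_size stays at 2
ceph osd pool set vm-pool size 4
ceph osd pool set vm-pool min_size 2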

Final word: do not mix up Ceph MON majority with PVE Quorum. Technically they are completely independent and have their own requirements and possible pitfalls.
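You can check both independently on any node (a sketch):

# Proxmox VE (corosync) quorum
pvecm status

# Ceph monitor quorum
ceph quorum_status --format json-pretty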

PS:
 
Great post @UdoB

*Not an expert on Ceph, but just sharing my experience of which steps to take when a disk has failed in our situation.

When we detect an error on a disk, say from dmesg on the SSH console:

[15528340.545531] sd 6:0:6:0: [sdg] tag#203 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[15528340.545701] sd 6:0:6:0: [sdg] tag#203 Sense Key : Medium Error [current]
[15528340.545857] sd 6:0:6:0: [sdg] tag#203 Add. Sense: Unrecovered read error
[15528340.546017] sd 6:0:6:0: [sdg] tag#203 CDB: Read(16) 88 00 00 00 00 00 80 bb 9e 68 00 00 00 80 00 00
[15528340.546187] critical medium error, dev sdg, sector 2159779432 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0

I would go to the node in question, click on Ceph -> OSD and identify the OSD. Once identified, I would click Out, Stop and Destroy for that OSD, followed by a physical replacement of the failed drive.
 
Once identified, I would click Out, Stop and Destroy for that OSD, followed by a physical replacement of the failed drive.
I would like to add a small detail:
"Once identified, I would click Out..." Yes, but wait for the rebalance to finish at this point. Don't do "Stop and Destroy" too fast. In a small cluster even the broken disk could deliver missing pieces/PGs if something goes unexpectedly wrong.
Same for ZFS: "keep" the broken disk around as it is; it could be the life saver if the rebuild/recovery won't complete or the next disk dies at exactly that moment.

Destroy/delete it only after the full rebuild has completed.
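For completeness, the same careful sequence on the command line (a sketch; OSD id 12 is a placeholder, the GUI buttons Out/Stop/Destroy do the equivalent):

# 1) mark the failing OSD out so Ceph rebalances, but keep the daemon running
ceph osd out 12

# 2) wait until all PGs are active+clean again
ceph status

# 3) only then stop and destroy it, then replace the physical disk
systemctl stop ceph-osd@12
pveceph osd destroy 12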
 
