Ceph is great, but it needs some resources above the theoretical minimum to work reliably. My assumptions for the following text:
- you want to use Ceph because... why not?
- you want to use High Availability - which requires Shared Storage
- you want to start as small (and cheap) as possible, because this is... “only” a Homelab
This construct - three nodes with two OSDs each and the default replication of “size=3/min_size=2” - allows one device to fail without data loss. This is great! No...? There are a couple of problem areas with this minimized approach:
Problem #1: there are zero redundant nodes
When (not: if) one OSD (or a whole node - at this point there is no difference!) fails, Ceph is immediately degraded. There is no room for Ceph to heal itself, so being "degraded" is permanent. For a stable situation you really want nodes that can jump in and return the cluster to a stable condition - automatically. For Ceph in this picture this means having at least four nodes. (In this specific aspect; in other regards you really want to have five or more of them...)

Essentially: we want one node more than the pool’s “size=N”.
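To see this on a live cluster: a single failure on a three-node/size=3 setup shows up as permanently degraded. A minimal check sketch (the pool name “vm-pool” is just a placeholder):

```
# Show the replica count of a pool (pool name is a placeholder)
ceph osd pool get vm-pool size

# Overall health: on a 3-node cluster with size=3 a single dead OSD
# leaves HEALTH_WARN with "undersized/degraded" PGs forever, because
# there is no spare node to re-create the third replica on
ceph -s
ceph health detail
```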
Problem #2: data usage per OSD during normal operation
This detail is often forgotten: let's say you have those three nodes with two OSDs each. When one OSD fails, its direct neighbor needs to take over the data from the dead disk. That data cannot be given to another node - the only two other nodes already have a copy! This means you can fill all OSDs in this approach only up to 45 percent: the original 45% plus the "other" 45% gets you 90% on the surviving OSD - just below the default 95% “full” limit, at which Ceph stops accepting writes. To reduce this problem you want several OSDs per node or - better! - avoid it (nearly) completely by having more than three nodes.

Essentially: we may need more “spare/unused space” than expected.
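The arithmetic behind the 45%, plus where to watch the relevant limits (the formula is a rough sketch for this specific layout, not an official Ceph rule):

```
# Two OSDs per node: the survivor must absorb its neighbor's data,
# so fill + fill must stay below ~90% (default full_ratio is 95%):
#   2 * 45% = 90%
# More OSDs per node relax the limit, roughly:
#   safe_fill ~= (osds_per_node - 1) / osds_per_node * 90%

ceph osd df                  # utilization per OSD
ceph osd dump | grep ratio   # nearfull/backfillfull/full ratios
```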
Problem #3: realize that Ceph is much more critical for a cluster...
...than a local SSD is for one of the three nodes: when Ceph goes read-only, all VMs in the whole cluster will probably stop immediately (after seconds or a very few minutes). They cannot write any data (including log messages, which practically every system writes continuously) and will stall.

Essentially: when we are utilizing Ceph it quickly becomes crucial!
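The trigger for this stall is “min_size”: as soon as fewer than “min_size” replicas of a placement group are available, Ceph pauses I/O on it. A sketch of what to look for:

```
# size=3/min_size=2: one lost replica -> degraded but still writable;
# two lost replicas -> PGs go inactive and all client I/O stalls
ceph health detail   # look for "Reduced data availability: N pgs inactive"
ceph pg stat         # quick count of active/undersized/degraded PGs
```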
Problem #4: the network needs to be fast...
... as the data-to-be-written goes over the wire multiple times before it is considered "written". So a fast network is a must, and 10 GBit/s should be considered the minimum. Yes, technically it works with slower speeds - at least at the beginning, with low load. But when high usage leads to congestion, latency increases and you will encounter "strange" errors which may be hard to debug. (Not even mentioned here: zero network redundancy is bad.)

Essentially: we want 10 GBit/s or faster.
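Before putting load on the cluster it is worth measuring the dedicated Ceph network; a minimal sketch with iperf3 (the hostname “node2” is a placeholder):

```
# On the receiving node:
iperf3 -s

# On the sending node, over the Ceph network:
iperf3 -c node2 -t 30    # expect ~9.4 Gbit/s on a healthy 10 GBit/s link

# Latency matters just as much, since every write waits for its replicas:
ping -c 100 node2
```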
Problem #5: SSDs/NVMe - you probably already know the recommendation regarding Enterprise class devices.
That recommendation has multiple reasons, please consider them. If you were to go for - let's say - seven nodes with five OSDs each, the required quality of the OSDs (in a homelab) may be lower, but with a bare minimum number of disks they really need to be high quality.

Essentially: adequate hardware may be more expensive than the budget allows.
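A common way to check whether a device is up to the job is a single-threaded 4k sync-write test with fio; consumer SSDs without power-loss protection tend to collapse here. A sketch (the device path is a placeholder - this test destroys data on the target!):

```
# DESTRUCTIVE on the target device - /dev/sdX is a placeholder!
fio --name=osd-suitability --filename=/dev/sdX \
    --ioengine=libaio --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based
# Enterprise devices with power-loss protection typically sustain
# thousands of IOPS here; consumer devices often just a few hundred.
```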
Problem #6: RAM
Ceph does not come for free. For a small cluster we need, per daemon: MON = 1-2 GB; MGR = 1-2 GB; MDS = 1 GB; OSD = 3-5 GB. The small example cluster with three nodes - 3 MONs + 1 MGR + 0 MDS in total, plus 2 OSDs per node - needs 3*1 + 1*1 + 0 + 3*2*3 = 22 GB up to 3*2 + 1*2 + 0 + 3*2*5 = 38 GB of RAM. (Ref: https://docs.ceph.com/en/mimic/start/hardware-recommendations/#ram )

Essentially: you never can have enough RAM, be it for Ceph or for ZFS (ARC)! (Do not over-commit massive amounts of RAM, it does not work very well.)
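The same budget as a quick shell calculation, plus the knob behind the biggest consumer (lowering it is a trade-off, not a recommendation):

```
# 3 MONs, 1 MGR, 6 OSDs (2 per node), no MDS:
echo $(( 3*1 + 1*1 + 6*3 ))   # 22 GB lower bound
echo $(( 3*2 + 1*2 + 6*5 ))   # 38 GB upper bound

# Per-OSD usage is governed by osd_memory_target (default: 4 GiB);
# on RAM-starved nodes it can be reduced, e.g. to 3 GiB:
ceph config set osd osd_memory_target 3221225472
```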
Todo, not discussed yet
- Erasure Coding = better space efficiency with lower performance - needs a higher number of OSDs
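For completeness, a hedged sketch of what an erasure-coded pool could look like (the profile k=4/m=2 is just an example - and it already wants six OSDs, ideally on six different nodes):

```
# k=4 data + m=2 coding chunks: tolerates two lost OSDs at only
# 1.5x raw space usage (versus 3x for size=3 replication)
ceph osd erasure-code-profile set ec-4-2 k=4 m=2
ceph osd pool create ecpool 32 erasure ec-4-2
ceph osd pool set ecpool allow_ec_overwrites true   # needed for RBD/VM disks
```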
Disclaimer, as usual: YMMV! While I am experimenting with Ceph I am definitely not an expert.
Now after all the “bad news”-details above, what do I do if I am a little bit paranoid but want to use Ceph nevertheless?
- use five nodes (or more) with five MONs! This is a requirement for actually allowing two nodes to fail - only with five MONs can three survivors still form a majority
- increase the replication above the default "size=3/min_size=2" (see the sketch after this list)
- use several OSDs per node
- use SSD/NVMe only - or learn to put WAL/DB on devices separate from the data disks, which again increases the need for more independent devices...
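Two of these points as a command-line sketch (pool name and device paths are placeholders; size=4/min_size=2 is one possible choice, not a universal recommendation):

```
# Raise replication above the default 3/2 on an existing pool:
ceph osd pool set vm-pool size 4
ceph osd pool set vm-pool min_size 2

# Create an OSD with its WAL/DB on a separate, faster device:
pveceph osd create /dev/sdX -db_dev /dev/nvme0n1
```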
Final word: do not mix up Ceph MON majority with PVE quorum. Technically they are completely independent and have their own requirements and possible pitfalls.
PS:
- OSD = Object Storage Daemon = a piece of software. Often, but not always, used as a synonym for “a disk” as one OSD usually handles one disk. Generic services overview: https://docs.ceph.com/en/quincy/cephadm/services/
- https://pve.proxmox.com/pve-docs/chapter-pveceph.html
- FabU = Frequently answered by Udo - just used as an unusual search term