[TUTORIAL] FabU: can I use Ceph in a _very_ small cluster?

Maybe this is something which might be added as a link or of interest for some of you: ETH Zürich did a talk on benchmarking Ceph and its difficulties in preparing for the real world at this year's FOSDEM in Brussels:
https://fosdem.org/2025/schedule/ev...etic-benchmarks-to-real-world-user-workloads/
There were also some other talks on Ceph (although I didn't manage to get a seat, the room was unfortunately quite small): https://fosdem.org/2025/schedule/room/k3401/


Edit: Wasn't CERN but ETH Zürich, I mixed them up (due to both institutions being in Switzerland I suspect)
 
link or of interest
Thanks, I've added it. And I've downloaded it, but haven't watched it yet. All in all there seem to be 793 recordings from FOSDEM - and Ceph is not the only interesting topic... :-)
 
That means that you should already account for a max capacity = (number of nodes - 1) * (OSD capacity per node * 0.8) / nodes.
As a consequence of a different thread I need to add a caveat to this.

The above is only true IF each node has the SAME OSD capacity and the pool rule is replication: 3. The ACTUAL pool capacity would be 3x the capacity of the node with the smallest capacity in the case of 3 OSD nodes, or total OSD capacity / 3. The 80% high water mark needs to be observed PER OSD, which in practical terms means a lower usable amount for the pool, because OSD distribution will usually have a 5-10% variance within a given node (the more OSDs per node, the less the variance).
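
To make that concrete, here is a rough back-of-the-envelope sketch (my own illustration, not an official Ceph formula) of what a replication-3 pool over three nodes can actually hold once the smallest node, the 80% per-OSD high water mark and some distribution variance are taken into account:

```python
# Rough estimate of usable capacity for a replicated pool (size=3).
# One value per node = total OSD capacity of that node, in TB.
# Illustration only; real numbers come from `ceph df` / `ceph osd df`.

def usable_capacity_tb(node_capacities_tb, high_water=0.80, variance_margin=0.10):
    # With replication 3 across exactly 3 nodes, the pool can never hold
    # more data than the smallest node can store.
    limit = min(node_capacities_tb)
    # The 80% high water mark applies PER OSD, and PG distribution is
    # never perfectly even, so keep an extra safety margin on top.
    return limit * high_water * (1.0 - variance_margin)

# Example: three nodes with 4 TB, 4 TB and 2 TB of OSDs each.
print(usable_capacity_tb([4.0, 4.0, 2.0]))  # ~1.44 TB of usable data
```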
 
the 80% high water mark needs to be observed PER OSD,
If I remember correctly my six node Ceph was clever enough to distribute data in a way that the smaller OSDs were assigned less data - without any manual tuning. A small OSD would get 80% filled at (nearly) the same time as a large one.

One of my core statements was and is "you need some more nodes and several more OSDs than the minimum" for a good experience :-)

Disclaimer, as I've already stated: I've currently dropped my Ceph setup. I can't verify what I say now...
 
If I remember correctly my six node Ceph was clever enough to distribute data in a way that the smaller OSDs were assigned less data - without any manual tuning. A small OSD would get 80% filled at (nearly) the same time as a large one.
So here is the thing about that.

The algorithm will distribute PGs as best it can according to the rules, but the SIZE of the PGs is a function of the pool's total PG count. The larger the PGs, the more difficult it is to shoehorn them in evenly. You may then be tempted to deploy a large number of PGs by default so they end up smaller - which has the cost of potentially reduced performance and increased latency. Everything is a tradeoff.
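
A quick illustration of that size/count relationship (rough numbers, nothing Ceph-internal): the average PG size is roughly the stored data divided by the PG count, so a low PG count means big chunks that are hard to place evenly, while a high PG count means smaller chunks at the price of more overhead.

```python
# Rough illustration: average PG size is stored data / PG count.
# Real values come from the PG autoscaler / `ceph pg dump`; this only
# shows why the PG count matters for placement granularity.

def avg_pg_size_gb(stored_data_gb, pg_num):
    return stored_data_gb / pg_num

stored_gb = 2000  # 2 TB of data in the pool (before replication)
for pg_num in (32, 128, 512):
    print(f"{pg_num:4d} PGs -> {avg_pg_size_gb(stored_gb, pg_num):6.1f} GB per PG")
# 32 PGs  -> 62.5 GB chunks: coarse, hard to place evenly on small OSDs
# 128 PGs -> 15.6 GB chunks
# 512 PGs ->  3.9 GB chunks: easier to balance, but more overhead/latency
```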
 
If I remember correctly my six node Ceph was clever enough to distribute data in a way that the smaller OSDs were assigned less data - without any manual tuning. A small OSD would get 80% filled at (nearly) the same time as a large one.

One of my core statements was and is "you need some more nodes and several more OSDs than the minimum" for a good experience :-)

Disclaimer, as I've already stated: I've currently dropped my Ceph setup. I can't verify what I say now...
Can I ask what you're running as nodes? Roughly. What was the power draw like?

I have a 5-node cluster currently. A 6th machine just needs more storage if I decide to use it. The CRUSH rules automatically weight the OSDs based on their size. It happened automatically for me too.

I agree with your points from the original post. We will see if I keep mine around.
 
Can I ask what you're running as nodes? Roughly. What was the power draw like?
First: I've dropped Ceph.

The nodes in my homelab are a conglomerate of different manufacturers. First there were "normal" PCs (with Xeon + ECC though) in mini-towers like the HP ML110/ML310, Lenovo ThinkStation and so on. Those consumed 60 to 80 watts per node, and I saw the need to shrink down. Only one HP MicroServer is left today, drawing ~30 W or so. The others got "recycled" and became excellent PBSs, turned off most of the time. :-)

For some years the main nodes had been MinisForum HM80s, with 14 W idle, 25 W in normal use and >50 W peak. Then I thought to consolidate some nodes into larger ones and bought some ASRock DeskMeet X600s with 128 GiB RAM. Unfortunately these suck in much more power! And so I enter the next loop: I want the features of an X600 with less power consumption.

Worth mentioning: I also have some Odroid H2, H3 and H4 units. The H3 pulls 8.2 watts idle; I did not measure the H4, but it is probably similar. Unfortunately I prefer AMD Ryzen over a (too small for PVE!) Intel N5105 and similar. But for my continuously running PBS instance the H3 is fine.

So... I do not have a good recommendation - my hardware suite is too small to judge. And there are too many options, which... nevertheless is really great!

----
For completeness: in my day job I use "real" servers (Dell) with three-digit watts per node...
 
Apologies. I had meant to write it in the past tense. I had read you took it apart. Thank you. I was curious what motivated you to move to something different.

My power usage is probably half that, using MS-01s or NUC 9 Extremes. It would definitely add to the monthly cost. I'm also running enterprise NVMe drives, but with bonded 10GbE interfaces my performance seems pretty solid, even as I start to add databases later.

I’m not sure if I’d make the same choice if I got to do it over, but it’s working out ok for me currently. The whole setup is pretty quiet.
 
The beauty of Ceph is that it can run on anything. You need an extra network card (even 2.5G works in a small environment) and one disk per node, and off you go!
Yes, the performance isn't the maximum you can get, but it works, and you can test it all: failover, HA and everything. I also have one customer with a 3-node cluster, 2.5Gb cards and Samsung drives. And believe it or not, it works. So start small, and then keep adding.
 
The beauty of Ceph is that it can run on anything.

And believe it or not it works.

Sure! Or at least: maybe ;-)

This thread (the first post) is about the thoughts and insights I gained when I did exactly that. (For a year, not just testing for a weekend.) In a nutshell: to reach the promised goals you need much more hardware than the absolute minimum.

And as usual: ymmv - if it works for you, it is fine!
 
I'm trying to wrap my head around the possibilities here.

I've got a small home Proxmox setup right now; the main problem is I have to take the whole house offline if I want to upgrade, reboot or do any maintenance.
I currently have about 400 GB of VM images; 1 TB would cover foreseeable growth. It's far from heavy IOPS. I have 10Gb NICs in each node and a switch, so I can provide a dedicated storage network. I can put 64 GB of RAM in each node.

Let's say I did a 3-node cluster with 1 x 1 TB SATA SSD (+ boot drive) in each node, copies = 3, min = 2.
- Each node would have a full copy of the disk, writes would be replicated to every node.
- If a node is offline (patching), it would sync upon return.
- If an OSD fails, the node is effectively dead, I'm down to 2 nodes until the SSD is replaced.
- If a node dies, I'm down to 2 nodes until the node is replaced
- If a 2nd node dies, I go read-only and it's dead in the water
- If there is bitrot, CEPH will detect this on scrub or read, much like ZFS

It seems like failure scenarios are actually pretty sweet in this case?

Things get complicated if I want to expand beyond the single disk, because automatic repair wouldn't be possible until I expand to 3 SSDs per node, and I can't fill beyond 66% because there wouldn't be room for an automatic migration. Hence the general recommendation of 12 OSDs?
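
For what it's worth, here is a rough sanity check of that exact scenario (illustrative only; it assumes size=3, min_size=2 and the default per-host failure domain): with a single 1 TB OSD per node, usable space is bounded by the 80% high water mark, and there is no other OSD in the same host that could take over a failed OSD's copies, so the pool stays degraded until the SSD is replaced.

```python
# Illustrative check of the 3-node / 1-OSD-per-node scenario
# (size=3, min_size=2, failure domain = host). Not Ceph code.

osds_per_node = {"node1": [1.0], "node2": [1.0], "node3": [1.0]}  # TB per OSD
high_water = 0.80

# Usable data is bounded by the smallest node, minus the 80% per-OSD limit.
usable_tb = min(sum(osds) for osds in osds_per_node.values()) * high_water
print(f"usable: ~{usable_tb:.2f} TB")  # ~0.80 TB

# Self-healing after an OSD failure needs a spare OSD in the same
# failure domain (the host); with a single OSD per node there is none,
# so the pool runs degraded (2 of 3 copies) until the SSD is replaced.
spare_osds = {node: len(osds) - 1 for node, osds in osds_per_node.items()}
print(spare_osds)  # {'node1': 0, 'node2': 0, 'node3': 0}
```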
 
There is no "migration" with only three nodes. The "3" in your CRUSH rule refers to how many copies on individual nodes have to exist in order to have a healthy PG (placement group). The number of OSDs doesn't matter in this context - you can only use a maximum of the smallest node's capacity.
 
It seems like failure scenarios are actually pretty sweet in this case?
Well, my whole point in the first post is that the absolute minimum is not a scenario I would like to use.

If you think it will work fine for your use case: go for it! (No sarcasm, I mean it!)
 
I think what I'm really trying to say is that in that minimal scenario, things are actually pretty straightforward. If I lose an OSD, I lose a node. If I lose two nodes or two OSDs, I go read-only until I bring it back to 2 nodes, 2 OSDs.

That seems way more redundant than what I have with a single-node cluster, while (of course) being way less redundant than a 4 or 5 node cluster (which I don't want the electric bill for!). I'm really just asking "Am I missing something here, is there some complication I'm missing?"

The only other question in my mind is how things will operate with the cluster if a node dies - will Ceph stall writes until the node is declared dead, or will it continue to operate if it can write min copies to the surviving peers?
 
or will it continue to operate if it can write min copies to the surviving peers?
Yes, that's the idea behind "min_size=2" :-)
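
A tiny sketch of that rule of thumb, as I understand it (conceptual only, not Ceph source): with size=3 and min_size=2, PGs keep accepting writes as long as at least min_size replicas are up; only when a PG drops below min_size do writes stop until recovery.

```python
# Conceptual illustration of size/min_size for a replicated pool -
# the real decision is made per PG by Ceph; this is just the rule of thumb.

def pg_accepts_writes(replicas_up: int, min_size: int = 2) -> bool:
    # A PG keeps accepting writes as long as at least min_size replicas
    # are available; below that, writes stop until recovery.
    return replicas_up >= min_size

for up in (3, 2, 1):
    print(f"{up} replica(s) up -> writes allowed: {pg_accepts_writes(up)}")
# 3 -> True, 2 -> True (one node down), 1 -> False (below min_size)
```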
 
"Am I missing something here, is there some complication I'm missing"
No.

I think what I'm trying to say really is that in that minimal scenario, things are actually pretty straight forward. I lose an OSD, I lose a node. I lose two nodes or two clusters, and I go read-only until I bring it back to 2 node, 2 OSD.
Yes.

In my case I decided to go with a minimum of 4 nodes; the electricity bill is the bitter pill to swallow, but I have redundancy to still be online during maintenance.
 