[TUTORIAL] FabU: can I use Ceph in a _very_ small cluster?

Maybe this is something which might be added as a link or be of interest for some of you: ETH Zürich gave a talk on benchmarking Ceph and its difficulties in preparing for the real world at this year's FOSDEM in Brussels:
https://fosdem.org/2025/schedule/ev...etic-benchmarks-to-real-world-user-workloads/
There were also some other talks on Ceph (although I didn't manage to get a seat, the room was unfortunately quite small): https://fosdem.org/2025/schedule/room/k3401/


Edit: It wasn't CERN but ETH Zürich, I mixed them up (probably because both institutions are in Switzerland).
 
link or be of interest
Thanks, I've added it. And I've downloaded it, but haven't watched it yet. All in all there seem to be 793 recordings from FOSDEM - and Ceph is not the only interesting topic... :-)
 
  • Like
Reactions: Johannes S
That means that you should already account for a max capacity = (number of nodes - 1) * (OSD capacity/node * 0.8) / nodes.
As a consequence of a different thread I need to caveat this.

The above is only true IF each node has the SAME OSD CAPACITY and the pool rule is replication:3. ACTUAL pool capacity would be 3x the capacity of the node with the smallest capacity in the case of 3 OSD nodes, or total OSD capacity / 3. The 80% high-water mark needs to be observed PER OSD, which in practical terms means a lower amount for the pool, because OSD distribution will usually have a 5-10% variance within a given node (the more OSDs per node, the less the variance).
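
To make that concrete, here is a rough back-of-the-envelope sketch in Python for the 3-node, replication:3 case described above. The node sizes, the 80% fill limit and the ~10% variance allowance are just example assumptions, not measured values:

```python
# Rough usable-capacity estimate for a replicated (size=3) pool on 3 OSD nodes.
# Node capacities, fill limit and variance allowance are illustrative values only.

def usable_pool_capacity(node_capacities_tib, replicas=3, fill_limit=0.8, osd_variance=0.10):
    """Estimate usable pool capacity in TiB for a small replicated pool.

    With one replica per node, raw pool capacity is bounded by the smallest
    node; dividing by the replica count gives the net (single-copy) capacity.
    The 80% per-OSD high-water mark and an allowance for uneven PG
    distribution across the OSDs reduce it further.
    """
    raw_limit = min(node_capacities_tib) * len(node_capacities_tib)  # e.g. 3x the smallest node
    net = raw_limit / replicas                                       # one usable copy of the data
    return net * fill_limit * (1 - osd_variance)

# Example: nodes with 8, 8 and 6 TiB of OSDs -> bounded by the 6 TiB node.
print(round(usable_pool_capacity([8, 8, 6]), 2), "TiB usable (approx.)")
```

With these example numbers the pool ends up at roughly 4.3 TiB usable - noticeably less than the ~7.3 TiB a naive "total raw / 3" would suggest.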
 
  • Like
Reactions: Johannes S
the 80% high water mark needs to be observed PER OSD,
If I remember correctly my six node Ceph was clever enough to distribute data in a way that the smaller OSDs were assigned less data - without any manual tuning. A small OSD would get 80% filled at (nearly) the same time as a large one.
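
Just to illustrate why that works (the sizes and data volume below are made-up example values, not from this setup): an OSD's CRUSH weight defaults to its capacity, so the expected share of data per OSD scales with its size, and mixed-size OSDs should hit any given fill percentage at roughly the same time:

```python
# Illustration of size-proportional placement: an OSD's CRUSH weight defaults
# to its capacity, so its expected share of the data scales with its size.
# OSD sizes and the amount of pool data are made-up example values.

osd_sizes_tib = {"osd.0": 1.0, "osd.1": 2.0, "osd.2": 4.0}
pool_data_tib = 4.0  # one replica's worth of data landing on these OSDs

total_weight = sum(osd_sizes_tib.values())
for osd, size in osd_sizes_tib.items():
    expected = pool_data_tib * size / total_weight  # share proportional to weight
    print(f"{osd}: {expected:.2f} of {size} TiB used -> {expected / size:.0%} full")
```

Every OSD lands at the same ~57% here; in practice the per-OSD variance mentioned above means they drift apart by a few percent.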

One of my core statements was and is "you need some more nodes and several more OSDs than the minimum" for a good experience :-)

Disclaimer, as I've already stated: I've currently dropped my Ceph setup. I can't verify what I say now...
 
If I remember correctly my six node Ceph was clever enough to distribute data in a way that the smaller OSDs were assigned less data - without any manual tuning. A small OSD would get 80% filled at (nearly) the same time as a large one.
So here is the thing about that.

The algorithm will distribute PGs as best it can according to the rules, but the SIZE of the PGs is a function of the pool's total PG count. The larger the PGs, the more difficult it is to shoehorn them in evenly. You may then be tempted to deploy a large number of PGs by default so they end up smaller - which comes at the cost of potentially reduced performance and increased latency. Everything is a tradeoff.
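
A quick way to get a feel for that trade-off (numbers are illustrative, not from this thread): the average PG size is simply the pool's data volume divided by its PG count, so a low pg_num means large chunks that are harder to place evenly, while a high pg_num means smaller chunks at the cost of more per-PG overhead:

```python
# Average PG size as a function of the pool's PG count. Bigger PGs are coarser
# "chunks" and balance less evenly across OSDs; more (smaller) PGs balance
# better but add per-PG overhead. Numbers are illustrative.

pool_data_gib = 4096  # e.g. 4 TiB of data in the pool

for pg_num in (32, 128, 512):
    avg_pg_gib = pool_data_gib / pg_num
    print(f"pg_num={pg_num:4d} -> average PG ~{avg_pg_gib:.0f} GiB")
```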
 
If I remember correctly my six node Ceph was clever enough to distribute data in a way that the smaller OSDs were assigned less data - without any manual tuning. A small OSD would get 80% filled at (nearly) the same time as a large one.

One of my core statements was and is "you need some more nodes and several more OSDs than the minimum" for a good experience :-)

Disclaimer, as I've already stated: I've currently dropped my Ceph setup. I can't verify what I say now...
Can I ask what you were running as nodes? Roughly. What was the power draw like?

I have a 5-node cluster currently. The 6th machine just needs more storage if I decide to use it. The CRUSH rules automatically weight the OSDs based on their size. It happened automatically for me too.

I agree with your points from the original post. We will see if I keep mine around.
 
Can I ask what you were running as nodes? Roughly. What was the power draw like?
First: I've dropped Ceph.

The nodes in my homelab are a conglomerate of different manufacturers. First there were "normal" PCs (with Xeon + ECC, though) in mini-towers like HP ML110/ML310, Lenovo ThinkStation and so on. Those consumed 60 to 80 watts per node, and I saw the need to shrink down. Only one HP MicroServer is left today, drawing ~30 W or so. The others got "recycled" and became excellent PBSs, turned off most of the time. :-)

For some years the main nodes had been MinisForum HM80s, with 14 W idle, ~25 W in normal use and >50 W peak. Then I thought to consolidate some nodes into larger ones and bought some ASRock DeskMeet X600s w/ 128 GiB RAM. Unfortunately these draw much more power! And so I enter the next loop: I want the features of an X600 with less power consumption.

Worth mentioning: I also have some Odroid H2, H3 and H4 units. The H3 pulls 8.2 watts idle; I did not measure the H4, but it is probably similar. Unfortunately I prefer AMD Ryzen over a (too small for PVE!) Intel N5105 and the like. But for my continuously running PBS instance the H3 is fine.

So... I do not have a good recommendation - my hardware suite is too small to judge. And there are too many options, which... nevertheless is really great!

----
For completeness: in my day job I use "real" servers (Dell) drawing three-digit watts per node...
 
Apologies. I had meant to write it in the past tense. I had read you took it apart. Thank you. I was curious what motivated you to move to something different.

My power usage is probably half that, using MS-01s or NUC 9 Extremes. It would definitely add to the monthly cost. I'm also running enterprise NVMes, but with bonded 10 GbE interfaces my performance seems pretty solid, even as I start to add databases later.

I’m not sure if I’d make the same choice if I got to do it over, but it’s working out ok for me currently. The whole setup is pretty quiet.
 
  • Like
Reactions: Johannes S and UdoB
The beauty of Ceph is that it can run on anything. You need an extra network card (even 2.5G works in a small environment) and one disk per node, and off you go!
Yes, the performance isn't the maximum you can get, but it works, and you can test it all: failover, HA and everything. I also have one customer with a 3-node cluster, 2.5Gb cards and Samsung drives. And believe it or not, it works. So start small, and then keep adding.
 
The beauty of Ceph is that it can run on anything.

And believe it or not, it works.

Sure! Or at least: maybe ;-)

This thread (the first post) is about the thoughts and insights I gathered when I did exactly that. (For a year, not just testing for a weekend.) In a nutshell: to reach the promised goals you need much more hardware than the absolute minimum.

And as usual: ymmv - if it works for you, it is fine!
 
  • Like
Reactions: Johannes S