[TUTORIAL] FabU: can I use Ceph in a _very_ small cluster?

Maybe this is something which might be added as a link or of interest for some of you: ETH Zürich did a talk on benchmarking Ceph and its difficulties in preparing for the real world at this year's FOSDEM in Brussels:
https://fosdem.org/2025/schedule/ev...etic-benchmarks-to-real-world-user-workloads/
There were also some other talks on Ceph (although I didn't manage to get a seat, the room was unfortunately quite small): https://fosdem.org/2025/schedule/room/k3401/


Edit: Wasn't CERN but ETH Zürich, I mixed them up (due to both institutions being in Switzerland I suspect)
 
link or of interest
Thanks, I've added it. And I've downloaded it, but haven't watched it yet. All in all there seem to be 793 recordings from FOSDEM - and Ceph is not the only interesting topic... :-)
 
That means that you should already account for a max capacity = (number of nodes - 1) * (OSD capacity per node * 0.8) / nodes.
As a consequence of a different thread I need to add a caveat to this.

The above is only true IF each node has the SAME OSD capacity and the pool rule is replication: 3. The ACTUAL pool capacity would be 3x the capacity of the node with the smallest capacity in the case of 3 OSD nodes, or total OSD capacity / 3. The 80% high water mark needs to be observed PER OSD, which in practical terms means a lower usable amount for the pool, because OSD distribution will usually have a 5-10% variance within a given node (the more OSDs per node, the less the variance).
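
To make that concrete, here is a rough back-of-the-envelope sketch (my own illustration, not an official Ceph formula) of what a replication-3 pool over three nodes can actually hold once the smallest node, the 80% per-OSD high water mark and some distribution variance are taken into account:

```python
# Rough estimate of usable capacity for a replicated pool (size=3).
# One value per node = total OSD capacity of that node, in TB.
# Illustration only; real numbers come from `ceph df` / `ceph osd df`.

def usable_capacity_tb(node_capacities_tb, high_water=0.80, variance_margin=0.10):
    # With replication 3 across exactly 3 nodes, the pool can never hold
    # more data than the smallest node can store.
    limit = min(node_capacities_tb)
    # The 80% high water mark applies PER OSD, and PG distribution is
    # never perfectly even, so keep an extra safety margin on top.
    return limit * high_water * (1.0 - variance_margin)

# Example: three nodes with 4 TB, 4 TB and 2 TB of OSDs each.
print(usable_capacity_tb([4.0, 4.0, 2.0]))  # ~1.44 TB of usable data
```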
 
the 80% high water mark needs to be observed PER OSD,
If I remember correctly my six node Ceph was clever enough to distribute data in a way that the smaller OSDs were assigned less data - without any manual tuning. A small OSD would get 80% filled at (nearly) the same time as a large one.

One of my core statements was and is "you need some more nodes and several more OSDs than the minimum" for a good experience :-)

Disclaimer, as I've already stated: I've currently dropped my Ceph setup. I can't verify what I say now...
 
If I remember correctly my six node Ceph was clever enough to distribute data in a way that the smaller OSDs were assigned less data - without any manual tuning. A small OSD would get 80% filled at (nearly) the same time as a large one.
So here is the thing about that.

The algorithm will distribute PGs as best it can according to the rules, but the SIZE of the PGs is a function of the pool's total PG count. The larger the PGs, the more difficult it is to shoehorn them in evenly. You may then be tempted to deploy a large number of PGs by default so they end up smaller - which has the cost of potentially reduced performance and increased latency. Everything is a tradeoff.
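
A quick illustration of that size/count relationship (rough numbers, nothing Ceph-internal): the average PG size is roughly the stored data divided by the PG count, so a low PG count means big chunks that are hard to place evenly, while a high PG count means smaller chunks at the price of more overhead.

```python
# Rough illustration: average PG size is stored data / PG count.
# Real values come from the PG autoscaler / `ceph pg dump`; this only
# shows why the PG count matters for placement granularity.

def avg_pg_size_gb(stored_data_gb, pg_num):
    return stored_data_gb / pg_num

stored_gb = 2000  # 2 TB of data in the pool (before replication)
for pg_num in (32, 128, 512):
    print(f"{pg_num:4d} PGs -> {avg_pg_size_gb(stored_gb, pg_num):6.1f} GB per PG")
# 32 PGs  -> 62.5 GB chunks: coarse, hard to place evenly on small OSDs
# 128 PGs -> 15.6 GB chunks
# 512 PGs ->  3.9 GB chunks: easier to balance, but more overhead/latency
```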
 
If I remember correctly my six node Ceph was clever enough to distribute data in a way that the smaller OSDs were assigned less data - without any manual tuning. A small OSD would get 80% filled at (nearly) the same time as a large one.

One of my core statements was and is "you need some more nodes and several more OSDs than the minimum" for a good experience :-)

Disclaimer, as I've already stated: I've currently dropped my Ceph setup. I can't verify what I say now...
Can I ask what you're running as nodes? Roughly. What was the power draw like?

I have a 5-node cluster currently. A 6th machine just needs more storage if I decide to use it. The CRUSH rules automatically weight the OSDs based on their size. It happened automatically for me too.

I agree with your points from the original post. We will see if I keep mine around.
 
Can I ask what you're running as nodes? Roughly. What was the power draw like?
First: I've dropped Ceph.

The nodes in my homelab are a conglomerate of different manufacturers. First there were "normal" PCs (with Xeon + ECC though) in mini-towers like the HP ML110/ML310, Lenovo ThinkStation and so on. Those consumed 60 to 80 watts per node, and I saw the need to shrink down. Only one HP MicroServer is left today, drawing ~30 W or so. The others got "recycled" and became excellent PBSs, turned off most of the time. :-)

For some years the main nodes had been MinisForum HM80s, with 14 W idle, 25 W in normal use and >50 W peak. Then I thought to consolidate some nodes into larger ones and bought some ASRock DeskMeet X600s with 128 GiB RAM. Unfortunately these suck in much more power! And so I enter the next loop: I want the features of an X600 with less power consumption.

Worth mentioning: I also have some Odroid H2, H3 and H4 units. The H3 pulls 8.2 watts idle; I did not measure the H4, but it is probably similar. Unfortunately I prefer AMD Ryzen over a (too small for PVE!) Intel N5105 and similar. But for my continuously running PBS instance the H3 is fine.

So... I do not have a good recommendation - my hardware suite is too small to judge. And there are too many options, which... nevertheless is really great!

----
For completeness: in my day job I use "real" servers (Dell) with three-digit watts per node...
 
Apologies. I had meant to write it in the past tense. I had read you took it apart. Thank you. I was curious what motivated you to move to something different.

My power usage is probably half that, using MS-01s or NUC 9 Extremes. It would definitely add to the monthly cost. I'm also running enterprise NVMe drives, but with bonded 10GbE interfaces my performance seems pretty solid, even as I start to add databases later.

I’m not sure if I’d make the same choice if I got to do it over, but it’s working out ok for me currently. The whole setup is pretty quiet.
 
The beauty of Ceph is that it can run on anything. You need an extra network card (even 2.5G works in a small environment) and one disk per node, and off you go!
Yes, the performance isn't the maximum you can get, but it works, and you can test it all: failover, HA and everything. I also have one customer with a 3-node cluster, 2.5Gb cards and Samsung drives. And believe it or not, it works. So start small, and then keep adding.
 
The beauty of Ceph is that it can run on anything.

And believe it or not it works.

Sure! Or at least: maybe ;-)

This thread (the first post) is about the thoughts and insights I gained when I did exactly that. (For a year, not just testing for a weekend.) In a nutshell: to reach the promised goals you need much more hardware than the absolute minimum.

And as usual: ymmv - if it works for you, it is fine!
 
I'm trying to wrap my head around the possibilities here.

I've got a small home Proxmox setup right now; the main problem is I have to take the whole house offline if I want to upgrade, reboot or do any maintenance.
I currently have about 400 GB of VM images; 1 TB would cover foreseeable growth. It's far from heavy IOPS. I have 10Gb NICs in each node and a switch, so I can provide a dedicated storage network. I can put 64 GB of RAM in each node.

Let's say I did a 3-node cluster with 1 x 1 TB SATA SSD (+ boot drive) in each node, copies = 3, min = 2.
- Each node would have a full copy of the disk, writes would be replicated to every node.
- If a node is offline (patching), it would sync upon return.
- If an OSD fails, the node is effectively dead, I'm down to 2 nodes until the SSD is replaced.
- If a node dies, I'm down to 2 nodes until the node is replaced
- If a 2nd node dies, I go read-only and it's dead in the water
- If there is bitrot, CEPH will detect this on scrub or read, much like ZFS

It seems like failure scenarios are actually pretty sweet in this case?

Things get complicated if I want to expand beyond the single disk, because automatic repair wouldn't be possible until I expand to 3 SSDs per node, and I can't fill beyond 66% because there wouldn't be room for an automatic migration. Hence the general recommendation of 12 OSDs?
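
For what it's worth, here is a rough sanity check of that exact scenario (illustrative only; it assumes size=3, min_size=2 and the default per-host failure domain): with a single 1 TB OSD per node, usable space is bounded by the 80% high water mark, and there is no other OSD in the same host that could take over a failed OSD's copies, so the pool stays degraded until the SSD is replaced.

```python
# Illustrative check of the 3-node / 1-OSD-per-node scenario
# (size=3, min_size=2, failure domain = host). Not Ceph code.

osds_per_node = {"node1": [1.0], "node2": [1.0], "node3": [1.0]}  # TB per OSD
high_water = 0.80

# Usable data is bounded by the smallest node, minus the 80% per-OSD limit.
usable_tb = min(sum(osds) for osds in osds_per_node.values()) * high_water
print(f"usable: ~{usable_tb:.2f} TB")  # ~0.80 TB

# Self-healing after an OSD failure needs a spare OSD in the same
# failure domain (the host); with a single OSD per node there is none,
# so the pool runs degraded (2 of 3 copies) until the SSD is replaced.
spare_osds = {node: len(osds) - 1 for node, osds in osds_per_node.items()}
print(spare_osds)  # {'node1': 0, 'node2': 0, 'node3': 0}
```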
 
There is no "migration" with only three nodes. The "3" in your CRUSH rule refers to how many copies on individual nodes have to exist in order to have a healthy PG (placement group). The number of OSDs doesn't matter in this context - you can only use a maximum of the smallest node's capacity.
 
It seems like failure scenarios are actually pretty sweet in this case?
Well, my whole point in the first post is that the absolute minimum is not a scenario I would like to use.

If you think it will work fine for your use case: go for it! (No sarcasm, I mean it!)
 
I think what I'm really trying to say is that in that minimal scenario, things are actually pretty straightforward. If I lose an OSD, I lose a node. If I lose two nodes or two OSDs, I go read-only until I bring it back to 2 nodes, 2 OSDs.

That seems way more redundant than what I have with a single-node cluster, while (of course) being way less redundant than a 4 or 5 node cluster (which I don't want the electric bill for!). I'm really just asking "Am I missing something here, is there some complication I'm missing?"

The only other question in my mind is how things will operate with the cluster if a node dies - will Ceph stall writes until the node is declared dead, or will it continue to operate if it can write min copies to the surviving peers?
 
or will it continue to operate if it can write min copies to the surviving peers?
Yes, that's the idea behind "min_size=2" :-)
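
A tiny sketch of that rule of thumb, as I understand it (conceptual only, not Ceph source): with size=3 and min_size=2, PGs keep accepting writes as long as at least min_size replicas are up; only when a PG drops below min_size do writes stop until recovery.

```python
# Conceptual illustration of size/min_size for a replicated pool -
# the real decision is made per PG by Ceph; this is just the rule of thumb.

def pg_accepts_writes(replicas_up: int, min_size: int = 2) -> bool:
    # A PG keeps accepting writes as long as at least min_size replicas
    # are available; below that, writes stop until recovery.
    return replicas_up >= min_size

for up in (3, 2, 1):
    print(f"{up} replica(s) up -> writes allowed: {pg_accepts_writes(up)}")
# 3 -> True, 2 -> True (one node down), 1 -> False (below min_size)
```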
 
"Am I missing something here, is there some complication I'm missing"
No.

I think what I'm trying to say really is that in that minimal scenario, things are actually pretty straight forward. I lose an OSD, I lose a node. I lose two nodes or two clusters, and I go read-only until I bring it back to 2 node, 2 OSD.
Yes.

In my case I decided to go with a minimum of 4 nodes; the electricity bill is the bitter pill to swallow, but I have redundancy to still be online during maintenance.
 