[SOLVED] Ceph safe nearfull ratio

ph0x

Renowned Member
Jul 5, 2020
Recently, I recommended the Ceph calculator from florian.ca a couple of times, and now that I'm about to set up a cluster myself, I consulted it for a safe cluster size as well.
However, I don't seem to have grasped every detail of the Ceph concepts. I read somewhere that CRUSH does not place the same PG on any given node more than once. With a pool size of 3/2 and 2 OSDs per node, why should I set the nearfull ratio to 0.67 on a three-node cluster, as the calculator suggests?
From what I understand, CRUSH will always place one replica of each PG on each node and will not redistribute them if a node fails, because the PGs would then end up on a node twice. So there should be no need to reserve a third of the cluster's space for the OSDs of a failed node, should there?
Is the above true, or is there a flaw in my reasoning?

Regards,
Marco
 
Ceph will, by default, try to keep as much redundancy as possible in order to maintain as many copies of the data as configured in the pool (typically 3), even if at some point it has to go beyond the "1 PG per host" rule.

If an OSD stops responding, it will be marked "down"; Ceph then waits up to "mon osd down out interval" (600 seconds by default) before also marking it "out". Once an OSD is "down" and "out", its PGs will be remapped to other OSDs.

Say you have just one pool with 3 replicas and a 3-node cluster where every node has the same disk capacity. If a node fails for longer than 600 seconds, all of its OSDs will be marked "down" and "out", forcing a remap of their PGs to the OSDs of the remaining 2 nodes, even if that overrides the "only one copy on every node" rule. For that remap to fit on your remaining OSDs, you need at least 34% of free space in your cluster, or the remaining OSDs will become full. Avoid full OSDs at *all* costs: you won't be able to write to your Ceph cluster, which effectively disrupts your VMs' services and, in certain cases, risks VM filesystem damage. Recovery from full OSDs can also be painful and slow. If you have nodes with more/less disk space, think about what will happen if that node fails and adapt your free space accordingly.
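To make that free-space arithmetic concrete, here is a minimal sketch of the calculation, assuming the same logic the florian.ca calculator appears to use (everything stored must still fit after the largest node is lost); the helper name and the node sizes are purely illustrative:

# A minimal sketch (hypothetical helper, not taken from the calculator):
# how full may the raw cluster get so that the data held by the largest
# node still fits on the remaining nodes after that node fails?
def safe_used_ratio(node_capacities_tb):
    total = sum(node_capacities_tb)
    largest = max(node_capacities_tb)
    # Everything must still fit once the largest node is gone.
    return (total - largest) / total

# Three equal nodes, e.g. 2 x 4 TB OSDs each = 8 TB per node:
print(round(safe_used_ratio([8, 8, 8]), 2))   # 0.67 -> keep ~34% free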

Having a nearfull ratio of 0.67 will raise a warning once you have used 67% of your disk space, helping you to take action and avoid those risks.
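If you decide to adjust that threshold, something along these lines should work. This is only a sketch that shells out to the standard ceph CLI on a node with a working admin client, so double-check the reported ratios against your own cluster:

import subprocess

# Sketch: set the nearfull warning threshold to 0.67 and read the
# cluster-wide ratios back ("ceph osd dump" lists full_ratio,
# backfillfull_ratio and nearfull_ratio).
subprocess.run(["ceph", "osd", "set-nearfull-ratio", "0.67"], check=True)

dump = subprocess.run(["ceph", "osd", "dump"], check=True,
                      capture_output=True, text=True).stdout
for line in dump.splitlines():
    if "ratio" in line:
        print(line)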

Keep in mind that Ceph was designed to be used with lots of nodes and OSDs in the same cluster: using just 3 nodes is a somewhat "extreme" case for Ceph, although it works perfectly and is completely supported. Having more nodes would relax the "safe free space" restriction.
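As a rough illustration of why more nodes relax the restriction (same simplistic assumption as above: equal-sized nodes, and one failed node's data must re-fit on the rest):

# Rough illustration: with N equal-sized nodes, one failed node's data
# must fit on the remaining N - 1, so the safe used fraction is (N-1)/N.
for nodes in range(3, 11):
    safe = (nodes - 1) / nodes
    print(f"{nodes} nodes -> safe to fill up to ~{safe:.0%}")
# 3 nodes -> ~67%, 5 nodes -> ~80%, 10 nodes -> ~90%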
 
Is there any source on Ceph overriding the one-copy-per-host rule by default in a 3-node cluster when one node has been down for 600 s?
 
