[SOLVED] Ceph safe nearfull ratio

ph0x

Renowned Member
Jul 5, 2020
Recently, I recommended the Ceph calculator from florian.ca a couple of times, and now that I'm about to set up a cluster myself, I consulted it for a safe cluster size as well.
However, I don't seem to have grasped every detail of the Ceph concepts. I read somewhere that CRUSH does not place the same PG on any given node more than once. With a pool size of 3/2 and 2 OSDs per node, why should I set the nearfull ratio to 0.67 on a three-node cluster, as the calculator suggests?
From what I understand, CRUSH will always place one replica of each PG on each node and will not redistribute them if a node fails, because the PGs would then end up on a node twice. So there should be no need to reserve a third of the cluster's space for the OSDs of a failed node, should there?
Is the above true, or is there a flaw in my reasoning?

Regards,
Marco
 
Ceph will, by default, try to keep as much redundancy as possible in order to maintain as many copies of the data as configured in the pool (typically 3), even if at some point it has to go beyond the "1 PG per host" rule.

If an OSD stops responding, it will be marked "down"; Ceph then waits up to "mon osd down out interval" (600 seconds by default) before also marking it "out". Once an OSD is "down" and "out", its PGs will be remapped to other OSDs.

Say you have just one pool with 3 replicas and a 3-node cluster where every node has the same disk capacity. If a node fails for longer than 600 seconds, all of its OSDs will be marked "down" and "out", forcing a remap of their PGs to the OSDs of the remaining 2 nodes, even if that overrides the "only one copy on every node" rule. For that remap to fit on your remaining OSDs, you need at least 34% of free space in your cluster, or the remaining OSDs will become full. Avoid full OSDs at *all* costs: you won't be able to write to your Ceph cluster, which effectively disrupts your VMs' services and, in certain cases, risks VM filesystem damage. Recovery from full OSDs can also be painful and slow. If you have nodes with more/less disk space, think about what will happen if that node fails and adapt your free space accordingly.
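To make that free-space arithmetic concrete, here is a minimal sketch of the calculation, assuming the same logic the florian.ca calculator appears to use (everything stored must still fit after the largest node is lost); the helper name and the node sizes are purely illustrative:

# A minimal sketch (hypothetical helper, not taken from the calculator):
# how full may the raw cluster get so that the data held by the largest
# node still fits on the remaining nodes after that node fails?
def safe_used_ratio(node_capacities_tb):
    total = sum(node_capacities_tb)
    largest = max(node_capacities_tb)
    # Everything must still fit once the largest node is gone.
    return (total - largest) / total

# Three equal nodes, e.g. 2 x 4 TB OSDs each = 8 TB per node:
print(round(safe_used_ratio([8, 8, 8]), 2))   # 0.67 -> keep ~34% free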

Having a nearfull ratio of 0.67 will raise a warning once you have used 67% of your disk space, helping you to take action and avoid those risks.
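If you decide to adjust that threshold, something along these lines should work. This is only a sketch that shells out to the standard ceph CLI on a node with a working admin client, so double-check the reported ratios against your own cluster:

import subprocess

# Sketch: set the nearfull warning threshold to 0.67 and read the
# cluster-wide ratios back ("ceph osd dump" lists full_ratio,
# backfillfull_ratio and nearfull_ratio).
subprocess.run(["ceph", "osd", "set-nearfull-ratio", "0.67"], check=True)

dump = subprocess.run(["ceph", "osd", "dump"], check=True,
                      capture_output=True, text=True).stdout
for line in dump.splitlines():
    if "ratio" in line:
        print(line)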

Keep in mind that Ceph was designed to be used with lots of nodes and OSDs in the same cluster: using just 3 nodes is a somewhat "extreme" case for Ceph, although it works perfectly and is completely supported. Having more nodes would relax the "safe free space" restriction.
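As a rough illustration of why more nodes relax the restriction (same simplistic assumption as above: equal-sized nodes, and one failed node's data must re-fit on the rest):

# Rough illustration: with N equal-sized nodes, one failed node's data
# must fit on the remaining N - 1, so the safe used fraction is (N-1)/N.
for nodes in range(3, 11):
    safe = (nodes - 1) / nodes
    print(f"{nodes} nodes -> safe to fill up to ~{safe:.0%}")
# 3 nodes -> ~67%, 5 nodes -> ~80%, 10 nodes -> ~90%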
 
Is there any source on Ceph overriding the one-copy-per-host rule by default in a 3-node cluster when one node has been down for 600 s?
 
