Okay, I'm worried about you selling this to your customer without fully understanding how Ceph works, so let's try to clarify some points. Every value mentioned below is a default.
You have 3 servers with 6 x 3.2TB OSDs each and a size 3, min_size 2 pool. The default replicated_rule forces the copies to be placed on OSDs of different hosts. With 3 hosts, each PG (PG = placement group, the group of objects Ceph uses to distribute data across the cluster) will have one copy on an OSD of each host: every host will hold a copy of every PG.
Your max available space will be 18 x 3.2TB / size 3 = 19.2TB. Ceph stops accepting writes at 95% full (mon_osd_full_ratio), stops backfill/recovery ops at 90% full (osd_backfill_full_ratio) and warns you at 85% full (mon_osd_nearfull_ratio). Given this, you should plan for around 80% at most, so ~15.4TB of usable capacity.
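If you want to double-check those thresholds and the raw/usable numbers on a live cluster, something along these lines should work (exact output layout varies a bit between Ceph releases):
  # Show the currently configured full / backfillfull / nearfull ratios
  ceph osd dump | grep ratio
  # Raw capacity plus the per-pool MAX AVAIL estimate (already divided by size)
  ceph df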
Now, you want to survive some OSD failures. In Ceph, how and when a drive fails matters, as does the amount of data in the cluster.
By default, Ceph will mark OUT an OSD that has been DOWN for more than 10 minutes (mon_osd_down_out_interval). At that point, Ceph will start recreating the copies that were on the failed OSD on the remaining drives of the host where the OSD failed. For this to happen, you need enough available space on those remaining OSDs, or you'll end up filling your drives too much and hitting the 85/90/95 ratios mentioned above.
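On reasonably recent releases you can check that timer from the CLI; take this as a sketch, older versions may need the option set in ceph.conf instead of the config database:
  # Current DOWN -> OUT grace period, in seconds (default 600 = 10 minutes)
  ceph config get mon mon_osd_down_out_interval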
In your case, if you want to fully survive 1 OSD failure on each host and still allow Ceph to self-heal, your max usable capacity would be ~15.4TB - 3.2TB = ~12.2TB. Keep subtracting 3.2TB of usable capacity for each OSD drive that you want to allow to fail.
Let's talk about the "how and when" part: imagine that 1 OSD in each host fails at the same time. It can easily happen that some PG had its 3 copies on those very 3 OSDs: that PG would become inactive and won't be available (thus the VMs will probably hang/panic/bluescreen), because Ceph never had a chance to rebuild the copies for the affected PGs. You can use ceph pg dump to see how your PGs are distributed among the OSDs. If you manage to bring at least one of those OSDs back, the PGs will become active again and Ceph will recreate the copies on other OSDs, although writes to PGs that still have only one copy may block until at least a second replica exists (min_size 2), which would affect the performance of the VMs.
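As a sketch of how to inspect that placement (the PG ID 2.1a and osd.3 below are made-up examples, substitute your own):
  # Brief per-PG view: state plus the up/acting OSD sets
  ceph pg dump pgs_brief
  # Which OSDs hold a specific PG
  ceph pg map 2.1a
  # All PGs that have a copy on a given OSD
  ceph pg ls-by-osd osd.3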
If the OSDs fail at different points in time, Ceph will hopefully have enough time to recreate the copies on other OSDs, thus keeping full data availability and redundancy at the cost of available space in the cluster.
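While that rebuild is running you can keep an eye on it with the standard status commands:
  # Overall health plus recovery/backfill progress
  ceph -s
  # More detail on degraded / undersized PGs
  ceph health detail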
As for the "amount of data in the cluster" part: provided that your data is small enough and Ceph has had enough time to backfill/recover, you could end up losing all OSDs except two on different hosts without losing any data and still be able to read/write it:
- Say you have 2TB of data: you would need at the very least 1 x 3.2TB OSD in each of two hosts (if either of those OSDs then fails, you drop below min_size and I/O to the affected PGs blocks).
- Say you have 5TB of data: you would need at the very least 2 x 3.2TB OSDs in each of two hosts (again, losing any of them drops the affected PGs below min_size).
You should try to keep roughly the same amount of available space on each of the 3 hosts at all times to avoid full OSDs: if you have 5TB of data with 2 x 3.2TB OSDs in 2 of your hosts, but the third host has just one 3.2TB OSD, that OSD will become full while the OSDs of the other hosts still have available space.
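A quick way to watch that balance is the per-OSD utilisation view, which groups OSDs by host:
  # Per-host / per-OSD size, raw use and %USE
  ceph osd df tree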
Avoid full OSDs at all costs: it can get really tricky to recover a full Ceph cluster without deleting data and/or adding OSDs.
Hope this helps. I really encourage you to create test clusters as VMs on a PVE host and practice all these situations.