Okay, I'm worried about you selling this to your customer without fully understanding how Ceph works, so let's try to clarify some points. Every value mentioned below is a default.
You have 3 servers with 6 x 3.2TB OSDs each and a size 3, min_size 2 pool. The default replicated_rule forces the copies to be placed on OSDs of different hosts. With 3 hosts, each PG (PG = placement group, the group of objects Ceph uses to distribute data across the cluster) will have one copy on an OSD of each host: every host will hold a copy of every PG.
Your max available space will be 18 x 3.2TB / size 3 = 19.2TB. Ceph stops accepting writes at 95% full (mon_osd_full_ratio), stops backfill/recovery ops at 90% full (osd_backfill_full_ratio) and warns you at 85% full (mon_osd_nearfull_ratio). Given this, you should plan for around 80% at most, so ~15.4TB of usable capacity.
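If you want to double-check those thresholds and the raw/usable numbers on a live cluster, something along these lines should work (exact output layout varies a bit between Ceph releases):
  # Show the currently configured full / backfillfull / nearfull ratios
  ceph osd dump | grep ratio
  # Raw capacity plus the per-pool MAX AVAIL estimate (already divided by size)
  ceph df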
Now, you want to survive some OSD failures. In Ceph, how and when a drive fails matters, as does the amount of data in the cluster.
By default, Ceph will mark OUT an OSD that has been DOWN for more than 10 minutes (mon_osd_down_out_interval). At that point, Ceph will start recreating the copies that were on the failed OSD on the remaining drives of the host where the OSD failed. For this to happen, you need enough available space on those remaining OSDs, or you'll end up filling your drives too much and hitting the 85/90/95 ratios mentioned above.
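On reasonably recent releases you can check that timer from the CLI; take this as a sketch, older versions may need the option set in ceph.conf instead of the config database:
  # Current DOWN -> OUT grace period, in seconds (default 600 = 10 minutes)
  ceph config get mon mon_osd_down_out_interval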
In your case, if you want to fully survive 1 OSD failure on each host and still allow Ceph to self-heal, your max usable capacity would be ~15.4TB - 3.2TB = ~12.2TB. Keep subtracting 3.2TB of usable capacity for each OSD drive that you want to allow to fail.
Let's talk about the "how and when" part: imagine that 1 OSD in each host fails at the same time. It can easily happen that some PG had its 3 copies on those very 3 OSDs: that PG would become inactive and won't be available (thus the VMs will probably hang/panic/bluescreen), because Ceph never had a chance to rebuild the copies for the affected PGs. You can use ceph pg dump to see how your PGs are distributed among the OSDs. If you manage to bring at least one of those OSDs back, the PGs will become active again and Ceph will recreate the copies on other OSDs, although writes to PGs that still have only one copy may block until at least a second replica exists (min_size 2), which would affect the performance of the VMs.
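As a sketch of how to inspect that placement (the PG ID 2.1a and osd.3 below are made-up examples, substitute your own):
  # Brief per-PG view: state plus the up/acting OSD sets
  ceph pg dump pgs_brief
  # Which OSDs hold a specific PG
  ceph pg map 2.1a
  # All PGs that have a copy on a given OSD
  ceph pg ls-by-osd osd.3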
If the OSDs fail at different points in time, Ceph will hopefully have enough time to recreate the copies on other OSDs, thus keeping full data availability and redundancy at the cost of available space in the cluster.
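While that rebuild is running you can keep an eye on it with the standard status commands:
  # Overall health plus recovery/backfill progress
  ceph -s
  # More detail on degraded / undersized PGs
  ceph health detail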
As for the "amount of data in the cluster" part: provided that your data is small enough and Ceph has had enough time to backfill/recover, you could end up losing all OSDs except two on different hosts without losing any data and still be able to read/write it:
- Say you have 2TB of data: you would need at the very least 1 x 3.2TB OSD in each of two hosts (if either of those OSDs then fails, you drop below min_size and I/O to the affected PGs blocks).
- Say you have 5TB of data: you would need at the very least 2 x 3.2TB OSDs in each of two hosts (again, losing any of them drops the affected PGs below min_size).
You should try to keep roughly the same amount of available space on each of the 3 hosts at all times to avoid full OSDs: if you have 5TB of data with 2 x 3.2TB OSDs in 2 of your hosts, but the third host has just one 3.2TB OSD, that OSD will become full while the OSDs of the other hosts still have available space.
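A quick way to watch that balance is the per-OSD utilisation view, which groups OSDs by host:
  # Per-host / per-OSD size, raw use and %USE
  ceph osd df tree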
Avoid full OSDs at all costs: it can get really tricky to recover a full Ceph cluster without deleting data and/or adding OSDs.
Hope this helps. I really encourage you to create test clusters as VMs on a PVE host and practice all these situations.