Ceph question: does "Size" refer only to OSDs by default, or to hosts as well?

Jan 16, 2022
I read that Ceph can enforce having copies on different hosts, not only on different OSDs.

Is this rule the default in Proxmox? And does the "Size" number refer to OSDs, hosts, or both?

Or is it a rule we need to add to the config file?
 
The size parameter defines how many replicas of each object (or rather of each PG, into which objects are grouped) should be stored in the cluster.

The CRUSH rule in use then defines how the replicas are distributed in the cluster. The default replicated_rule has the failure domain set to "host".

This should work fine in most situations. If you have different types of OSDs, for example HDDs and SSDs, you can create rules that also target a specific device class. See https://docs.ceph.com/en/latest/rados/operations/crush-map/#device-classes
For example:
Code:
ceph osd crush rule create-replicated replicated_hdd default host hdd
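To check which rule a pool currently uses and what its size is, you can query the pool and, if needed, switch it over to the new rule. The pool name my_pool below is just a placeholder:
Code:
ceph osd pool get my_pool size        # number of replicas
ceph osd pool get my_pool crush_rule  # rule currently assigned to the pool
ceph osd pool set my_pool crush_rule replicated_hdd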
 
Thanks for the explanations.

Does a size of 3 mean 3 copies plus the active data, or a total of 3?

So if I have a pool on 4 OSDs configured with size 3, does a copy exist on all 4 or on 3 of the 4?

Sorry in advance
 
Let's stay with host as the failure domain, as that is the usual use-case :). You should be able to transfer the explanation to a situation where the failure domain is OSD.

Size means the total number of replicas, so size=3 is 3 copies in total, the primary included. If you have a cluster of 3 nodes with size=3, then each host will store one replica. That means, if a node goes down, you have 2 out of 3 replicas available. Still enough to stay operational, but not with full redundancy. In the Ceph status page, you should see that all PGs are in the "active+undersized" state.

If you have a 4 node cluster, the 3 replicas are spread across them all. This is how you can achieve more space in a cluster by adding more nodes.
If one of the 4 nodes is down, you should see roughly 3/4 of PGs being in "active+undersized" state, and about 1/4 perfectly fine in "active" state. The latter PGs have all their replicas on the 3 nodes that are still running.

In such a situation, Ceph will wait 10 minutes for the node to come back. If the OSDs are not back up and running within those 10 minutes, they will be marked as "out". Ceph will then recreate the lost replicas on the remaining nodes, because it is still able to keep 3 replicas while adhering to the host failure domain, which implies that only one replica can be stored per host.
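If you want to watch this happen, the PG states and the mark-out timer can be inspected from the shell. As a small sketch (the 10-minute window corresponds to Ceph's mon_osd_down_out_interval option, 600 seconds by default; setting the noout flag before planned maintenance skips the automatic mark-out):
Code:
ceph -s                                        # overall health and PG state summary
ceph pg stat                                   # counts of active / active+undersized PGs
ceph config get mon mon_osd_down_out_interval  # default: 600 (seconds)
ceph osd set noout                             # before planned maintenance
ceph osd unset noout                           # afterwards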
 
From my reading, erasure coding seems interesting and preserves more usable storage. Do you have any advice on using it with Proxmox?
Otherwise, a 3/2 pool will always cost around 60-70% of the raw storage no matter the size of the cluster, per my calculation. Am I wrong?
I also know a lot of people say don't use 2/2, but isn't it the same as people who have been running RAID 10 units for years? That's why we keep backups anyway, in case of a double failure during a rebuild, no?
So why is it so often advertised as a dangerous approach?
 
EC can indeed give you a much better ratio of usable space to overall space.

Again, with host as the failure domain in mind: to really make use of EC, having 5 or more nodes is where it gets interesting. Please check out the docs: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pve_ceph_ec_pools

EC pools are less flexible. If you realize that you need to change something, most likely you need to create a new EC pool and move your data over to it.
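For the space math: replication with size=3 leaves about 1/3 of the raw capacity usable, while an EC profile such as k=3, m=2 stores 5 chunks per 3 data chunks, i.e. roughly 60% usable (and needs at least 5 hosts with a host failure domain). As a sketch of creating such a pool with the pveceph tooling described in the linked admin guide (the pool name is a placeholder; check the guide for the exact options available in your version):
Code:
pveceph pool create my_ec_pool --erasure-coding k=3,m=2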

I also know a lot of people say don't use 2/2, but isn't it the same as people who have been running RAID 10 units for years? That's why we keep backups anyway, in case of a double failure during a rebuild, no?
In a RAID 10 you can lose one or more disks (if the correct disks fail) and it will stay operational. If you run a replicated pool with 2/2, you will not sustain data loss if a disk fails, but the pool will be IO-blocked, and therefore not operational, for as long as it takes to recover the lost PGs.

Running a 2/1 replicated pool increases your chances for data loss significantly. There are a few presentations from a few years ago by Ceph consultants. The TL;DR is that in most situations where they saw data loss, the admins set min_size to 1, and it wouldn't have happened if it was kept at 2.
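To check what a pool is currently set to (the pool name is again a placeholder), size and min_size can be queried and adjusted; the usual recommendation is size=3 with min_size=2:
Code:
ceph osd pool get my_pool size
ceph osd pool get my_pool min_size
ceph osd pool set my_pool min_size 2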

Just as a hint, also for other people, do not try to compare Ceph too much with classical RAID. The analogies break very quickly :)
 
