ceph pool(s) size/number

PaulVM

Renowned Member
May 24, 2011
I have a 3-node Ceph cluster; each node has 4x 3.84 TB disks to use as OSDs.
The VM images total about 3 TB.
Is it better to create one pool using all 4x 3.84 TB disks on each node (~15 TB per node), or two pools of 2x 3.84 TB each per node (~7.7 TB each)?

Thanks, P.
 
What do you expect by creating multiple pools and assigning them dedicated OSDs (device class) instead of one big pool?
Without assigning the pools dedicated OSDs (giving each group of disks a specific device class and assigning each pool a rule that targets that device class), they will all use the same OSDs.
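For orientation, you can see which device class each OSD currently has, and on which node it sits, in the CRUSH tree:
Code:
ceph osd tree
The CLASS column shows the automatically detected class (nvme/ssd/hdd), or whatever custom class you assign later.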

Are the disks for the OSDs all the same model?
 
What do you expect by creating multiple pools and assigning them dedicated OSDs (device class) instead of one big pool?
I am at my first experiences with Ceph, so I don't know it very well yet.
In my limited experience with standard file systems and/or storage, I usually prefer to have separate, smaller "pools" if possible, so that if something goes wrong it is simpler and faster to recover the one that has the problem.
I don't know whether this also applies to Ceph, or how complex it is to create multiple pools instead of a single big one.

Without assigning the pools dedicated OSDs (giving each group a specific device class and assign the pools a rule that targets the specific device class), they will use the same OSDs.
Is there no simple way to assign, say, 2 disks to one pool and the other 2 to another (on every server, of course)?
Are the disks for the OSDs all the same model?

Every server has 2x 960 GB system disks + 6x 3.84 TB data disks.
They are all the same size/type, but each server has a slightly different mix of models:


PVE01 (All Samsung):

Disk /dev/nvme0n1: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZQL2960HCJR-00A07

Disk /dev/nvme1n1: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZQL2960HCJR-00A07

Disk /dev/nvme2n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme3n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme4n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme5n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme6n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme7n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07


PVE02 (Mixed Samsung and Intel):

Disk /dev/nvme0n1: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZQL2960HCJR-00A07

Disk /dev/nvme1n1: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZQL2960HCJR-00A07

Disk /dev/nvme2n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: INTEL SSDPF2KX038T1O

Disk /dev/nvme3n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: INTEL SSDPF2KX038T1O

Disk /dev/nvme4n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: INTEL SSDPF2KX038T1O

Disk /dev/nvme5n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: INTEL SSDPF2KX038T1O

Disk /dev/nvme6n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme7n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07




PVE03 (Micron for system and Samsung for data):

Disk /dev/nvme0n1: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: Micron_7450_MTFDKCC960TFR

Disk /dev/nvme1n1: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: Micron_7450_MTFDKCC960TFR

Disk /dev/nvme3n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme2n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme4n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme5n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme6n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07

Disk /dev/nvme7n1: 3.49 TiB, 3840755982336 bytes, 7501476528 sectors
Disk model: SAMSUNG MZQL23T8HCLS-00A07


As I said, my idea is to use 4 of the data disks for Ceph on every PVE node (the remaining 2 for a local ZFS).

Thanks, P.
 
Well, Ceph has a few core concepts that are very different compared to traditional RAID-based storage.

A 3-node Ceph cluster is the bare minimum, and at that size a few edge cases need to be considered. The main one, and the reason why you should not split the disks into separate pools, is that if a node or a disk/OSD goes down, there are no other nodes to which Ceph can recover the lost replicas.

With 3 nodes, you have as many nodes as replicas (the size parameter). The result is that each node holds one replica. If a node goes down, the data will show as degraded. Once the node is back up and functional, Ceph can recover and the cluster will be healthy again. If you had more nodes, Ceph could recover the lost data onto the remaining nodes that did not yet hold any of the replicas for the affected PGs (placement groups).
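As a side note, the replica settings of an existing pool can be checked on the CLI (assuming a pool named vm-pool here):
Code:
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size
With the Proxmox VE defaults these are 3 and 2 respectively.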

The edge case now is if only a disk/OSD in a node dies. The node is still working, so Ceph will try to recover the lost replicas within that same node, onto the remaining OSDs. Imagine having two OSDs per node, each 40% full: one disk dies, Ceph recovers that data, and the remaining disk is now 80% full. That's why it is good to have more disks. 4 per node is about the minimum, so that if a disk fails, the data can be recovered onto the remaining 3, reducing the chance of any of them running full.

Ceph can recover from a lot, but running out of physical space is painful!

In general, if you can, give Ceph more but smaller resources (nodes, disks, ...), as that gives it more options to recover to in case one component fails.
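To keep an eye on how full the individual OSDs actually are (also while a recovery is running), you can check the per-OSD usage at any time:
Code:
ceph osd df
The %USE column shows how close each OSD is getting to the nearfull/full ratios.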
 
Thanks for the explanation.
I have obviously read some documentation, so some of these concepts are familiar to me; what I lack is the practical experience.
My focus was on evaluating my specific situation:
3 TB data amount
15 TB disk space (4 x 3.84)
That puts us at about 20% of the physical space used.
With 2 disks/OSDs per node we would be at around 40%, so if an OSD/disk dies there is still enough space to recover onto the remaining OSD/disk (which would then be at about 80%).
And there are still 2 more PVE nodes that hold all the data and have the space available.
My obsession is more psychological than technical: with only 20% of the space used, I know that in the medium term someone will be tempted to use some of this space (backups, tests, spool space, ...). If I can limit it at the source ... :)
Does Ceph distribute the data efficiently between disks (giving a RAID-0-like effect)?
But I suppose the limit will be the network (25 Gbit in my configuration)?
Are the disks OK even if some are from a different vendor?

Thanks, P.
 
Does Ceph distribute the data efficiently between disks (giving a RAID-0-like effect)?
Do not try to compare it to RAID ;)
Ceph is an object store with additional layers like the Ceph FS for a file system and RBD to get block device functionality (disk images).

These objects are then stored multiple times (the size parameter of a pool) across the cluster. The default rule is that there can only be one copy/replica per node, to protect against the loss of a node. To make accounting for the objects less resource intensive, the objects are grouped into placement groups (PGs). The decision about which node (think of clusters with many more nodes than just 3) and which disk in that node a replica should be stored on is made at the PG level.
When a disk / node fails, or when you manually set an OSD to "OUT", the topology changes and Ceph will act on it by either recovering or moving the data somewhere else.
That's why you should use either 1, or at least 4, OSDs per node in a small 3-node cluster (more details in my previous answer).
If you want to get near-full warnings earlier, you can configure your cluster to have a lower limit for it, for example:
Code:
ceph osd set-nearfull-ratio 0.6
The default value is 0.85.
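If you also want to cap how much data a pool can hold at all (to keep the free space from being used for other things), a pool quota is another option. For example, to limit a hypothetical pool named vm-pool to 5 TiB:
Code:
ceph osd pool set-quota vm-pool max_bytes 5497558138880
ceph osd pool get-quota vm-pool
Setting max_bytes back to 0 removes the quota again.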

But I suppose the limit will be the network (25 Gbit in my configuration)?
Possible, but keep in mind that OSDs might also be CPU bound.
Are the disks OK even if some are from a different vendor?
Yes. Depending on the specs, you might see the cluster performance vary a bit. They should be of (roughly) the same size, as larger disks get more data stored on them and therefore more load, which could turn them into bottlenecks.

If you have vastly different disks, you can separate them with device classes (you can define your own names) and create specific rules that only match them. Then you can assign these rules to the pools and therefore have, for example, a fast and a slow pool.
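As a rough sketch of how that looks on the CLI (the class name fastnvme, the rule name fast_rule and the pool name poolA are just placeholders):
Code:
# drop the auto-detected class, then assign a custom one to the chosen OSDs
ceph osd crush rm-device-class osd.2 osd.3
ceph osd crush set-device-class fastnvme osd.2 osd.3
# replicated rule with failure domain "host" that only targets that class
ceph osd crush rule create-replicated fast_rule default host fastnvme
# point an existing pool at the new rule
ceph osd pool set poolA crush_rule fast_rule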
 
