Question about Ceph Node and OSD distribution

nibblerrick

Renowned Member
Aug 20, 2015
Germany
Hi,

my first tests with Ceph are really interesting and it runs great.

What I haven't found out yet: does Ceph consider only the OSDs for distribution, or also the nodes?
Just a theoretical setup to show what I mean:
2 servers with 3x 3 TB HDDs each for Ceph
1 server with 6x 1.5 TB HDDs for Ceph

So if Ceph distributes the data only evenly over the OSDs, I have a bigger problem if the 6-HDD node fails.
If Ceph tries to distribute the data evenly over the nodes instead, it would be safer in this case.
Therefore I am wondering: what is the default behavior?
If I understood correctly, there are many ways to modify the CRUSH map for special behaviors.

Thanks
 
Hi,
Ceph distributes the data with the CRUSH algorithm, and there are a lot of things which can be used to affect which data is written where.
You have a logical structure (datacenter, room, rack, host) and you can choose whether your replicas must be on different OSDs or on different hosts (host is usually the right choice).
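
To give an idea of what that looks like: the default replicated rule in a decompiled CRUSH map usually looks roughly like the sketch below (names and numbers are only typical values, your map may differ). The "step chooseleaf firstn 0 type host" line is what forces replicas onto different hosts; with "type osd" they would only have to land on different OSDs.
Code:
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        # start at the root of the CRUSH hierarchy
        step take default
        # pick one leaf (OSD) per distinct host for each replica
        step chooseleaf firstn 0 type host
        step emit
}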

In your case with 3 servers and a replica count of 3, it doesn't matter which node fails. If you have 4 nodes and one fails, the content of the failed node will be rewritten to the remaining nodes.

For the 3 TB and 1.5 TB disks, this is handled by the weight: every disk has a weight (normally its size in TB), so a 1.5 TB disk has a weight of approx. 1.3 and a 3 TB disk of approx. 2.7 (depending on filesystem and journal size).
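
If a weight ever needs correcting by hand, it can be set per OSD; a small sketch (osd.0 is only a placeholder ID):
Code:
# set the CRUSH weight of osd.0 to roughly its usable size in TB (here a 3 TB disk)
ceph osd crush reweight osd.0 2.7
# show the CRUSH hierarchy with the resulting weights per host and OSD
ceph osd tree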

Udo
 

Hi Udo, thanks for your answer.
The thing with the weight is clear to me.
I think the key point is replication over hosts vs. over OSDs, which wasn't clear to me.
I only thought of OSDs at first. As a more extreme example: with 2 servers with 2x 4 TB each and one server with 16x 0.5 TB (as long as my PGs are suitable), I thought Ceph would place more data on the server with the 16 HDDs for an even distribution. With the host-level rule it makes sense that the data can be distributed safely this way.

On the Proxmox interface I can't remember that there was such an option (I have to set up my test environment with 4b2 again this weekend).
Do you know what the default setting for this is?
And I think I have to modify it manually if I want to change it, right?

Thanks
 
Hi,
Ceph always reads data from the primary OSD of a PG, so if you want to mostly use the 16 HDDs (which makes sense) you can also modify the primary affinity (default 1, a value between 0 and 1), which is available since Ceph 0.9.
E.g. if your 4 TB disks have a primary affinity of 0 (and you have enough space on disks with a primary affinity of 1), all data will be read from the 16-disk node (on a healthy cluster).
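
A minimal sketch of how that could look, with osd.0 and osd.1 standing in for the OSD IDs of the 4 TB disks (adjust to your cluster):
Code:
# depending on the Ceph release, this may first have to be allowed in ceph.conf:
#   mon osd allow primary affinity = true
# make the 4 TB OSDs (placeholder IDs) never act as primary for a PG
ceph osd primary-affinity osd.0 0
ceph osd primary-affinity osd.1 0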
Regarding the default: normally the default settings should be right.

You can take a look at your CRUSH map with the following commands:
Code:
# ceph osd getcrushmap -o crushmap.compiled
got crush map from osdmap epoch 203454
# crushtool -d crushmap.compiled -o crushmap.decompiled
# cat crushmap.decompiled
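
If you ever edit the decompiled map, it can be compiled and loaded back in a similar way; a sketch (try it on a test cluster first):
Code:
# compile the edited map and inject it back into the cluster
crushtool -c crushmap.decompiled -o crushmap.new
ceph osd setcrushmap -i crushmap.new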
Udo
 
There are so many things to learn about Ceph, great!
I will try this out when I get my test cluster ready, thank you.
I hadn't heard of primary affinity yet, I have to try that, too.
 
