Ceph: new rule for HDD storage

CyberGuy

Hi guys,

In my free time I have been thinking about how to extend our cloud and add new storage resources. At the moment our storage is based on Ceph, with SSD-only OSDs.
I have been reading the Ceph docs and I can say it is possible; I even have a plan. The problem is that I have no idea whether the actions I want to take are reasonably safe. I recently took over this kind of work. I read the Red Hat docs, but our configuration is pretty generic and I want to do this the right way. I would be happy if somebody could take a look and comment on it a bit. I am not sure the current configuration is fine to expand the way I want...

My current configuration:

Proxmox version: Virtual Environment 6.4
Ceph version : 15.2.15
Cluster of 8 nodes.


Ceph configuration:
[Global]
cluster_network = XXX.XXX.XXX.XXX/XX
fsid = XXXXXXXXXXXXXXXXXXXX
mon_allow_pool_delete = true
mon_host = XXX.XXX.XXX.XXX XXX.XXX.XXX.XXX XXX.XXX.XXX.XXX XXX.XXX.XXX.XXX
osd_pool_default_min_size = 1
osd_pool_default_size = 2
public_network = XXX.XXX.XXX.XXX/XX


# begin crush map

tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd
device 24 osd.24 class ssd
device 25 osd.25 class ssd
device 26 osd.26 class ssd
device 27 osd.27 class ssd
device 28 osd.28 class ssd
device 29 osd.29 class ssd
device 30 osd.30 class ssd
device 31 osd.31 class ssd
device 32 osd.32 class ssd
device 33 osd.33 class ssd
device 34 osd.34 class ssd
device 35 osd.35 class ssd
device 36 osd.36 class ssd
device 37 osd.37 class ssd
device 38 osd.38 class ssd
device 39 osd.39 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pve1 {
id -3 # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
# weight 7.278
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.819
item osd.1 weight 1.819
item osd.2 weight 1.819
item osd.3 weight 1.819
}
host pve2 {
id -5 # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
# weight 7.278
alg straw2
hash 0 # rjenkins1
item osd.5 weight 1.819
item osd.6 weight 1.819
item osd.7 weight 1.819
item osd.4 weight 1.819
}
host pve3 {
id -7 # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
# weight 7.278
alg straw2
hash 0 # rjenkins1
item osd.8 weight 1.819
item osd.9 weight 1.819
item osd.10 weight 1.819
item osd.11 weight 1.819
}
host pve4 {
id -9 # do not change unnecessarily
id -10 class ssd # do not change unnecessarily
# weight 7.278
alg straw2
hash 0 # rjenkins1
item osd.12 weight 1.819
item osd.13 weight 1.819
item osd.14 weight 1.819
item osd.15 weight 1.819
}
host pve5 {
id -11 # do not change unnecessarily
id -12 class ssd # do not change unnecessarily
# weight 10.916
alg straw2
hash 0 # rjenkins1
item osd.16 weight 1.819
item osd.17 weight 1.819
item osd.18 weight 1.819
item osd.19 weight 1.819
item osd.24 weight 1.819
item osd.25 weight 1.819
}
host pve6 {
id -13 # do not change unnecessarily
id -14 class ssd # do not change unnecessarily
# weight 10.916
alg straw2
hash 0 # rjenkins1
item osd.20 weight 1.819
item osd.21 weight 1.819
item osd.22 weight 1.819
item osd.23 weight 1.819
item osd.26 weight 1.819
item osd.27 weight 1.819
}
host pve7 {
id -15 # do not change unnecessarily
id -16 class ssd # do not change unnecessarily
# weight 10.916
alg straw2
hash 0 # rjenkins1
item osd.28 weight 1.819
item osd.29 weight 1.819
item osd.30 weight 1.819
item osd.31 weight 1.819
item osd.32 weight 1.819
item osd.33 weight 1.819
}
host pve8 {
id -17 # do not change unnecessarily
id -18 class ssd # do not change unnecessarily
# weight 10.916
alg straw2
hash 0 # rjenkins1
item osd.34 weight 1.819
item osd.35 weight 1.819
item osd.36 weight 1.819
item osd.37 weight 1.819
item osd.38 weight 1.819
item osd.39 weight 1.819
}
root default {
id -1 # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
# weight 72.776
alg straw2
hash 0 # rjenkins1
item pve1 weight 7.278
item pve2 weight 7.278
item pve3 weight 7.278
item pve4 weight 7.278
item pve5 weight 10.916
item pve6 weight 10.916
item pve7 weight 10.916
item pve8 weight 10.916
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

8 nodes:
4 nodes have 6x 2 TB SATA SSDs
4 nodes have 4x 2 TB SATA SSDs


I already have one pool using the replicated_rule. The problem is that I am not sure whether this rule is global, applying not only to the SSDs but to anything else as well, and how to make it work with the new nodes... I want to create a new pool from the new nodes.

The new nodes will be:
3 nodes with 10x 6 TB HDDs each.

I want to create a new pool only for the HDD OSDs. I believe this will require some changes:
rule: replicated_rule_hdd_6TB - will contain only the pool and OSDs from the 3 new nodes.
root: hdd_6TB - will contain only those 3 nodes.
new pool: cephfs_hdd_6tb - will use the new rule.
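
Roughly, I think the commands for that plan would look something like this (the failure domain "host" and the device class "hdd" are my assumptions, the names are the ones above):
Code:
# create a separate root for the new HDD nodes
ceph osd crush add-bucket hdd_6TB root
# the new hosts would then be moved under that root, e.g.:
# ceph osd crush move <new-host> root=hdd_6TB
# replicated rule that only picks hosts under that root, limited to the hdd device class
ceph osd crush rule create-replicated replicated_rule_hdd_6TB hdd_6TB host hdd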

As the PGs are not autoscaled at the moment, should I enable the autoscaler for the new pool or increase the PG count manually when adding the new OSDs?
If my counting is right: I want to have 2 replicas on the new nodes.
It would be: 1024 = (30 OSDs * 100 target PGs per OSD) / 2 replicas. Strictly that comes out to 1500, but the recommended value is 1024.

Problems and considerations I have:
I do not want to touch the current configuration, and I want to make sure the change is safe.
I did not find any information about it, but will live migration still work?
*Important - I would like to add 2 more SSDs per node to the HDD nodes. If I have 10x 6 TB HDDs and 2 spare slots (bays) in each machine, what would be the best SSD size for the Ceph journal (DB/WAL), and how do I make it work with the current configuration?
** Adding a new rule and pool: I believe adding new OSDs won't happen via the web interface, and adding them, for example in the future, will be done only via the CLI?


If you guys have some comments, that would be great ;)
 
First regarding the rules and device types:

If you have a mixed set of device classes, in this case SSD and HDD, you will have to create rules that target them specifically. The default replicated rule does not distinguish between device classes.

To create a new rule, you can run the following:
Code:
ceph osd crush rule create-replicated <rule name> <root> <failure domain> <device class>
For example for SSDs:
Code:
ceph osd crush rule create-replicated replicated_ssd default host ssd

The same goes for the HDDs. Then you need to assign those rules to your pools. If necessary, Ceph will start to shuffle data around until the rules are satisfied. You can set the rule for a pool in the GUI if you run a recent PVE 6.4 or newer. Otherwise you need to assign the rules on the CLI, which is also explained in the Ceph docs.
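
In case it helps, a minimal sketch of the CLI variant, with the pool name as a placeholder:
Code:
# point an existing pool at the new device-class rule
ceph osd pool set <your-ssd-pool> crush_rule replicated_ssd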

With this you will then have a "slow" HDD pool and a "fast" SSD pool, and the associated storages within PVE. With that you can then decide on which storage, and therefore pool, you want to place VM disks.

It would be: 1024 = (30 OSDs * 100 target PGs per OSD) / 2 replicas. Strictly that comes out to 1500, but the recommended value is 1024.
The pg_num needs to be a power of 2, therefore it will be rounded down or up to the nearest one.
From the PG Calculator: [screenshot of the suggested pg_num omitted]

Also, please do not use a size of 2! Run your pools with a size/min_size of 3/2 and size them accordingly. If you run them with 2/2, the pool will become read-only as soon as one PG has lost a replica, and if you run them with 2/1 you are risking data inconsistency and potential data corruption!
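
On the CLI that would be, for example (pool name is a placeholder):
Code:
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2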


I did not find any information about it, but will live migration still work?
You mean live migration of VMs between nodes? Should work just fine.
** Adding a new rule and pool: I believe adding new OSDs won't happen via the web interface, and adding them, for example in the future, will be done only via the CLI?
You can add new OSDs via the GUI (make sure to select the right node and then the Ceph OSD panel to create them on that node) or via the CLI. Have a look at the pveceph osd create command (manual page).
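
For example, roughly (the device path is just a placeholder for the new disk):
Code:
# run on the node that holds the disk
pveceph osd create /dev/sdX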

*Important - I would like to add 2 more SSDs per node to the HDD nodes. If I have 10x 6 TB HDDs and 2 spare slots (bays) in each machine, what would be the best SSD size for the Ceph journal (DB/WAL), and how do I make it work with the current configuration?
If you do not specify the size and don't have anything set in your ceph.conf file, it will end up being 10% of the OSD size by default: https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_osd_create
If you know how many OSDs you want to place on the SSD as DB/WAL device, divide the SSD size by the number of OSDs, with a bit of space left over as spare.
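
A hedged sketch, assuming you put 5 of the 10 HDD OSDs on each of the two SSDs (device paths are placeholders; an explicit size can also be passed, see the pveceph man page, or it is derived from bluestore_block_db_size in ceph.conf / the 10% default):
Code:
# run on the HDD node; the OSD's DB/WAL is placed on the SSD
pveceph osd create /dev/sdX -db_dev /dev/sdY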

Keep in mind though, that should the SSD fail, all the OSDs using it as DB/WAL device will also fail and need to be recreated.
 
Thank you aaron, you have been helping me a lot...

Well, that's what I expected.

I will need to consider the scenarios and come back to you. I am not experienced enough and the pool is in production, so I think the best thing is to create the rules now, before adding the new nodes. Sorry to ask, but I could not find anything about it... Regarding changing the rule for the SSDs, do you have some real-life example of adding a new rule, changing rules, and the outcome on live production?

I want to make sure the data is fine, but I am a bit frustrated because if I increase replication from 2 -> 3 I will run out of storage... The problem is it was not a good design from the beginning, and I am left with a situation which may cause broken data... I am always wondering about the worst-case scenario. We do have backups, but I wanted to avoid any loss.

Thank you again Aaron.
 
The problem is it was not a good design from the beginning, and I am left with a situation which may cause broken data...
Yep, that is true. You can create as many rules as you like; as long as they are not assigned to a pool, nothing will happen. Since right now all you have are SSDs, nothing should happen if you create a dedicated SSD rule and assign it to the pool which you currently have.

Is your plan to move some VM disks from the SSDs to the HDDs? If that is the case, I would first create the two rules for HDD and SSD devices. Then assign the SSD rule to the current pool, which should stay on the SSDs. Then add the new nodes with the HDDs, create the OSDs, and make sure that they are detected correctly as HDD (or set the class manually). Then create the new (HDD) pool. Once you have that as a storage, you can start to use "Move Disk" to move VM disks over to the other pool, making more space on the SSD pool. Should you be able to free up enough space, you can then consider setting the size/min_size of the SSD pool to 3/2.
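
A rough sketch of that order of operations (the current pool name is a placeholder, the rest uses the names from above):
Code:
# 1. create both device-class rules while everything is still SSD-only
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd crush rule create-replicated replicated_hdd default host hdd
# 2. pin the existing pool to the SSD rule
ceph osd pool set <your-current-pool> crush_rule replicated_ssd
# 3. once the HDD OSDs are in and the new pool exists (GUI or pveceph pool create),
#    assign the HDD rule to it
ceph osd pool set cephfs_hdd_6tb crush_rule replicated_hdd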

If you keep using the default replicated_rule, then the HDD OSDs will also be considered valid OSDs for the current pool. You will see a rebalance happening as Ceph distributes the current pool across the newly added OSDs.
You will have somewhat of a mixed bag regarding performance, depending on where the data resides (HDD or SSD).
 
Yes Aaron, this is exactly what I want to do, so I understood the problem pretty well :)

You helped me confirm it. Wish you all the best in the new year :)

I will try to start implementing it when I have the new nodes in place.
 
Hi Aaron,

2 problems which require more experienced help; it seems the previous guy who configured it did not think it all through...

What happens when I enable this on live production? A silly question, but it bothers me: switching the PG Autoscale Mode from off to on.

services:
mon: 6 daemons, quorum pve1,pve2,pve3,pve4,pve5,pve6 (age 3w)
mgr: pve2(active, since 3M), standbys: pve3, pve1, pve4, pve5, pve6
mds: cephfs:1 {0=pve3=up:active} 5 up:standby
osd: 40 osds: 40 up (since 12d), 40 in (since 12d)

task status:

data:
pools: 4 pools, 577 pgs
objects: 4.95M objects, 19 TiB
usage: 37 TiB used, 36 TiB / 73 TiB avail
pgs: 577 active+clean



What can go wrong with it? It is just eating me up. Then we have the second thing, the replicated_rule... which is set up with 2/1. If someone can help, I would like to find out what can be done; at the moment I'm afraid it will all go down...
 
You can also set the autoscaler to warn first. With that it will tell you what it thinks is best, but won't change the pg_num.
If you set it to on, it will also only start to act if the difference between current and optimal number of PGs is off by a factor of 3. If it is a factor of 2, like in this case, you will have to change it manually.
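
For example (pool name is a placeholder):
Code:
# only warn, do not change pg_num automatically
ceph osd pool set <pool> pg_autoscale_mode warn
# show the current vs. suggested pg_num per pool
ceph osd pool autoscale-status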
 
Hmm, but what about changing the PGs - what will the process do when I have 2/1? Will the 36 TB of data be split into smaller chunks to save space? Meaning, if something goes wrong, is the live production storage basically lost? Sorry, it's just my caution. I know I have a strange setup.

I have around 300 VMs in production which use this storage, and I can't allow anything bad to happen during the change. But maybe I am just overreacting...

Patryk
 
If you change the number of PGs for a pool, Ceph will recalculate which objects belong to which PG and how those PGs will be distributed among the OSDs of the cluster. It can result in a more even space usage across the OSDs, as more PGs means smaller PGs, reducing possible imbalances between very large and small PGs.

The redistribution of the PGs can result in OSDs becoming emptier or fuller in the process. If you have OSDs that are already quite full, it might push them closer to being full. The pool will keep working during the whole procedure.

If you are interested in more details, check out this chapter in the Ceph docs and the chapters that follow it: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#how-are-placement-groups-used

You don't have to change the pg_num right away. Depending on how full the OSDs are, I would wait until the HDD pool is implemented and you have made space on the SSD pool.
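
When you do change it, a minimal sketch (pool name and target value are placeholders):
Code:
ceph osd pool set <pool> pg_num 1024
# watch the rebalance and the OSD fill levels while it runs
ceph -s
ceph osd df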
 
Yes Aaron, now I understand it... Thank you. I will read more about Ceph, and the idea is not to touch the current Ceph storage at all... We will create a second cluster with HDDs and set up the Ceph configuration properly from the start, creating rules, PGs etc. before putting it into use.

And on a new Proxmox version: migrate clients to the new cluster and then add the current machines to the new cluster... Well, I expected that the change would be easy, but we have too much data and the risk is too big to do it now...
 
