matheus.made

Hello everyone!

I have a question regarding Ceph on Proxmox. I have a Ceph cluster in production and would like to rebalance my OSDs, since some of them are reaching 90% usage.

[screenshot: OSD usage]

My pool was manually set to 512 PGs with the PG Autoscale option OFF, and I have now changed it to PG Autoscale ON.

[screenshot: pool settings]
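For reference, a minimal sketch of how the autoscaler can also be checked and enabled from the CLI; the pool name rbd-pool is only a placeholder for the actual pool:

Code:
# Show the autoscaler's view of each pool (current PGs, suggested PGs, mode)
ceph osd pool autoscale-status

# Enable the autoscaler for a single pool ("rbd-pool" is a placeholder name)
ceph osd pool set rbd-pool pg_autoscale_mode on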

I ran ceph config set global osd_mclock_profile high_client_ops so that the rebalancing can happen in production without impacting the VMs I have running.
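As a side note, the high_client_ops profile shifts the OSDs' IOPS budget towards client I/O and away from recovery/backfill, so the rebalance itself will run more slowly. One way to confirm the setting actually took effect (osd.0 is just an example daemon id):

Code:
# Show the mclock profile as seen by a running OSD (osd.0 is an example id)
ceph config show osd.0 osd_mclock_profile

# Or list everything mclock-related stored in the config database
ceph config dump | grep mclock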

And here are my OSDs:
[screenshot: OSD list]
This is what ceph balancer status returns:

[screenshot: ceph balancer status output]
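If the balancer reports itself as inactive or in a mode other than upmap, it can be enabled like this; whether it helps depends on the topology discussed below, so treat it as an optional check rather than a guaranteed fix (upmap mode requires all clients to be Luminous or newer):

Code:
# Switch to the upmap balancer and activate it
ceph balancer mode upmap
ceph balancer on

# Re-check what it is doing
ceph balancer status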

However, even after making this change, the autoscaling didn't start automatically. Proxmox suggests that the optimal number of PGs is 256. If I change the quantity manually, would that force the process to start?

Is there anything else I need to do to initiate the process?
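For completeness, this is roughly what a manual change would look like; the pool name is a placeholder, and recent Ceph releases merge the PGs gradually in the background rather than all at once:

Code:
# Reduce the PG count manually ("rbd-pool" is a placeholder pool name);
# pgp_num follows automatically and the change is applied step by step
ceph osd pool set rbd-pool pg_num 256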
 
It looks like your crush map has a bucket level between the root and the hosts. I assume datacenter?

And it looks like you have a crush rule that replicates across datacenters, correct?

This is why the first and last hosts have nearfull OSDs while the two in the middle are only filled to about 50%: each datacenter stores one full replica, but the second datacenter has double the capacity available.

You cannot fix this with reweighting. You need to add capacity to your first and third datacenters.
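A quick way to see that imbalance per bucket is the tree view of OSD utilization, which aggregates usage per datacenter and host as well as per OSD:

Code:
# Utilization broken down along the CRUSH hierarchy
ceph osd df tree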
 
Oh, I see, that makes sense.

I was already thinking about adding capacity, but I was going to add an equal amount on every node. It's good to know that these buckets influence the distribution.

Thanks for the help!
 
If you share the CRUSH rule you use (node->Ceph->Configuration, right side is the CRUSH map and the rules are at the bottom), we can see what is actually happening.

Right now, with the information available (blacking out bucket names and hostnames makes this harder), it looks like you have that one additional layer, most likely room or datacenter, and a rule that places one replica into each room/datacenter.
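For anyone following along, the CRUSH map and rules can also be pulled on the CLI; the file names crushmap.bin and crushmap.txt are just examples:

Code:
# Dump the compiled CRUSH map and decompile it into readable text
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Or dump only the rules as JSON
ceph osd crush rule dump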
 
Here it is:

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host Server01 {
    id -13 # do not change unnecessarily
    id -14 class ssd # do not change unnecessarily
    # weight 6.98639
    alg straw2
    hash 0 # rjenkins1
    item osd.6 weight 1.74660
    item osd.7 weight 1.74660
    item osd.8 weight 1.74660
    item osd.10 weight 1.74660
}
datacenter dc-01 {
    id -5 # do not change unnecessarily
    id -2 class ssd # do not change unnecessarily
    # weight 6.98639
    alg straw2
    hash 0 # rjenkins1
    item Server01 weight 6.98639
}
host Server02 {
    id -9 # do not change unnecessarily
    id -10 class ssd # do not change unnecessarily
    # weight 6.98639
    alg straw2
    hash 0 # rjenkins1
    item osd.0 weight 1.74660
    item osd.1 weight 1.74660
    item osd.2 weight 1.74660
    item osd.9 weight 1.74660
}
host Server03 {
    id -15 # do not change unnecessarily
    id -16 class ssd # do not change unnecessarily
    # weight 6.98639
    alg straw2
    hash 0 # rjenkins1
    item osd.12 weight 1.74660
    item osd.13 weight 1.74660
    item osd.14 weight 1.74660
    item osd.15 weight 1.74660
}
datacenter dc-02 {
    id -3 # do not change unnecessarily
    id -6 class ssd # do not change unnecessarily
    # weight 13.97278
    alg straw2
    hash 0 # rjenkins1
    item Server02 weight 6.98639
    item Server03 weight 6.98639
}
host Server04 {
    id -11 # do not change unnecessarily
    id -12 class ssd # do not change unnecessarily
    # weight 7.05919
    alg straw2
    hash 0 # rjenkins1
    item osd.4 weight 1.74660
    item osd.11 weight 1.81940
    item osd.3 weight 1.74660
    item osd.5 weight 1.74660
}
datacenter dc-03 {
    id -4 # do not change unnecessarily
    id -7 class ssd # do not change unnecessarily
    # weight 7.05919
    alg straw2
    hash 0 # rjenkins1
    item Server04 weight 7.05919
}
root default {
    id -1 # do not change unnecessarily
    id -8 class ssd # do not change unnecessarily
    # weight 21.03197
    alg straw2
    hash 0 # rjenkins1
    item dc-01 weight 6.98639
    item dc-02 weight 6.98639
    item dc-03 weight 7.05919
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule REPLICA-DCS {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}

# end crush map
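As a side note, with the decompiled map above saved to a file (crushmap.txt is an example name), crushtool can simulate which OSDs a rule would select without touching the cluster, which makes the placement across the three datacenters visible:

Code:
# Recompile the text map and simulate placements for rule id 1 (REPLICA-DCS)
# with 3 replicas; file names are example placeholders
crushtool -c crushmap.txt -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings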
 

Okay, pretty much as expected.

We have the following rule documented:
Code:
rule replicate_3rooms {
    id {RULE ID}
    type replicated
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 1 type host
    step emit
}

The bucket type itself does not matter. The main difference is here:
Code:
    step choose firstn 0 type room
    step chooseleaf firstn 1 type host
This makes sure that, should two replicas end up in the same room/datacenter for whatever reason, they are placed on different hosts.
Something you could consider. Though if you change the rule, chances are high that you'll see a rebalance, since the resulting placements will likely change. Therefore, get the full OSDs solved first :)
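If you do decide to adjust the rule later, one possible workflow (a sketch, with example file names) is to edit the decompiled map, recompile it, and inject it back; as noted above, expect data movement once the new map is active:

Code:
# Edit the rule in crushmap.txt, then recompile and inject the new map
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new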
 
So you're recommending that, after I resolve the full OSD issue, I should change my current rule:

Code:
rule REPLICA-DCS {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}

to this?

Code:
rule REPLICA-DCS {
    id {RULE ID}
    type replicated
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 1 type host
    step emit
}
 
Oh sorry, since my buckets are datacenters rather than rooms, it should actually be like this?

Code:
rule REPLICA-DCS {
    id {RULE ID}
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 1 type host
    step emit
}
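For completeness, if the adjusted rule ends up being created under a new name instead of replacing REPLICA-DCS in place, the pool has to be pointed at it explicitly; the pool and rule names below are placeholders:

Code:
# List the available rules and assign one to a pool
ceph osd crush rule ls
ceph osd pool set rbd-pool crush_rule REPLICA-DCS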
 
