ceph balancer mode upmap-read

Hi,

I am not happy with the current Ceph balancer, as there is too big a difference in the number of PGs per OSD.
I would like to try upmap-read, but all clients must be Reef, and mine report luminous.

Why does Proxmox use luminous clients for a Reef Ceph cluster? Can I change it to Reef and activate the upmap-read balancer?

Code:
ceph version
ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)

Code:
ceph features
{
    "mon": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 3
        }
    ],
    "osd": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 30
        }
    ],
    "client": [
        {
            "features": "0x2f018fb87aa4aafe",
            "release": "luminous",
            "num": 5
        },
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "mgr": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 3
        }
    ]
}
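I am not sure whether this output is even the right thing to look at. The command below (from the Ceph docs, not run here yet) should show the minimum client release the cluster actually requires, which as far as I understand is a separate setting from what "ceph features" reports:

Code:
# show the minimum client release the cluster currently requires
# (a separate setting from what "ceph features" reports)
ceph osd get-require-min-compat-client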
 
I have two clusters.
One is prod - the bigger one, with 30 OSDs.
The other is a test cluster with only 7.

Both have balance issues.
Here is the test one:

Code:
ceph version
ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)

Code:
ceph features
{
    "mon": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 3
        }
    ],
    "osd": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 7
        }
    ],
    "client": [
        {
            "features": "0x2f018fb87aa4aafe",
            "release": "luminous",
            "num": 3
        },
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "mgr": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 3
        }
    ]
}

Code:
ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.001174",
    "last_optimize_started": "Fri Aug 16 11:09:18 2024",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}

Code:
ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL     %USE   VAR   PGS  STATUS  TYPE NAME
-1         1.53358         -  1.5 TiB  571 GiB  560 GiB  423 KiB   11 GiB  1000 GiB  36.33  1.00    -          root default
-3         0.56239         -  576 GiB  191 GiB  188 GiB  178 KiB  3.1 GiB   384 GiB  33.25  0.92    -              host ursleipmx01
 0   main  0.22690   1.00000  232 GiB   80 GiB   79 GiB   82 KiB  1.3 GiB   152 GiB  34.62  0.95   37      up          osd.0
 3   main  0.22690   1.00000  232 GiB   74 GiB   73 GiB   37 KiB  1.3 GiB   158 GiB  31.84  0.88   33      up          osd.3
 5   main  0.10860   1.00000  111 GiB   37 GiB   37 GiB   59 KiB  527 MiB    74 GiB  33.33  0.92   17      up          osd.5
-5         0.56239         -  576 GiB  217 GiB  211 GiB  146 KiB  5.4 GiB   359 GiB  37.64  1.04    -              host ursleipmx02
 1   main  0.22690   1.00000  232 GiB   78 GiB   76 GiB   78 KiB  1.6 GiB   155 GiB  33.39  0.92   35      up          osd.1
 4   main  0.22690   1.00000  232 GiB   94 GiB   92 GiB   33 KiB  1.8 GiB   139 GiB  40.28  1.11   43      up          osd.4
 6   main  0.10860   1.00000  111 GiB   46 GiB   44 GiB   35 KiB  2.0 GiB    66 GiB  41.02  1.13   20      up          osd.6
-7         0.40880         -  419 GiB  162 GiB  160 GiB   99 KiB  2.3 GiB   256 GiB  38.75  1.07    -              host ursleipmx03
 2   main  0.40880   1.00000  419 GiB  162 GiB  160 GiB   99 KiB  2.3 GiB   256 GiB  38.75  1.07   74      up          osd.2
                       TOTAL  1.5 TiB  571 GiB  560 GiB  427 KiB   11 GiB  1000 GiB  36.33

MIN/MAX VAR: 0.88/1.13  STDDEV: 3.47

So osd.3 and osd.4 are the same size, but their PG counts are 33 vs 43.
That looks incorrect to me.

According to the Ceph docs, there should be an equal number of PGs on each OSD (±1 PG). But it is far from that.
 
How are your pools configured with regard to PGs per pool? Did you configure target_ratios for the pools so that the autoscaler knows what size you expect them to grow to?

According to the Ceph docs, there should be an equal number of PGs on each OSD (±1 PG). But it is far from that.
Also, you have cluster nodes with different numbers of OSDs and differently sized OSDs. This makes it quite a bit more difficult, as Ceph weights the OSDs by size. Therefore, an OSD of ~100 GiB will hold less data, and therefore fewer PGs, than an OSD of ~200 GiB.
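If you do want to use the autoscaler, checking its current view and, if needed, giving it a target ratio would look roughly like this (<pool> is just a placeholder for one of your pool names):

Code:
# show what the autoscaler currently assumes about each pool
ceph osd pool autoscale-status

# example only: tell the autoscaler this pool is expected to hold
# roughly 80% of the cluster's data
ceph osd pool set <pool> target_size_ratio 0.8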
 
Hi Aaron,

I have 2 clusters. This one is the lab:
[screenshot: pool configuration of the lab cluster]

This is prod:
[screenshot: pool configuration of the prod cluster]
Both have autoscale off.
Both have different numbers and sizes of OSDs, but more or less equal total size per host.
 
Besides my earlier comment regarding different OSD sizes and number of OSDs per node, these screenshots show a few more issues.

CRUSH rules: if you start using device-class-specific rules, all pools need to be assigned one, as the default replicated_rule does not distinguish between device classes. As long as any pool still has the default rule assigned, the autoscaler cannot work, because there is a (potential) overlap between the rules that it cannot resolve. Once the .mgr pool is assigned to the main rule, the autoscaler should show its recommendations again.
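Assuming your device-class rule is the "main" rule visible in the screenshots, assigning it to the .mgr pool would look roughly like this (rule and class names taken from your output, so double-check them):

Code:
# create a replicated rule limited to the "main" device class
# (only needed if such a rule does not exist yet)
ceph osd crush rule create-replicated main default host main

# move the .mgr pool off the default replicated_rule
ceph osd pool set .mgr crush_rule main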

Pool sizes: using a min_size of less than 2 considerably increases the chance of data loss/corruption and is NOT recommended, especially for production use.
If an OSD fails, you still have one replica, and with min_size=1 the pool will keep working -> new data is written. Now consider that the previously failed OSD comes back online or is replaced. The data that currently has only one replica will be recovered to other OSDs to get back to the full number of replicas (size). If the OSD holding that new data fails in the meantime, there is no known good copy left - at best some older replicas, if the first failed OSD comes back online.
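For reference, switching a pool back to the recommended values would be something like this (<pool> is a placeholder):

Code:
# recommended for production: 3 replicas, at least 2 needed to accept I/O
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2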
 
According to the Ceph docs, there should be an equal number of PGs on each OSD (±1 PG). But it is far from that.
You have a bunch of differently sized OSDs, with one of your hosts grossly undersized. What exactly are you expecting? The docs ALSO say you should have equally sized nodes.

Having some variance in PGs between OSDs is normal, since not all the data you're writing is equally sized. I'd be far more worried about having one node with a single OSD that is substantially smaller in capacity than the others - your "full" capacity is therefore around 336 GB, regardless of how much space you have on your first two nodes.
 
Hi Aaron,

Do you think it is worth moving .mgr to the main rule?
I do not use the autoscaler, that is OK for me. I set the PG number manually.

About size and min_size - yes, I know the risks, and in our case it is reasonable.
In the worst case I have 4 independent levels of backups, so I can restore the whole pool from backups.
 
Do you think it is worth moving .mgr to the main rule?
I do not use the autoscaler, that is OK for me. I set the PG number manually.
If you set the autoscale mode to "warn" you will at least know if you should change the pg_num :)
So yes, I would suggest you switch the .mgr pool to the main rule as well, to get that.
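Per pool that would be roughly (<pool> is a placeholder):

Code:
# keep setting pg_num manually, but let the autoscaler warn about a bad value
ceph osd pool set <pool> pg_autoscale_mode warn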
 
Hi Alexskysilk,

I understand about the lab one.
But I expected better PG allocation.

[screenshot: OSD list of one of the prod nodes]

For example, here is one of the prod nodes. You can see that 4 OSDs are the same size, but one has 46 PGs and another 51. That looks like roughly 10% misallocation.
In practice, they should each have something like 49 PGs.

You can also see the usage - 45% to 50+%.
 
10% variance in misallocation isn't a big deal, especially with such a small number of OSDs. Like I said, you're worried about the wrong thing. I have nodes with 36 OSDs with roughly the same variance from least-to-most utilized, and I don't worry about it as long as they're all less than 70% utilized.

As I explained, the data written isn't in neatly sized chunks. The smaller the number of PGs, the greater the expected variance. Conversely, the larger the number of PGs, the more expensive metadata operations are, so this is not necessarily a problem you want to fix.
 
Hmm.
The issue is that sometimes I need to push more data onto Ceph temporarily, and some OSDs reach 90-95% capacity while others are at 65-70%.
Mostly this is due to the number of PGs on them.
So I am trying to understand why the docs say ±1, while I see a 10+% difference in PG count per OSD.
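From what I read in the balancer docs, the upmap balancer only evens things out to within the mgr option upmap_max_deviation (default 5 PGs per OSD), so maybe lowering that would give a tighter spread? Something like this, untested:

Code:
# default is 5; a lower value makes the upmap balancer aim for a tighter PG spread
ceph config set mgr mgr/balancer/upmap_max_deviation 1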
 
The issue is that sometimes I need to push more data onto Ceph temporarily, and some OSDs reach 90-95% capacity while others are at 65-70%.
That's not workable; you need more OSDs. I hope my explanation helped you understand.

No file system likes being full, and Ceph REALLY doesn't like it. If you intend to fill up your file system, have more spare room.
 
No, I was referring to the OSD usage field from the OSD list.

Anyway. Ok. Thanks for your help.

You just confirmed that my clusters are too small, which is what I thought...
 
And what about the Ceph version?

I am on Reef now, but ceph features says Proxmox uses luminous as the client?

So I cannot try upmap-read...
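For reference, once all connected clients really are Reef-capable, I assume enabling it would be roughly this (not tried yet):

Code:
# raise the required minimum client release; Ceph refuses this if a
# connected client is too old (unless forced)
ceph osd set-require-min-compat-client reef

# switch the balancer to the read-optimizing mode and check the result
ceph balancer mode upmap-read
ceph balancer status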
 
