Ceph - Balancing OSD distribution (new in Luminous)

The Proxmox Ceph upgrade process should probably recommend that users consider changing existing buckets' distribution algorithm from 'straw' to 'straw2'. This is also a requirement for using the Ceph balancer module.
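
A quick way to see which algorithm existing buckets currently use (just a grep over the CRUSH dump, so no temporary files are needed):
Code:
ceph osd crush dump | grep '"alg"'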


Maximum variance on per-OSD distribution:
Code:
Before:                   After:
osd.20 838G (45%) used    osd.16 803G (43%) used
osd.5  546G (29%) used    osd.1  680G (37%) used


Confirm that minimum client version is jewel or higher:
Code:
[admin@kvm5b ~]# ceph osd dump|grep require_min_compat_client;
require_min_compat_client jewel
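
If the reported version is older than jewel it can be raised, assuming no pre-jewel clients are still connected:
Code:
ceph osd set-require-min-compat-client jewel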



All buckets should use straw2.
Update the buckets:
Code:
ceph osd crush set-all-straw-buckets-to-straw2;


Check:
Code:
[root@kvm1 ~]# ceph osd getcrushmap -o crush.map; crushtool -d crush.map | grep straw; rm -f crush.map
    218
    tunable straw_calc_version 1
            alg straw2
            alg straw2
            alg straw2
            alg straw2


Ceph distribution balancer:
Activate balancing:
Code:
    ceph mgr module ls
    ceph mgr module enable balancer
    ceph balancer on
    ceph balancer mode crush-compat
    ceph config-key set mgr/balancer/max_misplaced 0.01


Show configuration and state:
Code:
    ceph config-key dump
    ceph balancer status


Create a plan, review and run it. Afterwards remove all custom plans:
Code:
    ceph balancer eval
    ceph balancer optimize myplan
    ceph balancer eval myplan
    ceph balancer show myplan
    ceph balancer execute myplan
    ceph balancer reset
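
Individual plans can also be removed by name instead of resetting everything (reusing the 'myplan' name from above):
Code:
    ceph balancer rm myplan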


Wishlist:
It would be good if Proxmox VE 5.2 considered updating the Ceph client it presumably uses to monitor Ceph. That would allow us to change the balancer mode from 'crush-compat' to 'upmap'.
Code:
[admin@kvm5b ~]# ceph features
{
    "mon": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 3
        }
    },
    "mds": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 3
        }
    },
    "osd": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 29
        }
    },
    "client": {
        "group": {
            "features": "0x7010fb86aa42ada",
            "release": "jewel",
            "num": 6
        },
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 10
        }
    }
}
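
For reference, once every connected client reports luminous, the switch should only need the following two commands (untested here, since our clients still include jewel):
Code:
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap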


Before rebalancing:
Code:
[admin@kvm5a ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 0   hdd 1.81898  1.00000  1862G   680G  1181G 36.55 1.21  66
 1   hdd 1.81898  1.00000  1862G   588G  1273G 31.60 1.05  66
 2   hdd 1.81898  1.00000  1862G   704G  1157G 37.85 1.25  75
 3   hdd 1.81898  1.00000  1862G   682G  1179G 36.66 1.21  74
24  nvme 2.91089  1.00000  2980G   114G  2865G  3.85 0.13  71
 4   hdd 1.81898  1.00000  1862G   804G  1057G 43.19 1.43  84
 5   hdd 1.81898  1.00000  1862G   546G  1315G 29.34 0.97  56
 6   hdd 1.81898  1.00000  1862G   623G  1238G 33.51 1.11  65
 7   hdd 1.81898  1.00000  1862G   949G   912G 51.01 1.69  91
25  nvme 2.91089  1.00000  2980G   114G  2865G  3.86 0.13  68
 8   hdd 1.81898  1.00000  1862G   692G  1169G 37.18 1.23  70
 9   hdd 1.81898  1.00000  1862G   716G  1145G 38.50 1.28  78
10   hdd 1.81898  1.00000  1862G   666G  1195G 35.82 1.19  69
11   hdd 1.81898  1.00000  1862G   903G   958G 48.51 1.61  90
26  nvme 2.91089  1.00000  2980G   114G  2866G  3.84 0.13  74
12   hdd 1.81898  1.00000  1862G   748G  1113G 40.20 1.33  73
13   hdd 1.81898  1.00000  1862G   835G  1026G 44.85 1.49  85
14   hdd 1.81898  1.00000  1862G   760G  1101G 40.83 1.35  77
15   hdd 1.81898  1.00000  1862G   593G  1268G 31.85 1.06  64
27  nvme 2.91089  1.00000  2980G   114G  2866G  3.83 0.13  71
16   hdd 1.81898  1.00000  1862G   804G  1057G 43.23 1.43  75
17   hdd 1.81898  1.00000  1862G   700G  1161G 37.62 1.25  73
18   hdd 1.81898  1.00000  1862G   622G  1239G 33.44 1.11  65
19   hdd 1.81898  1.00000  1862G   716G  1145G 38.50 1.28  73
28  nvme 2.91089  1.00000  2980G   114G  2866G  3.84 0.13  68
20   hdd 1.81898  1.00000  1862G   838G  1023G 45.01 1.49  86
21   hdd 1.81898  1.00000  1862G   758G  1103G 40.75 1.35  75
22   hdd 1.81898  1.00000  1862G   714G  1147G 38.37 1.27  69
23   hdd 1.81898  1.00000  1862G   760G  1101G 40.82 1.35  77
                    TOTAL 59594G 17987G 41607G 30.18
MIN/MAX VAR: 0.13/1.69  STDDEV: 14.34


After rebalancing:
Code:
[admin@kvm5b ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 0   hdd 1.81898  1.00000  1862G   769G  1092G 41.34 1.37  74
 1   hdd 1.81898  1.00000  1862G   680G  1181G 36.53 1.21  74
 2   hdd 1.81898  1.00000  1862G   691G  1170G 37.16 1.23  74
 3   hdd 1.81898  1.00000  1862G   682G  1179G 36.68 1.22  74
24  nvme 2.91089  1.00000  2980G   114G  2866G  3.85 0.13  71
 4   hdd 1.81898  1.00000  1862G   712G  1149G 38.29 1.27  74
 5   hdd 1.81898  1.00000  1862G   736G  1125G 39.53 1.31  74
 6   hdd 1.81898  1.00000  1862G   715G  1146G 38.44 1.27  74
 7   hdd 1.81898  1.00000  1862G   758G  1104G 40.71 1.35  74
25  nvme 2.91089  1.00000  2980G   115G  2865G  3.88 0.13  71
 8   hdd 1.81898  1.00000  1862G   746G  1115G 40.12 1.33  74
 9   hdd 1.81898  1.00000  1862G   669G  1192G 35.98 1.19  74
10   hdd 1.81898  1.00000  1862G   712G  1149G 38.26 1.27  74
11   hdd 1.81898  1.00000  1862G   736G  1125G 39.54 1.31  74
26  nvme 2.91089  1.00000  2980G   114G  2866G  3.84 0.13  70
12   hdd 1.81898  1.00000  1862G   760G  1101G 40.84 1.35  74
13   hdd 1.81898  1.00000  1862G   722G  1139G 38.82 1.29  74
14   hdd 1.81898  1.00000  1862G   727G  1134G 39.07 1.29  74
15   hdd 1.81898  1.00000  1862G   704G  1157G 37.82 1.25  74
27  nvme 2.91089  1.00000  2980G   115G  2865G  3.87 0.13  70
16   hdd 1.81898  1.00000  1862G   803G  1058G 43.16 1.43  74
17   hdd 1.81898  1.00000  1862G   713G  1149G 38.30 1.27  74
18   hdd 1.81898  1.00000  1862G   690G  1171G 37.10 1.23  74
19   hdd 1.81898  1.00000  1862G   728G  1133G 39.14 1.30  74
28  nvme 2.91089  1.00000  2980G   114G  2866G  3.83 0.13  70
20   hdd 1.81898  1.00000  1862G   714G  1147G 38.37 1.27  74
21   hdd 1.81898  1.00000  1862G   723G  1138G 38.88 1.29  74
22   hdd 1.81898  1.00000  1862G   769G  1092G 41.31 1.37  74
23   hdd 1.81898  1.00000  1862G   738G  1123G 39.65 1.31  74
                    TOTAL 59594G 17985G 41608G 30.18
MIN/MAX VAR: 0.13/1.43  STDDEV: 13.62
 
I just upgraded from Jewel to Luminous and was wondering if this is still relevant. I see that all of my nodes are currently configured with "alg straw" so it appears to still be the case...

Thanks!
Dan
 
Yip, I'd still recommend that Proxmox update their wiki to get users to convert all straw buckets to straw2; most deployments would benefit from minimising full OSDs caused by uneven data distribution.

It would be nice if Proxmox updated their monitoring tools to be Luminous based, so that we could switch to the Luminous-only 'upmap' mode.

This has been working in 3 production clusters for over 6 months without a single hiccup...
 
David,

A big THANK YOU for posting this info here. I just upgraded my environment this past weekend from 4.x to 5.2-9, and one of the motivators to move to Luminous was being able to rebalance my OSDs. I have large percentage skews. I'll follow your instructions! Looking forward to having a more even balance of data!

<D>
 
BTW: I just ran ceph features and under client mine is showing luminous. I upgraded from 4.x to 5.2-9 this weekend... so based on this it appears that Proxmox upgraded their client and we can now use upmap... correct?
 
Also, a question: it appears you run this as a manual task... In the Ceph documentation under the balancer plugin it appears that it "runs automatically" if it's enabled with ceph balancer on.

From experience, is it better to run it manually once in a while like you are doing, or just leave it running?


http://docs.ceph.com/docs/mimic/mgr/balancer/
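
To be clear, by running it manually I just mean keeping the automatic mode switched off and executing plans by hand, i.e.:
Code:
ceph balancer off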

Since it was the first time I have ever performed a Ceph/Proxmox upgrade I wanted to be careful with my data... so I migrated most of it off Ceph onto other types of storage. I haven't migrated it back... so in my scenario it would be a whole lot of data moving (50TB?), and in the past, because of the data imbalance, I've run OSDs out of space and had to roll back / cancel migrations.

Wondering how the balancer might be able to help during a large data migration... thoughts?

<D>
 
Another question:

What is the difference between the line you suggest:

ceph config-key set mgr/balancer/max_misplaced 0.01


and the one in the documentation:

ceph config set mgr mgr/balancer/max_misplaced .07 # 7%


Are they both doing the same thing?
 
This configures what fraction of PGs the balancer is allowed to have misplaced (i.e. moving) at any one time. We are using the default of 5%, with 1TB HDDs.
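
To check which value is actually in effect you can filter the config-key dump:
Code:
ceph config-key dump | grep balancer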
 
Ok, so both of these lines set the same thing, the maximum fraction of misplaced PGs, just via different mechanisms. Got it. Thanks!
 
I ran through these settings yesterday and it worked great.

The earlier point about not being able to run upmap because the Proxmox version of the Ceph client was still jewel seems to have changed... when I run ceph features I only see luminous listed in the client section. Can someone confirm? upmap seems to be the path forward.
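
One way to double-check, simply filtering the release fields out of the ceph features output:
Code:
ceph features | grep '"release"'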

Cheers,

<D>
 
