ceph crushmap mixed up

northe

Hi folks!
I have a Ceph cluster running with 2 tiers.
Now, months later, I see that there must have been a copy/paste issue in the prepared text file which I inserted as configuration into the crush map. =8-O

ceph osd tree
-5 54.53394 host node1_TIER1
2 hdd 9.08899 osd.2 up 1.00000 1.00000
3 hdd 9.08899 osd.3 up 1.00000 1.00000
4 hdd 9.08899 osd.4 up 1.00000 1.00000
5 hdd 9.08899 osd.5 up 1.00000 1.00000
6 hdd 9.08899 osd.6 up 1.00000 1.00000
7 hdd 9.08899 osd.7 up 1.00000 1.00000
-7 54.53394 host node2_TIER1
10 hdd 9.08899 osd.10 up 1.00000 1.00000
11 hdd 9.08899 osd.11 up 1.00000 1.00000
12 hdd 9.08899 osd.12 up 1.00000 1.00000
13 hdd 9.08899 osd.13 up 1.00000 1.00000
14 hdd 9.08899 osd.14 up 1.00000 1.00000
15 hdd 9.08899 osd.15 up 1.00000 1.00000
-9 54.53394 host node3_TIER1
10 hdd 9.08899 osd.10 up 1.00000 1.00000
11 hdd 9.08899 osd.11 up 1.00000 1.00000
12 hdd 9.08899 osd.12 up 1.00000 1.00000
13 hdd 9.08899 osd.13 up 1.00000 1.00000
14 hdd 9.08899 osd.14 up 1.00000 1.00000
15 hdd 9.08899 osd.15 up 1.00000 1.00000
..
..

You can see that node2 and node3 have exactly the same OSDs.
node3 should instead have:
-9 54.53394 host node3_TIER1
10 hdd 9.08899 osd.18 up 1.00000 1.00000
11 hdd 9.08899 osd.19 up 1.00000 1.00000
12 hdd 9.08899 osd.20 up 1.00000 1.00000
13 hdd 9.08899 osd.21 up 1.00000 1.00000
14 hdd 9.08899 osd.22 up 1.00000 1.00000
15 hdd 9.08899 osd.23 up 1.00000 1.00000
These OSDs 18-23 are currently not assigned to or used by any tier.
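A generic way to double-check that osd.18-23 really hold no data before touching them (standard Ceph CLI, not specific to this setup):

ceph osd df              # the PGS column should show 0 for osd.18-23
ceph pg ls-by-osd 18     # should list no placement groups for osd.18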


What is the correct and safe procedure to remove the wrongly assigned OSDs from node3?

Thank you for your help!!
 
What does your 'osd tree' look like? I guess those OSDs are not used at all.
 
This is the complete ceph osd tree:

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-3 254.49170 root TIER1
-5 54.53394 host node1_TIER1
2 hdd 9.08899 osd.2 up 1.00000 1.00000
3 hdd 9.08899 osd.3 up 1.00000 1.00000
4 hdd 9.08899 osd.4 up 1.00000 1.00000
5 hdd 9.08899 osd.5 up 1.00000 1.00000
6 hdd 9.08899 osd.6 up 1.00000 1.00000
7 hdd 9.08899 osd.7 up 1.00000 1.00000
-7 54.53394 host node2_TIER1
10 hdd 9.08899 osd.10 up 1.00000 1.00000
11 hdd 9.08899 osd.11 up 1.00000 1.00000
12 hdd 9.08899 osd.12 up 1.00000 1.00000
13 hdd 9.08899 osd.13 up 1.00000 1.00000
14 hdd 9.08899 osd.14 up 1.00000 1.00000
15 hdd 9.08899 osd.15 up 1.00000 1.00000
-9 54.53394 host node3_TIER1
10 hdd 9.08899 osd.10 up 1.00000 1.00000
11 hdd 9.08899 osd.11 up 1.00000 1.00000
12 hdd 9.08899 osd.12 up 1.00000 1.00000
13 hdd 9.08899 osd.13 up 1.00000 1.00000
14 hdd 9.08899 osd.14 up 1.00000 1.00000
15 hdd 9.08899 osd.15 up 1.00000 1.00000
-11 45.44495 host node4_TIER1
26 hdd 9.08899 osd.26 up 1.00000 1.00000
27 hdd 9.08899 osd.27 up 1.00000 1.00000
28 hdd 9.08899 osd.28 up 1.00000 1.00000
29 hdd 9.08899 osd.29 up 1.00000 1.00000
30 hdd 9.08899 osd.30 up 1.00000 1.00000
-13 45.44495 host node5_TIER1
34 hdd 9.08899 osd.34 up 1.00000 1.00000
35 hdd 9.08899 osd.35 up 1.00000 1.00000
36 hdd 9.08899 osd.36 up 1.00000 1.00000
37 hdd 9.08899 osd.37 up 1.00000 1.00000
39 hdd 9.08899 osd.39 up 1.00000 1.00000
-2 90.88989 root TIER0
-4 18.17798 host node1_TIER0
0 hdd 9.08899 osd.0 up 1.00000 1.00000
1 hdd 9.08899 osd.1 up 1.00000 1.00000
-6 18.17798 host node2_TIER0
8 hdd 9.08899 osd.8 up 1.00000 1.00000
9 hdd 9.08899 osd.9 up 1.00000 1.00000
-8 18.17798 host node3_TIER0
16 hdd 9.08899 osd.16 up 1.00000 1.00000
17 hdd 9.08899 osd.17 up 1.00000 1.00000
-10 18.17798 host node4_TIER0
24 hdd 9.08899 osd.24 up 1.00000 1.00000
25 hdd 9.08899 osd.25 up 1.00000 1.00000
-12 18.17798 host node5_TIER0
32 hdd 9.08899 osd.32 up 1.00000 1.00000
33 hdd 9.08899 osd.33 up 1.00000 1.00000
-1 0 root default
18 hdd 0 osd.18 up 1.00000 1.00000
19 hdd 0 osd.19 up 1.00000 1.00000
20 hdd 0 osd.20 up 1.00000 1.00000
21 hdd 0 osd.21 up 1.00000 1.00000
22 hdd 0 osd.22 up 1.00000 1.00000
23 hdd 0 osd.23 up 1.00000 1.00000
 
and its usage:
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
2 hdd 9.08899 1.00000 9313G 1882G 7431G 20.21 1.16 217
3 hdd 9.08899 1.00000 9313G 1856G 7457G 19.94 1.15 214
4 hdd 9.08899 1.00000 9313G 1938G 7375G 20.81 1.20 223
5 hdd 9.08899 1.00000 9313G 2100G 7213G 22.55 1.30 242
6 hdd 9.08899 1.00000 9313G 1926G 7387G 20.68 1.19 222
7 hdd 9.08899 1.00000 9313G 2229G 7083G 23.94 1.38 257
10 hdd 9.08899 1.00000 9313G 3703G 5610G 39.76 2.28 426
11 hdd 9.08899 1.00000 9313G 3841G 5472G 41.24 2.37 443
12 hdd 9.08899 1.00000 9313G 3693G 5620G 39.65 2.28 426
13 hdd 9.08899 1.00000 9313G 3761G 5552G 40.39 2.32 433
14 hdd 9.08899 1.00000 9313G 3607G 5706G 38.73 2.23 416
15 hdd 9.08899 1.00000 9313G 3603G 5709G 38.69 2.22 416
10 hdd 9.08899 1.00000 9313G 3703G 5610G 39.76 2.28 426
11 hdd 9.08899 1.00000 9313G 3841G 5472G 41.24 2.37 443
12 hdd 9.08899 1.00000 9313G 3693G 5620G 39.65 2.28 426
13 hdd 9.08899 1.00000 9313G 3761G 5552G 40.39 2.32 433
14 hdd 9.08899 1.00000 9313G 3607G 5706G 38.73 2.23 416
15 hdd 9.08899 1.00000 9313G 3603G 5709G 38.69 2.22 416
26 hdd 9.08899 1.00000 9313G 1699G 7614G 18.24 1.05 193
27 hdd 9.08899 1.00000 9313G 1852G 7461G 19.88 1.14 177
28 hdd 9.08899 1.00000 9313G 1748G 7565G 18.78 1.08 190
29 hdd 9.08899 1.00000 9313G 2035G 7278G 21.85 1.26 228
30 hdd 9.08899 1.00000 9313G 2125G 7188G 22.82 1.31 230
34 hdd 9.08899 1.00000 9313G 2087G 7226G 22.42 1.29 239
35 hdd 9.08899 1.00000 9313G 2076G 7237G 22.29 1.28 237
36 hdd 9.08899 1.00000 9313G 2073G 7240G 22.26 1.28 238
37 hdd 9.08899 1.00000 9313G 2046G 7267G 21.97 1.26 234
39 hdd 9.08899 1.00000 9313G 2110G 7203G 22.66 1.30 243
0 hdd 9.08899 1.00000 9313G 748G 8565G 8.03 0.46 152
1 hdd 9.08899 1.00000 9313G 732G 8581G 7.86 0.45 148
8 hdd 9.08899 1.00000 9313G 819G 8494G 8.80 0.51 166
9 hdd 9.08899 1.00000 9313G 791G 8522G 8.50 0.49 160
16 hdd 9.08899 1.00000 9313G 784G 8529G 8.42 0.48 159
17 hdd 9.08899 1.00000 9313G 752G 8561G 8.08 0.46 152
24 hdd 9.08899 1.00000 9313G 740G 8573G 7.95 0.46 150
25 hdd 9.08899 1.00000 9313G 716G 8597G 7.69 0.44 145
32 hdd 9.08899 1.00000 9313G 775G 8538G 8.32 0.48 157
33 hdd 9.08899 1.00000 9313G 725G 8587G 7.79 0.45 147
18 hdd 0 1.00000 9313G 1485M 9312G 0.02 0 0
19 hdd 0 1.00000 9313G 1486M 9312G 0.02 0 0
20 hdd 0 1.00000 9313G 1485M 9312G 0.02 0 0
21 hdd 0 1.00000 9313G 1485M 9312G 0.02 0 0
22 hdd 0 1.00000 9313G 1486M 9312G 0.02 0 0
23 hdd 0 1.00000 9313G 1486M 9312G 0.02 0 0
TOTAL 345T 61596G 285T 17.40
MIN/MAX VAR: 0/2.37 STDDEV: 14.27
 
There will definitely be a rebalance (recovery).

There are different ways to go. Either you do it all in one go and have a long recovery (a long degraded state): remove the disks from crush on host node3_TIER1 and add the missing ones. Or you set one of the OSDs on host node3_TIER1 to weight 0 and add one of the OSDs from the default root (18-23); this way the recovery should copy mostly from one disk to the other on the same host. You need to do this through the crush map, as the ceph tools (ceph osd crush) will probably alter the OSD on both hosts.

Anyway, you cannot leave it like it is now, not only because of the wrong IDs, but also because PGs are now placed on the same OSD twice. If a disk fails, two copies of an object are lost simultaneously.

EDIT: added the explanation for the first way.
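For the crush-map route, a minimal sketch of the export/edit/inject cycle (file names are just examples):

ceph osd getcrushmap -o crushmap.bin        # export the current binary map
crushtool -d crushmap.bin -o crushmap.txt   # decompile it to editable text
# edit crushmap.txt: fix the item lines under host node3_TIER1
crushtool -c crushmap.txt -o crushmap.new   # compile the edited map
ceph osd setcrushmap -i crushmap.new        # inject it back into the cluster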
 
Okay, I will go the first way: one big bang.
So do you think I can proceed like below, or should I start with adding the OSDs from the root, let it balance, and when finished remove the OSDs 10-15 from node3_TIER1?
Or is it better to export the crush map, alter it and reinject it? The latter I have never done before.

ceph osd crush rm osd.10 node3_TIER1
ceph osd crush rm osd.11 node3_TIER1
ceph osd crush rm osd.12 node3_TIER1
ceph osd crush rm osd.13 node3_TIER1
ceph osd crush rm osd.14 node3_TIER1
ceph osd crush rm osd.15 node3_TIER1

ceph osd crush add osd.18 9.089 host=node3_TIER1
ceph osd crush add osd.19 9.089 host=node3_TIER1
ceph osd crush add osd.20 9.089 host=node3_TIER1
ceph osd crush add osd.21 9.089 host=node3_TIER1
ceph osd crush add osd.22 9.089 host=node3_TIER1
ceph osd crush add osd.23 9.089 host=node3_TIER1

ceph osd crush set osd.18 9.089 root=TIER1 host=node3_TIER1
ceph osd crush set osd.19 9.089 root=TIER1 host=node3_TIER1
ceph osd crush set osd.20 9.089 root=TIER1 host=node3_TIER1
ceph osd crush set osd.21 9.089 root=TIER1 host=node3_TIER1
ceph osd crush set osd.22 9.089 root=TIER1 host=node3_TIER1
ceph osd crush set osd.23 9.089 root=TIER1 host=node3_TIER1
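If everything is done in one go, a common generic pattern (not mentioned in this thread) is to pause data movement while all crush changes are applied, so recovery only starts once:

ceph osd set norebalance
ceph osd set nobackfill
# ... apply all the crush rm/add/set commands above ...
ceph osd unset nobackfill
ceph osd unset norebalance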

Do you think this is the reason for the poor performance of disk operations of the VMs on that pool?
https://forum.proxmox.com/threads/pve-5-1-46-ceph-bluestore-poor-performance-with-smal-files.42928/

Thank you Alwin
 
Okay, I will go the first way: one big bang.
So do you think I can proceed like below, or should I start with adding the OSDs from the root, let it balance, and when finished remove the OSDs 10-15 from node3_TIER1?
Or is it better to export the crush map, alter it and reinject it? The latter I have never done before.
Add the disks from the default root and set the OSDs on host node3_TIER1 to weight 0. This way the objects are still there and will only be read from those OSDs. You need to alter the crush map and inject it, as I think the ceph CLI will alter the OSDs on both nodes.
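A sketch of what the host bucket could look like in the decompiled crush map during that intermediate step (bucket id and weights taken from the osd tree above; the alg/hash lines are typical defaults and may differ):

host node3_TIER1 {
    id -9
    alg straw2
    hash 0  # rjenkins1
    # old, wrongly assigned entries drained to weight 0
    item osd.10 weight 0.000
    item osd.11 weight 0.000
    item osd.12 weight 0.000
    item osd.13 weight 0.000
    item osd.14 weight 0.000
    item osd.15 weight 0.000
    # correct OSDs added with their full weight
    item osd.18 weight 9.089
    item osd.19 weight 9.089
    item osd.20 weight 9.089
    item osd.21 weight 9.089
    item osd.22 weight 9.089
    item osd.23 weight 9.089
}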

Do you think this is the reason for the poor performance of disk operations of the VMs on that pool?
Yes, as those OSDs with the double entries are hit twice.

And create backups before you do anything, just to be safe.
 
I managed it without exporting, altering and reinjecting the crush map.
In order not to hit the limit of placement groups per OSD introduced with Luminous, and to keep enough replicas available, I added all the OSDs from the default root to node3_TIER1 at once and let it rebalance. This improved overall performance, because reads/writes spread over more OSDs and the wrongly assigned OSDs lost some of their usage, which was/is twice that of the correctly assigned OSDs.

After rebalancing I removed the wrongly assigned OSDs one by one, always waiting for the rebalancing to finish in between:
ceph osd crush rm osd.XX node3_TIER1
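A rough sketch of how that one-by-one pacing could be scripted (the HEALTH_OK check is a simple heuristic, not taken from this thread):

for id in 10 11 12 13 14 15; do
    ceph osd crush rm osd.$id node3_TIER1
    # wait until the cluster reports healthy again before the next removal
    until ceph health | grep -q HEALTH_OK; do sleep 60; done
done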

Now ceph osd tree looks fine :-)

A note for the GUI developers:
In the GUI I never saw the OSDs in the default root, only the ones assigned to TIER0 and TIER1.
Also, node2_TIER1 and node3_TIER1 with the same OSDs were never listed together; either node2 or node3 was collapsed, so this mistake is not visible unless you already suspect such a misconfiguration.

Alwin, thank you very much for your support!!
 
