Hello,
We upgraded our cluster from Nautilus to Quincy two weeks ago and wanted to get rid of old settings we have carried since Luminous or earlier: an SSD / HDD split using separate CRUSH roots, from before Ceph had device classes:
https://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
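For reference, with device classes the same split needs only the one default root plus a per-class rule; the fc-r02-ssd rule shown further down (rule_id 3, taking default~ssd) looks like the device-class equivalent, presumably created with something like:
Code:
ceph osd crush rule create-replicated fc-r02-ssd default host ssd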
We executed:
Code:
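# move each host bucket (with all its OSDs) out of the legacy "ssds" root into the default root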
ceph osd crush move fc-r02-ceph-osd-01 root=default
...
ceph osd crush move fc-r02-ceph-osd-06 root=default
and checked what happened, but not closely enough: Ceph health was OK, but today I saw:
Code:
root@fc-r02-ceph-osd-01:[~]: ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
nvme   1.8 TiB  786 GiB  1.1 TiB   1.1 TiB      57.80
ssd     21 TiB  9.8 TiB   11 TiB    11 TiB      53.86
TOTAL   23 TiB   11 TiB   13 TiB    13 TiB      54.17

--- POOLS ---
POOL      ID  PGS   STORED   OBJECTS  USED     %USED   MAX AVAIL
ssd-pool   1  2048  4.2 TiB  1.14M    12 TiB   100.00        0 B
db-pool    4   128  50 MiB   3        151 MiB  100.00        0 B
.mgr       5     1  43 MiB   12       130 MiB       0    2.4 TiB
root@fc-r02-ceph-osd-01:[~]: ceph -s
  cluster:
    id:     cfca8c93-f3be-4b86-b9cb-8da095ca2c26
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum fc-r02-ceph-osd-01,fc-r02-ceph-osd-02,fc-r02-ceph-osd-03,fc-r02-ceph-osd-05,fc-r02-ceph-osd-06 (age 2w)
    mgr: fc-r02-ceph-osd-06(active, since 2w), standbys: fc-r02-ceph-osd-02, fc-r02-ceph-osd-03, fc-r02-ceph-osd-01, fc-r02-ceph-osd-05, fc-r02-ceph-osd-04
    osd: 54 osds: 54 up (since 2w), 54 in (since 2w); 2176 remapped pgs

  data:
    pools:   3 pools, 2177 pgs
    objects: 1.14M objects, 4.3 TiB
    usage:   13 TiB used, 11 TiB / 23 TiB avail
    pgs:     5684530/3410754 objects misplaced (166.665%)
             2176 active+clean+remapped
             1    active+clean

  io:
    client: 906 KiB/s rd, 13 MiB/s wr, 38 op/s rd, 986 op/s wr
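The remapped state means a PG's up set differs from its acting set; the per-PG detail can be listed with the standard pg dump (output omitted here):
Code:
ceph pg dump pgs_brief | grep remapped | head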
So just moving the host buckets to the default root is not enough: the pools' CRUSH rule still takes from the old, now-empty r02-ssds bucket, which is why MAX AVAIL shows 0 B and all PGs are remapped. We are unsure how to fix it.
The tree looks like this:
Code:
root@fc-r02-ceph-osd-01:[~]: ceph osd crush tree --show-shadow
ID CLASS WEIGHT TYPE NAME
-39 nvme 1.81938 root default~nvme
-30 nvme 0 host fc-r02-ceph-osd-01~nvme
-31 nvme 0.36388 host fc-r02-ceph-osd-02~nvme
36 nvme 0.36388 osd.36
-32 nvme 0.36388 host fc-r02-ceph-osd-03~nvme
40 nvme 0.36388 osd.40
-33 nvme 0.36388 host fc-r02-ceph-osd-04~nvme
37 nvme 0.36388 osd.37
-34 nvme 0.36388 host fc-r02-ceph-osd-05~nvme
38 nvme 0.36388 osd.38
-35 nvme 0.36388 host fc-r02-ceph-osd-06~nvme
39 nvme 0.36388 osd.39
-38 nvme 0 root ssds~nvme
-37 nvme 0 datacenter fc-ssds~nvme
-36 nvme 0 rack r02-ssds~nvme
-29 nvme 0 root sata~nvme
-28 nvme 0 datacenter fc-sata~nvme
-27 nvme 0 rack r02-sata~nvme
-24 ssd 0 root ssds~ssd
-23 ssd 0 datacenter fc-ssds~ssd
-21 ssd 0 rack r02-ssds~ssd
-22 ssd 0 root sata~ssd
-19 ssd 0 datacenter fc-sata~ssd
-20 ssd 0 rack r02-sata~ssd
-14 0 root sata
-18 0 datacenter fc-sata
-16 0 rack r02-sata
-13 0 root ssds
-17 0 datacenter fc-ssds
-15 0 rack r02-ssds
-4 ssd 22.17122 root default~ssd
-7 ssd 4.00145 host fc-r02-ceph-osd-01~ssd
0 ssd 0.45470 osd.0
1 ssd 0.45470 osd.1
2 ssd 0.45470 osd.2
3 ssd 0.45470 osd.3
4 ssd 0.45470 osd.4
5 ssd 0.45470 osd.5
41 ssd 0.36388 osd.41
42 ssd 0.45470 osd.42
48 ssd 0.45470 osd.48
-3 ssd 3.61948 host fc-r02-ceph-osd-02~ssd
6 ssd 0.45470 osd.6
7 ssd 0.45470 osd.7
8 ssd 0.45470 osd.8
9 ssd 0.45470 osd.9
10 ssd 0.43660 osd.10
29 ssd 0.45470 osd.29
43 ssd 0.45470 osd.43
49 ssd 0.45470 osd.49
-8 ssd 3.63757 host fc-r02-ceph-osd-03~ssd
11 ssd 0.45470 osd.11
12 ssd 0.45470 osd.12
13 ssd 0.45470 osd.13
14 ssd 0.45470 osd.14
15 ssd 0.45470 osd.15
16 ssd 0.45470 osd.16
44 ssd 0.45470 osd.44
50 ssd 0.45470 osd.50
-10 ssd 3.63757 host fc-r02-ceph-osd-04~ssd
30 ssd 0.45470 osd.30
31 ssd 0.45470 osd.31
32 ssd 0.45470 osd.32
33 ssd 0.45470 osd.33
34 ssd 0.45470 osd.34
35 ssd 0.45470 osd.35
45 ssd 0.45470 osd.45
51 ssd 0.45470 osd.51
-12 ssd 3.63757 host fc-r02-ceph-osd-05~ssd
17 ssd 0.45470 osd.17
18 ssd 0.45470 osd.18
19 ssd 0.45470 osd.19
20 ssd 0.45470 osd.20
21 ssd 0.45470 osd.21
22 ssd 0.45470 osd.22
46 ssd 0.45470 osd.46
52 ssd 0.45470 osd.52
-26 ssd 3.63757 host fc-r02-ceph-osd-06~ssd
23 ssd 0.45470 osd.23
24 ssd 0.45470 osd.24
25 ssd 0.45470 osd.25
26 ssd 0.45470 osd.26
27 ssd 0.45470 osd.27
28 ssd 0.45470 osd.28
47 ssd 0.45470 osd.47
53 ssd 0.45470 osd.53
-1 23.99060 root default
-6 4.00145 host fc-r02-ceph-osd-01
0 ssd 0.45470 osd.0
1 ssd 0.45470 osd.1
2 ssd 0.45470 osd.2
3 ssd 0.45470 osd.3
4 ssd 0.45470 osd.4
5 ssd 0.45470 osd.5
41 ssd 0.36388 osd.41
42 ssd 0.45470 osd.42
48 ssd 0.45470 osd.48
-2 3.98335 host fc-r02-ceph-osd-02
36 nvme 0.36388 osd.36
6 ssd 0.45470 osd.6
7 ssd 0.45470 osd.7
8 ssd 0.45470 osd.8
9 ssd 0.45470 osd.9
10 ssd 0.43660 osd.10
29 ssd 0.45470 osd.29
43 ssd 0.45470 osd.43
49 ssd 0.45470 osd.49
-5 4.00145 host fc-r02-ceph-osd-03
40 nvme 0.36388 osd.40
11 ssd 0.45470 osd.11
12 ssd 0.45470 osd.12
13 ssd 0.45470 osd.13
14 ssd 0.45470 osd.14
15 ssd 0.45470 osd.15
16 ssd 0.45470 osd.16
44 ssd 0.45470 osd.44
50 ssd 0.45470 osd.50
-9 4.00145 host fc-r02-ceph-osd-04
37 nvme 0.36388 osd.37
30 ssd 0.45470 osd.30
31 ssd 0.45470 osd.31
32 ssd 0.45470 osd.32
33 ssd 0.45470 osd.33
34 ssd 0.45470 osd.34
35 ssd 0.45470 osd.35
45 ssd 0.45470 osd.45
51 ssd 0.45470 osd.51
-11 4.00145 host fc-r02-ceph-osd-05
38 nvme 0.36388 osd.38
17 ssd 0.45470 osd.17
18 ssd 0.45470 osd.18
19 ssd 0.45470 osd.19
20 ssd 0.45470 osd.20
21 ssd 0.45470 osd.21
22 ssd 0.45470 osd.22
46 ssd 0.45470 osd.46
52 ssd 0.45470 osd.52
-25 4.00145 host fc-r02-ceph-osd-06
39 nvme 0.36388 osd.39
23 ssd 0.45470 osd.23
24 ssd 0.45470 osd.24
25 ssd 0.45470 osd.25
26 ssd 0.45470 osd.26
27 ssd 0.45470 osd.27
28 ssd 0.45470 osd.28
47 ssd 0.45470 osd.47
53 ssd 0.45470 osd.53
The rule:
Code:
root@fc-r02-ceph-osd-01:[~]: ceph osd pool get db-pool crush_rule
crush_rule: fc-r02-ssdpool
root@fc-r02-ceph-osd-01:[~]: ceph osd pool get ssd-pool crush_rule
crush_rule: fc-r02-ssdpool
The CRUSH rules:
Code:
root@fc-r02-ceph-osd-01:[~]: ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "fc-r02-ssdpool",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -15,
                "item_name": "r02-ssds"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "fc-r02-satapool",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -16,
                "item_name": "r02-sata"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 3,
        "rule_name": "fc-r02-ssd",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -4,
                "item_name": "default~ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
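Our assumption is that the fix is roughly the following (an untested sketch; it presumes the existing device-class rule fc-r02-ssd is the right target for both pools):
Code:
# repoint both pools at the device-class rule that takes from default~ssd
ceph osd pool set ssd-pool crush_rule fc-r02-ssd
ceph osd pool set db-pool crush_rule fc-r02-ssd
# once no pool references them any more, drop the legacy rules ...
ceph osd crush rule rm fc-r02-ssdpool
ceph osd crush rule rm fc-r02-satapool
# ... and remove the now-empty legacy buckets, innermost first
ceph osd crush remove r02-ssds
ceph osd crush remove fc-ssds
ceph osd crush remove ssds
ceph osd crush remove r02-sata
ceph osd crush remove fc-sata
ceph osd crush remove sata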
Which steps are required to fix this and get back to the Ceph defaults? Is something like the sketch above correct? I assume we need to take the datacenter offline, as it will certainly rebalance a lot. The question is whether to power off the VMs or not, as I fear filesystem corruption.
Any help would be great!