Code:
proxmox 6.1-8
ceph 14.2.22
I apologize in advance for the length of this post. I'll try to keep the explanation of the scenario as concise as possible.
I inherited a Proxmox 6.1 cluster with some unusual configuration. Hosts a and b have a mix of HDDs and SSDs, while host c is a cluster member with no Ceph storage of its own. The cluster has worked for several years but is reaching replacement age for a variety of reasons.
Yesterday we had an SSD fail in host b and the recovery got stuck because another SSD in host a was over 90% full. The HDDs had next to no utilization and it took me a while to figure out why: all OSDs were in the same pool, and back in 2019 someone made CRUSH rule changes so that the pool only used devices of class ssd. I changed the pool's crush rule to 'replicated_ruleset' (an original rule, perhaps?), which only uses devices of class hdd. This change forced ALL data to be moved to the HDDs, and the recovery was finally able to complete.
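For reference, that change was just a crush_rule switch on the pool, essentially the following (the pool name matches the 'rbd' pool shown further down, but treat the exact command as from memory):
Code:
ceph osd pool set rbd crush_rule replicated_ruleset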
I then removed all the SSD OSDs, created a new 'rbdssd' pool, recreated the SSD OSDs (yes, I likely didn't need to recreate them) and pointed that pool at the customized, existing 'ssd_ruleset'. Today, while trying to clean up the mess, we rebooted the three hosts in a controlled manner: first c, then b and finally a. Now I see there is a recovery that is again stuck. It appears to be related to the SSD devices, and I'm wondering if the cause is an issue with the customized CRUSH rule.
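The rough sequence was along these lines (OSD IDs and PG counts are approximate, from memory):
Code:
# for each old SSD OSD, e.g. osd.17
ceph osd out 17
systemctl stop ceph-osd@17
ceph osd purge 17 --yes-i-really-mean-it
# after recreating the SSD OSDs, a new pool pointed at the existing rule
ceph osd pool create rbdssd 128 128
ceph osd pool set rbdssd crush_rule ssd_ruleset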
Looking at the 'ssd_ruleset', I see it contains one key section that differs from the 'replicated_ruleset':
Code:
from ssd_ruleset:
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
Should this not be similar to the following snippet from the replicated_ruleset? Does this difference cause a conflict in how CRUSH applies the two rules?
Code:
from replicated_ruleset:
"op": "chooseleaf_firstn",
"num": 2,
"type": "osd"
Is there another issue that I'm not seeing?
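In case it helps, my next step was going to be pulling the CRUSH map and testing the ssd_ruleset offline with crushtool, roughly like this (file names are just placeholders, and I'm assuming size=3 on the pool):
Code:
# grab and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# simulate placements for rule_id 1 (ssd_ruleset) with 3 replicas
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings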
I do hope I've provided sufficient information for this issue; please let me know if output from other commands is needed!
Some details about the hosts and ceph status:
Code:
root@proxmox-b:~# ceph status
  cluster:
    id:     6c26f8f3-0c93-457a-b059-1c1c6f872d98
    health: HEALTH_WARN
            Degraded data redundancy: 1133/3901019 objects degraded (0.029%), 13 pgs degraded, 13 pgs undersized

  services:
    mon: 3 daemons, quorum proxmox-2,proxmox,proxmox-3 (age 4h)
    mgr: proxmox-3(active, since 4h), standbys: proxmox-2, proxmox
    mds: cephfs:3 {0=proxmox=up:active,1=proxmox-3=up:active,2=proxmox-2=up:active}
    osd: 20 osds: 20 up (since 94m), 20 in (since 13h); 115 remapped pgs

  data:
    pools:   4 pools, 480 pgs
    objects: 1.90M objects, 5.3 TiB
    usage:   13 TiB used, 45 TiB / 58 TiB avail
    pgs:     1133/3901019 objects degraded (0.029%)
             10196/3901019 objects misplaced (0.261%)
             352 active+clean
             115 active+clean+remapped
             13  active+undersized+degraded

  io:
    client: 4.0 KiB/s rd, 2.5 MiB/s wr, 3 op/s rd, 47 op/s wr
Code:
root@proxmox-b:~# ceph health detail
HEALTH_WARN Degraded data redundancy: 1133/3901033 objects degraded (0.029%), 13 pgs degraded, 13 pgs undersized
PG_DEGRADED Degraded data redundancy: 1133/3901033 objects degraded (0.029%), 13 pgs degraded, 13 pgs undersized
    pg 5.c is stuck undersized for 5843.927285, current state active+undersized+degraded, last acting [18,12]
    pg 5.21 is stuck undersized for 5843.927123, current state active+undersized+degraded, last acting [17,12]
    pg 5.2a is stuck undersized for 5843.927004, current state active+undersized+degraded, last acting [17,14]
    pg 5.2b is stuck undersized for 5933.650599, current state active+undersized+degraded, last acting [17,12]
    pg 5.31 is stuck undersized for 5933.658167, current state active+undersized+degraded, last acting [18,12]
    pg 5.3c is stuck undersized for 5843.926952, current state active+undersized+degraded, last acting [17,15]
    pg 5.3d is stuck undersized for 5898.647859, current state active+undersized+degraded, last acting [17,13]
    pg 5.3e is stuck undersized for 5894.973519, current state active+undersized+degraded, last acting [14,19]
    pg 5.59 is stuck undersized for 5894.972669, current state active+undersized+degraded, last acting [19,14]
    pg 5.5e is stuck undersized for 5898.640636, current state active+undersized+degraded, last acting [18,12]
    pg 5.60 is stuck undersized for 5843.926617, current state active+undersized+degraded, last acting [17,13]
    pg 5.6f is stuck undersized for 5870.800083, current state active+undersized+degraded, last acting [18,12]
    pg 5.7f is stuck undersized for 5898.647462, current state active+undersized+degraded, last acting [17,12]
Code:
root@proxmox-b:~# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 3.63899 0.70001 3.6 TiB 1.2 TiB 1.2 TiB 105 MiB 2.4 GiB 2.5 TiB 31.95 1.46 76 up
1 hdd 3.63899 0.75000 3.6 TiB 1.0 TiB 1.0 TiB 143 MiB 3.8 GiB 2.6 TiB 28.36 1.29 73 up
2 hdd 3.63899 0.70001 3.6 TiB 1.2 TiB 1.2 TiB 106 MiB 3.8 GiB 2.5 TiB 31.76 1.45 73 up
3 hdd 3.63899 0.75000 3.6 TiB 1.0 TiB 1.0 TiB 79 MiB 2.7 GiB 2.6 TiB 28.08 1.28 85 up
4 hdd 3.63899 0.75000 3.6 TiB 995 GiB 991 GiB 132 MiB 3.7 GiB 2.7 TiB 26.71 1.22 70 up
5 hdd 3.63899 0.75000 3.6 TiB 931 GiB 928 GiB 77 MiB 2.9 GiB 2.7 TiB 24.99 1.14 68 up
17 ssd 1.81940 1.00000 1.8 TiB 7.8 GiB 6.8 GiB 16 KiB 1024 MiB 1.8 TiB 0.42 0.02 101 up
18 ssd 1.81940 1.00000 1.8 TiB 5.4 GiB 4.4 GiB 8 KiB 1024 MiB 1.8 TiB 0.29 0.01 65 up
19 ssd 1.81940 1.00000 1.8 TiB 4.8 GiB 3.8 GiB 16 KiB 1024 MiB 1.8 TiB 0.26 0.01 55 up
6 hdd 3.63899 0.75000 3.6 TiB 1.2 TiB 1.2 TiB 163 MiB 3.8 GiB 2.5 TiB 31.92 1.46 82 up
7 hdd 3.63899 0.75000 3.6 TiB 1017 GiB 1014 GiB 140 MiB 3.7 GiB 2.6 TiB 27.31 1.25 72 up
8 hdd 3.63899 0.70001 3.6 TiB 1.3 TiB 1.3 TiB 110 MiB 3.3 GiB 2.4 TiB 34.52 1.57 83 up
9 hdd 3.63899 0.75000 3.6 TiB 878 GiB 876 GiB 65 MiB 1.7 GiB 2.8 TiB 23.57 1.08 66 up
10 hdd 3.63899 0.75000 3.6 TiB 1.1 TiB 1.1 TiB 88 MiB 3.7 GiB 2.6 TiB 29.50 1.35 75 up
11 hdd 3.63899 0.70001 3.6 TiB 1.1 TiB 1.1 TiB 172 MiB 3.8 GiB 2.5 TiB 31.15 1.42 75 up
12 ssd 1.81940 1.00000 1.8 TiB 3.7 GiB 2.7 GiB 8 KiB 1024 MiB 1.8 TiB 0.20 0.01 41 up
13 ssd 1.81940 1.00000 1.8 TiB 3.1 GiB 2.1 GiB 8 KiB 1024 MiB 1.8 TiB 0.17 0.01 30 up
14 ssd 1.81940 1.00000 1.8 TiB 3.2 GiB 2.2 GiB 8 KiB 1024 MiB 1.8 TiB 0.17 0.01 28 up
15 ssd 1.81940 1.00000 1.8 TiB 2.8 GiB 1.8 GiB 16 KiB 1024 MiB 1.8 TiB 0.15 0.01 26 up
16 ssd 1.81940 1.00000 1.8 TiB 2.8 GiB 1.8 GiB 16 KiB 1024 MiB 1.8 TiB 0.15 0.01 23 up
TOTAL 58 TiB 13 TiB 13 TiB 1.3 GiB 47 GiB 45 TiB 21.92
MIN/MAX VAR: 0.01/1.57 STDDEV: 16.00
Code:
root@proxmox-a:~# ceph osd pool get rbdssd crush_rule
crush_rule: ssd_ruleset
root@proxmox-a:~# ceph osd pool get rbd crush_rule
crush_rule: replicated_ruleset
root@proxmox-a:~# ceph osd crush rule list
replicated_ruleset
ssd_ruleset
Code:
root@proxmox-a:~# ceph osd crush rule dump ssd_ruleset
{
    "rule_id": 1,
    "rule_name": "ssd_ruleset",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 4,
    "steps": [
        {
            "op": "take",
            "item": -8,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
root@proxmox-a:~# ceph osd crush rule dump replicated_ruleset
{
    "rule_id": 0,
    "rule_name": "replicated_ruleset",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 6,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}