Crush rule issue?

petemcdonnell

New Member
Oct 23, 2021
Code:
proxmox 6.1-8
ceph 14.2.22

I apologize in advance for the length of this post. I'll try to be brief but thorough in explaining the scenario.

I inherited a Proxmox 6.1 cluster with some unusual configuration. Hosts a and b have a mix of HDDs and SSDs, while host c is a cluster member with no Ceph storage of its own. This cluster has worked for several years, but is reaching replacement age for a variety of reasons.

Yesterday we had an SSD fail in host b and the recovery got stuck because another SSD in host a was over 90% full. The HDDs had next to no utilization, and it took me a while to figure out why. All OSDs were in the same pool, and back in 2019 someone made crush rule changes so that only devices of class SSD were used. I changed the pool's crush rule to 'replicated_ruleset' (an original rule, perhaps?), which only uses devices of class HDD. This change forced ALL data to be moved to the HDDs and the recovery was finally able to complete.
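
For reference, switching the existing pool over to the other rule was just the standard pool setting, something along these lines:

Code:
root@proxmox-a:~# ceph osd pool set rbd crush_rule replicated_ruleset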

I then removed all the SSD OSDs, created a new 'rbdssd' pool, created new SSD OSDs (yes, I likely didn't need to recreate the OSDs) and told that pool to use the customized, existing 'ssd_ruleset'. Today, while trying to clean up the mess, we rebooted the three hosts in a controlled manner: first c, then b and finally a. Now I see there is a recovery that is again stuck. It appears to be related to the SSD devices, and I'm wondering if the cause is an issue with the customized crush rule.
I see it contains one key section that differs from the replicated ruleset:

Code:
from ssd_ruleset:
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"

Should this not be similar to the following snippet from the replicated_ruleset? Does having these sections differ cause a conflict in how crush applies the rules?

Code:
from replicated_ruleset:
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "osd"

Is there another issue that I'm not seeing?

I do hope I've provided sufficient information for this issue; please let me know if output from other commands is needed!

Some details about the hosts and ceph status:

Code:
root@proxmox-b:~# ceph status
  cluster:
    id:     6c26f8f3-0c93-457a-b059-1c1c6f872d98
    health: HEALTH_WARN
            Degraded data redundancy: 1133/3901019 objects degraded (0.029%), 13 pgs degraded, 13 pgs undersized

  services:
    mon: 3 daemons, quorum proxmox-2,proxmox,proxmox-3 (age 4h)
    mgr: proxmox-3(active, since 4h), standbys: proxmox-2, proxmox
    mds: cephfs:3 {0=proxmox=up:active,1=proxmox-3=up:active,2=proxmox-2=up:active}
    osd: 20 osds: 20 up (since 94m), 20 in (since 13h); 115 remapped pgs

  data:
    pools:   4 pools, 480 pgs
    objects: 1.90M objects, 5.3 TiB
    usage:   13 TiB used, 45 TiB / 58 TiB avail
    pgs:     1133/3901019 objects degraded (0.029%)
             10196/3901019 objects misplaced (0.261%)
             352 active+clean
             115 active+clean+remapped
             13  active+undersized+degraded

  io:
    client:   4.0 KiB/s rd, 2.5 MiB/s wr, 3 op/s rd, 47 op/s wr

Code:
root@proxmox-b:~# ceph health detail
HEALTH_WARN Degraded data redundancy: 1133/3901033 objects degraded (0.029%), 13 pgs degraded, 13 pgs undersized
PG_DEGRADED Degraded data redundancy: 1133/3901033 objects degraded (0.029%), 13 pgs degraded, 13 pgs undersized
    pg 5.c is stuck undersized for 5843.927285, current state active+undersized+degraded, last acting [18,12]
    pg 5.21 is stuck undersized for 5843.927123, current state active+undersized+degraded, last acting [17,12]
    pg 5.2a is stuck undersized for 5843.927004, current state active+undersized+degraded, last acting [17,14]
    pg 5.2b is stuck undersized for 5933.650599, current state active+undersized+degraded, last acting [17,12]
    pg 5.31 is stuck undersized for 5933.658167, current state active+undersized+degraded, last acting [18,12]
    pg 5.3c is stuck undersized for 5843.926952, current state active+undersized+degraded, last acting [17,15]
    pg 5.3d is stuck undersized for 5898.647859, current state active+undersized+degraded, last acting [17,13]
    pg 5.3e is stuck undersized for 5894.973519, current state active+undersized+degraded, last acting [14,19]
    pg 5.59 is stuck undersized for 5894.972669, current state active+undersized+degraded, last acting [19,14]
    pg 5.5e is stuck undersized for 5898.640636, current state active+undersized+degraded, last acting [18,12]
    pg 5.60 is stuck undersized for 5843.926617, current state active+undersized+degraded, last acting [17,13]
    pg 5.6f is stuck undersized for 5870.800083, current state active+undersized+degraded, last acting [18,12]
    pg 5.7f is stuck undersized for 5898.647462, current state active+undersized+degraded, last acting [17,12]

Code:
root@proxmox-b:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE  DATA     OMAP    META     AVAIL   %USE  VAR  PGS STATUS
 0   hdd 3.63899  0.70001 3.6 TiB  1.2 TiB  1.2 TiB 105 MiB  2.4 GiB 2.5 TiB 31.95 1.46  76     up
 1   hdd 3.63899  0.75000 3.6 TiB  1.0 TiB  1.0 TiB 143 MiB  3.8 GiB 2.6 TiB 28.36 1.29  73     up
 2   hdd 3.63899  0.70001 3.6 TiB  1.2 TiB  1.2 TiB 106 MiB  3.8 GiB 2.5 TiB 31.76 1.45  73     up
 3   hdd 3.63899  0.75000 3.6 TiB  1.0 TiB  1.0 TiB  79 MiB  2.7 GiB 2.6 TiB 28.08 1.28  85     up
 4   hdd 3.63899  0.75000 3.6 TiB  995 GiB  991 GiB 132 MiB  3.7 GiB 2.7 TiB 26.71 1.22  70     up
 5   hdd 3.63899  0.75000 3.6 TiB  931 GiB  928 GiB  77 MiB  2.9 GiB 2.7 TiB 24.99 1.14  68     up
17   ssd 1.81940  1.00000 1.8 TiB  7.8 GiB  6.8 GiB  16 KiB 1024 MiB 1.8 TiB  0.42 0.02 101     up
18   ssd 1.81940  1.00000 1.8 TiB  5.4 GiB  4.4 GiB   8 KiB 1024 MiB 1.8 TiB  0.29 0.01  65     up
19   ssd 1.81940  1.00000 1.8 TiB  4.8 GiB  3.8 GiB  16 KiB 1024 MiB 1.8 TiB  0.26 0.01  55     up
 6   hdd 3.63899  0.75000 3.6 TiB  1.2 TiB  1.2 TiB 163 MiB  3.8 GiB 2.5 TiB 31.92 1.46  82     up
 7   hdd 3.63899  0.75000 3.6 TiB 1017 GiB 1014 GiB 140 MiB  3.7 GiB 2.6 TiB 27.31 1.25  72     up
 8   hdd 3.63899  0.70001 3.6 TiB  1.3 TiB  1.3 TiB 110 MiB  3.3 GiB 2.4 TiB 34.52 1.57  83     up
 9   hdd 3.63899  0.75000 3.6 TiB  878 GiB  876 GiB  65 MiB  1.7 GiB 2.8 TiB 23.57 1.08  66     up
10   hdd 3.63899  0.75000 3.6 TiB  1.1 TiB  1.1 TiB  88 MiB  3.7 GiB 2.6 TiB 29.50 1.35  75     up
11   hdd 3.63899  0.70001 3.6 TiB  1.1 TiB  1.1 TiB 172 MiB  3.8 GiB 2.5 TiB 31.15 1.42  75     up
12   ssd 1.81940  1.00000 1.8 TiB  3.7 GiB  2.7 GiB   8 KiB 1024 MiB 1.8 TiB  0.20 0.01  41     up
13   ssd 1.81940  1.00000 1.8 TiB  3.1 GiB  2.1 GiB   8 KiB 1024 MiB 1.8 TiB  0.17 0.01  30     up
14   ssd 1.81940  1.00000 1.8 TiB  3.2 GiB  2.2 GiB   8 KiB 1024 MiB 1.8 TiB  0.17 0.01  28     up
15   ssd 1.81940  1.00000 1.8 TiB  2.8 GiB  1.8 GiB  16 KiB 1024 MiB 1.8 TiB  0.15 0.01  26     up
16   ssd 1.81940  1.00000 1.8 TiB  2.8 GiB  1.8 GiB  16 KiB 1024 MiB 1.8 TiB  0.15 0.01  23     up
                    TOTAL  58 TiB   13 TiB   13 TiB 1.3 GiB   47 GiB  45 TiB 21.92
MIN/MAX VAR: 0.01/1.57  STDDEV: 16.00

Code:
root@proxmox-a:~# ceph osd pool get rbdssd crush_rule
crush_rule: ssd_ruleset
root@proxmox-a:~# ceph osd pool get rbd crush_rule
crush_rule: replicated_ruleset
root@proxmox-a:~# ceph osd crush rule list
replicated_ruleset
ssd_ruleset

Code:
root@proxmox-a:~# ceph osd crush rule dump ssd_ruleset
{
    "rule_id": 1,
    "rule_name": "ssd_ruleset",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 4,
    "steps": [
        {
            "op": "take",
            "item": -8,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

root@proxmox-a:~# ceph osd crush rule dump replicated_ruleset
{
    "rule_id": 0,
    "rule_name": "replicated_ruleset",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 6,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}
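
If it helps, I believe both rules can also be tested directly against the crush map with crushtool, to see which OSDs each would pick (rule ids 0 and 1 are taken from the dumps above); roughly like this, though I haven't run it yet:

Code:
root@proxmox-a:~# ceph osd getcrushmap -o /tmp/crushmap
root@proxmox-a:~# crushtool -i /tmp/crushmap --test --show-mappings --rule 0 --num-rep 2 --min-x 0 --max-x 9
root@proxmox-a:~# crushtool -i /tmp/crushmap --test --show-mappings --rule 1 --num-rep 2 --min-x 0 --max-x 9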
 
You still only use 2 nodes for OSDs? If so, this explains why it can't recover [0]:
Code:
For a replicated pool, the primary decision when creating the CRUSH
rule is what the failure domain is going to be.  For example, if a
failure domain of host is selected, then CRUSH will ensure that
each replica of the data is stored on a unique host.
I assume your cluster has the default size 3 min_size 2 config?

[0] https://docs.ceph.com/en/latest/rados/operations/crush-map/#creating-a-rule-for-a-replicated-pool
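
You can check the current settings of the pools with, for example:

Code:
ceph osd pool get rbd size
ceph osd pool get rbd min_size
ceph osd pool ls detail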
 
Thanks so much for your reply, Mira! I was doing more reading about the rack/host/osd, etc. failure domains. If I created a new rule (a copy of ssd_ruleset) like this, would it work? (I know I still have many issues with using Ceph with just two hosts):


Code:
{
    "rule_id": 2,
    "rule_name": "ssd_ruleset2",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 4,
    "steps": [
        {
            "op": "take",
            "item": -8,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}

As I understand it, changing the failure domain to OSD should allow the recovery to proceed?
In the meantime, I destroyed the pool to avoid potential use of it.
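
If I do recreate the pool later, I assume I wouldn't need to hand-edit the crush map at all, since a class-restricted rule with an osd failure domain should be creatable directly; something like:

Code:
ceph osd crush rule create-replicated ssd_ruleset2 default osd ssd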

For the primary pool using the 'replicated_ruleset', the replica size is 2 and min_size is 2. That means I have at least two copies of each PG, even though they may be on the same host, right?
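
If it matters, I could verify where the copies actually land by listing the pool's PGs and comparing their acting OSDs against the tree, e.g.:

Code:
ceph pg ls-by-pool rbd
ceph osd tree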

Thanks
 
Actually, with a size of 2 the ssd_ruleset should work just fine.

Could you dump more info regarding the degraded PGs?
You can use ceph pg <PG> query to get detailed information. Please run the command for all PGs mentioned in the ceph health detail output.
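
Something like this should collect them all in one go:

Code:
for pg in 5.c 5.21 5.2a 5.2b 5.31 5.3c 5.3d 5.3e 5.59 5.5e 5.60 5.6f 5.7f; do
    ceph pg $pg query > /tmp/pg_${pg}_query.json
done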
 
I can't now, as I destroyed the pool. What I saw was just an 'unclean' status. I followed this article, which took me down the troubleshooting path:

Ran: ceph pg dump_stuck unclean (inactive, stale, undersized, etc. produced no output)
I did not have peering errors.
I then wound up at item 5.2.4 from the article - crush map errors.

I detected that someone had been fiddling with the crush rules in 2019, but I can't tell exactly what they changed other than by dumping the rules as shown above.
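
About the only other way I can think of to see the full picture now is to decompile the current crush map and read it directly; as far as I know that would be:

Code:
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
less /tmp/crushmap.txt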

The cluster was running for a couple of years with just the ssd_ruleset in place, mostly without issue. Now we're running with just the replicated_ruleset, which forces use of only the HDD class devices.

What I was trying to do was have the rbd pool use replicated_ruleset while creating a new pool with ssd_ruleset, so that I'd have two different pools, each on a specific class of device.
Are the differences between the rules perhaps why the problem with the stuck PGs came up?
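
For the record, my understanding is that the usual way to end up with one pool per device class is a pair of class-restricted replicated rules (the names below are just examples), roughly:

Code:
# hdd-only and ssd-only rules, both with a host failure domain
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd
# then point each pool at the matching rule
ceph osd pool set rbd crush_rule replicated_hdd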