CEPH: inconsistent PG did not heal at first, then healed ...

Belokan
Hello all,

A few days ago I was warned about "1 pgs inconsistent; 1 scrub errors" on my Proxmox VE 4.4 cluster:

Code:
    root@pve2:~# ceph health detail
    HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
    pg 8.2f is active+clean+inconsistent, acting [0,1,2]
    1 scrub errors

After investigation, it turned out that PG 8.2f had a read error on OSD.1:
Code:
    root@pve2:~# rados list-inconsistent-obj 8.2f --format=json-pretty
    {
        "epoch": 461,
        "inconsistents": [
            {
                "object": {
                    "name": "rbd_data.4f23c2ae8944a.0000000000000263",
                    "nspace": "",
                    "locator": "",
                    "snap": "head"
                },
                "errors": [
                    "read_error"
                ],
                "shards": [
                    {
                        "osd": 0,
                        "size": 4194304,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0x56d22b99",
                        "errors": []
                    },
                    {
                        "osd": 1,
                        "size": 4194304,
                        "errors": [
                            "read_error"
                        ]
                    },
                    {
                        "osd": 2,
                        "size": 4194304,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0x56d22b99",
                        "errors": []
                    }
                ]
            }
        ]
    }
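
Since only osd.1 reports a read_error while the two other replicas agree on size and digests, I assumed the problem was on osd.1's disk. For anyone hitting the same thing, I think checks along these lines help to narrow it down (the log path below is the Ceph default; correct me if there is a better way):

Code:
    # find out which host (and CRUSH location) osd.1 lives on
    ceph osd find 1

    # on that node, look for the read error in the OSD log (default log location)
    grep -i error /var/log/ceph/ceph-osd.1.log | tail -n 50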
I tried to repair it (twice) with "ceph pg repair 8.2f", but it did not work at the time.
I had ordered a new disk to replace OSD.1, but just before starting the replacement I tried the repair command one last time and it worked:

Code:
    root@pve2:~# ceph pg repair 8.2f
    [...]
    root@pve2:~# ceph health detail
    HEALTH_OK
    root@pve2:~# rados list-inconsistent-obj 8.2f --format=json-pretty
    {
        "epoch": 461,
        "inconsistents": []
    }
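
As far as I understand it (please correct me if I'm wrong), on a replicated pool "ceph pg repair" rewrites the bad shard from an authoritative copy, so a fresh deep scrub of the PG should show whether osd.1 can actually read the object again:

Code:
    # force a new deep scrub on the repaired PG and watch the result
    ceph pg deep-scrub 8.2f
    ceph -w                  # wait for the "deep-scrub ok" line for pg 8.2f
    ceph health detail       # should stay HEALTH_OK if the read error is really gone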

What do you think about it? Should I still replace OSD.1, as I had planned when the PG was faulty? S.M.A.R.T. did not show any issue on the disk, so I'm a little puzzled.
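
The only disk-level checks I know of, besides the S.M.A.R.T. overview, are roughly these (/dev/sdX is a placeholder for osd.1's device, so treat this as a sketch rather than a recipe):

Code:
    # kernel messages: a real medium error usually leaves a trace here
    dmesg | grep -iE 'medium error|i/o error'

    # SMART details: pending / reallocated / uncorrectable sectors
    smartctl -a /dev/sdX | grep -iE 'pending|realloc|uncorrect'

    # long self-test (runs in the background, check later with "smartctl -l selftest /dev/sdX")
    smartctl -t long /dev/sdX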

Bonus question: what is the correct procedure to replace a failing but still UP/IN OSD with Ceph Jewel?
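
For the record, here is the sequence I had in mind, pieced together from the Jewel docs and the Proxmox wiki (please tell me if a step is wrong or missing; 1 is the OSD id and /dev/sdX the new disk):

Code:
    # take the OSD out and wait for recovery to finish (ceph -s back to HEALTH_OK)
    ceph osd out 1

    # once recovery is done, stop the daemon and remove the OSD from the cluster
    systemctl stop ceph-osd@1
    ceph osd crush remove osd.1
    ceph auth del osd.1
    ceph osd rm 1

    # swap the physical disk, then create the replacement OSD
    # (on Proxmox this should also be possible from the GUI)
    pveceph createosd /dev/sdX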

Thanks a lot in advance !

Olivier
 