CEPH: inconsistent PG did not heal at first, then healed ...

Hello all,

A few days ago I was warned about "1 pgs inconsistent; 1 scrub errors" on my Proxmox VE 4.4 cluster:

root@pve2:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 8.2f is active+clean+inconsistent, acting [0,1,2]
1 scrub errors

After investigating, it turned out that PG 8.2f had a read error on OSD.1:
root@pve2:~# rados list-inconsistent-obj 8.2f --format=json-pretty
{
    "epoch": 461,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.4f23c2ae8944a.0000000000000263",
                "nspace": "",
                "locator": "",
                "snap": "head"
            },
            "errors": [
                "read_error"
            ],
            "shards": [
                {
                    "osd": 0,
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x56d22b99",
                    "errors": []
                },
                {
                    "osd": 1,
                    "size": 4194304,
                    "errors": [
                        "read_error"
                    ]
                },
                {
                    "osd": 2,
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x56d22b99",
                    "errors": []
                }
            ]
        }
    ]
}
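For reference, a read_error like this normally also leaves traces outside Ceph, so on the node hosting OSD.1 I would also check whether the disk itself complained (the log path is the Debian/Proxmox default; adjust the OSD id to your setup):

# Ceph OSD log for osd.1
grep -i error /var/log/ceph/ceph-osd.1.log

# kernel log: a failed read on the backing disk usually shows up here too
dmesg | grep -i error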
I tried to repair it twice using "ceph pg repair 8.2f", but it did not work at the time.
I had ordered a new disk to replace OSD.1, but just before starting the replacement I tried the repair command one last time, and this time it worked:

root@pve2:~# ceph pg repair 8.2f
[...]
root@pve2:~# ceph health detail
HEALTH_OK
root@pve2:~# rados list-inconsistent-obj 8.2f --format=json-pretty
{
    "epoch": 461,
    "inconsistents": []
}
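If the inconsistency ever comes back, my understanding is that a manual deep scrub is the way to re-run the consistency check on that PG (standard Ceph commands, nothing Proxmox-specific):

# re-run the consistency check on the PG
ceph pg deep-scrub 8.2f

# watch the cluster log for the scrub result, then re-check
ceph -w
ceph health detail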

What do you think about it? Should I still replace OSD.1 as I had planned while the PG was faulty, even though the repair eventually worked? S.M.A.R.T. did not show any issue on the disk, so I'm a little puzzled.
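To be precise about what I checked: by S.M.A.R.T. I mean something like the following, run against the disk backing OSD.1 (/dev/sdX is a placeholder for the actual device):

# find the device backing OSD.1 (FileStore layout, the Jewel default)
df -h /var/lib/ceph/osd/ceph-1

# full SMART report, then optionally an extended self-test
smartctl -a /dev/sdX
smartctl -t long /dev/sdX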

Bonus question: what is the correct procedure to replace a failing but still UP/IN OSD with Ceph Jewel?
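Is it roughly the sequence below? This is what I pieced together from the Ceph docs for Jewel, so please correct me if a step is wrong or missing (/dev/sdX again stands for the replacement disk):

# drain the OSD while the old disk can still serve reads
ceph osd out 1

# wait until recovery finishes and the cluster is HEALTH_OK again
ceph -s

# stop the daemon and remove the OSD from CRUSH, auth and the OSD map
systemctl stop ceph-osd@1
ceph osd crush remove osd.1
ceph auth del osd.1
ceph osd rm 1

# after physically swapping the disk, create the replacement OSD the Proxmox way
pveceph createosd /dev/sdX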

Thanks a lot in advance!

Olivier
