pg active+clean+inconsistent, acting

Some time ago my Ceph cluster became unhealthy. Three PGs show up as active+clean+inconsistent:
Code:
pg 1.25 is active+clean+inconsistent, acting [2,3,5]
pg 1.d0 is active+clean+inconsistent, acting [1,5,6]
pg 1.1f6 is active+clean+inconsistent, acting [1,5,4]
I tried:
Code:
# ceph pg 1.25 list_missing
{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 0,
    "num_unfound": 0,
    "objects": [],
    "more": false
}

but I don’t know how to solve my problem.

Thanks for any hint.
 
What is the state of your cluster (ceph health, logs, ...)? Does your OSD 5 still exist and work properly (drive errors)?
 
Yes, the OSDs are all working. At first I thought it was a drive error, but I can't see any errors other than the ones I mentioned.
I have these entries in the log file:
Code:
2018-03-05 09:06:59.807135 osd.1 osd.1 192.168.0.10:6812/14280 7645 : cluster [ERR] deep-scrub 1.1f6 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:head expected clone 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:412 1 missing
2018-03-05 09:06:59.807140 osd.1 osd.1 192.168.0.10:6812/14280 7646 : cluster [INF] deep-scrub 1.1f6 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:head 1 missing clone(s)
2018-03-05 09:08:04.332519 osd.1 osd.1 192.168.0.10:6812/14280 7647 : cluster [ERR] 1.1f6 deep-scrub 1 errors
As I wrote, the status says HEALTH_ERR.
I also tried to repair the object following https://ceph.com/geen-categorie/ceph-manually-repair-object/ but without success. I couldn't find the corresponding object as described.
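For reference, the procedure described there is roughly the following (a sketch assuming a filestore layout; the OSD id and paths are examples, not taken from my cluster):
Code:
# find the pg and its primary OSD
ceph health detail
ceph pg map 1.1f6
# on the node with the primary OSD, locate the object on disk
find /var/lib/ceph/osd/ceph-1/current/1.1f6_head/ -name '*124974b0dc51*'
# stop the OSD, flush its journal, move the bad object file away
systemctl stop ceph-osd@1
ceph-osd -i 1 --flush-journal
# mv <found-file> /root/
systemctl start ceph-osd@1
# finally, let ceph repair the pg
ceph pg repair 1.1f6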
 
Which Ceph version are you running (ceph versions)? The rbd prefix 'rbd_data.124974b0dc51' is shown by 'rbd -p <pool> info <image>'; that tells you which image is affected. My guess is that there is a defective clone lying around.

One possible cause could be this tracker issue, but that's only a hunch for now.
https://tracker.ceph.com/issues/19413
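For illustration, checking a single image would look like this (pool and image name are placeholders):
Code:
rbd -p <pool> info <image> | grep block_name_prefix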
 
Thank you, that is very helpful. Is there a command to look through all the images, or do I have to check them one by one until I find the string?

Code:
Package versions
proxmox-ve: 5.1-32 (running kernel: 4.10.17-4-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
pve-kernel-4.10.17-4-pve: 4.10.17-24
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.0-4
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1
 
Is there a command to look through all the images, or do I have to check them one by one until I find the string?
No, one by one.
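The one-by-one check can at least be scripted, though; a small loop like this would print the prefix of every image in the pool (the pool name is a placeholder):
Code:
for img in $(rbd -p <pool> ls); do
    printf '%s: ' "$img"
    rbd -p <pool> info "$img" | awk '/block_name_prefix/ {print $2}'
done | grep 124974b0dc51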
 
So, I figured out which images are involved. I deleted one of the VMs and expected one error to go away. Unfortunately, the error remains.
 
Are there any snapshots left over? You can also check with 'rados -p <pool> ls' whether there are leftover objects with that prefix. You could also let a deep scrub run again.
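A deep scrub of a single pg can be triggered manually, for example:
Code:
ceph pg deep-scrub 1.1f6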
 
Thank you. The deep scrub didn't help, but I got:
Code:
rados -p pool ls | grep 124974b0dc51
rbd_data.124974b0dc51.0000000000001023
I think that's the issue. How do I get rid of it?
 
Code:
man rados
See the man page; the commands are documented there. But you need to be absolutely sure that this object is not connected to any image or snapshot, otherwise you will destroy a VM/CT or its snapshot.
Code:
rados -p rbd listwatchers <object>
With this you can at least see if the object is in use.
 
Yeah, I read the man pages, but it didn't help. Although the object is listed, I cannot delete it. Also:
Code:
# rados -p pool listwatchers rbd_data.124974b0dc51.0000000000001023
error listing watchers pool/rbd_data.124974b0dc51.0000000000001023: (2) No such file or directory
 
Is it still there? Deletion can take some time. Check whether there are still inconsistent PGs and run a scrub.

Code:
SCRUB AND REPAIR:
   list-inconsistent-pg <pool>      list inconsistent PGs in given pool
   list-inconsistent-obj <pgid>     list inconsistent objects in given pg
   list-inconsistent-snapset <pgid> list inconsistent snapsets in the given pg
 
I got this:
Code:
# rados list-inconsistent-pg pool
["1.25","1.d0","1.1f6"]
# rados list-inconsistent-obj 1.1f6
{"epoch":13476,"inconsistents":[]}
# rados list-inconsistent-snapset 1.1f6
{"epoch":13476,"inconsistents":[]}
I deleted the VM whose data maps to pg 1.1f6, as I mentioned before. Very strange.
 
Seems it's not deep enough:
Code:
2018-03-06 16:46:37.590802 osd.1 osd.1 192.168.0.10:6812/14280 7866 : cluster [ERR] deep-scrub 1.1f6 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:head expected clone 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:412 1 missing
2018-03-06 16:46:37.590807 osd.1 osd.1 192.168.0.10:6812/14280 7867 : cluster [INF] deep-scrub 1.1f6 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:head 1 missing clone(s)
2018-03-06 16:47:43.483406 osd.1 osd.1 192.168.0.10:6812/14280 7868 : cluster [ERR] 1.1f6 deep-scrub 1 errors
 
  • Are the other inconsistent PGs also still there?
  • Did you upgrade the cluster from hammer, or is it a new one?
  • In the OSD logs, is there more shown than those three entries?
  • What message did you get when you tried to delete the object?
One way to be sure all missing objects are removed would be to move all VMs to a different pool and delete the old pool afterwards.
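A rough sketch of that approach (pool names, disk and VM ids are placeholders; make sure the old pool really holds no more images before deleting it):
Code:
# create a new pool and add it as RBD storage in Proxmox
pveceph createpool <newpool>
# move each VM disk to the new storage, e.g. for a scsi0 disk:
qm move_disk <vmid> scsi0 <new-storage>
# once the old pool is empty, delete it
# (requires 'mon allow pool delete = true' in ceph.conf)
ceph osd pool delete <oldpool> <oldpool> --yes-i-really-really-mean-it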
 
  • Yes, as far as I can see.
  • No, it's a new one.
  • No, only these entries.
  • "No such file or directory"
I just created a new pool and started to migrate my VMs. It will take some time, and I will come back with hopefully good news.

Thanks for your help.
 
What message did you get when you tried to delete the object?
What was the command you issued? It could be that the object was not removed from the object map but no longer exists.
 
With something like:
Code:
rados -p pool rm rbd_data.124974b0dc51.0000000000001023
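Should the removal ever go through, I would verify that the object is really gone and let the pg scrub again, for example:
Code:
rados -p pool ls | grep 124974b0dc51
ceph pg deep-scrub 1.1f6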
 
