pg active+clean+inconsistent, acting

Some time ago my Ceph cluster became unhealthy. Three PGs show up as active+clean+inconsistent:
Code:
pg 1.25 is active+clean+inconsistent, acting [2,3,5]
pg 1.d0 is active+clean+inconsistent, acting [1,5,6]
pg 1.1f6 is active+clean+inconsistent, acting [1,5,4]
I tried:
Code:
# ceph pg 1.25 list_missing
{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 0,
    "num_unfound": 0,
    "objects": [],
    "more": false
}

but I don’t know how to solve my problem.

Thanks for any hint.
 
What is the state of your cluster (ceph health, logs, ...)? Does your OSD 5 still exist and work properly (drive errors)?
 
Yes, the OSDs are all working. At first I thought it was a drive error, but I can't see any errors other than the ones I mentioned.
I have these entries in the log file:
Code:
2018-03-05 09:06:59.807135 osd.1 osd.1 192.168.0.10:6812/14280 7645 : cluster [ERR] deep-scrub 1.1f6 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:head expected clone 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:412 1 missing
2018-03-05 09:06:59.807140 osd.1 osd.1 192.168.0.10:6812/14280 7646 : cluster [INF] deep-scrub 1.1f6 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:head 1 missing clone(s)
2018-03-05 09:08:04.332519 osd.1 osd.1 192.168.0.10:6812/14280 7647 : cluster [ERR] 1.1f6 deep-scrub 1 errors
As I wrote, the status says HEALTH_ERR.
I also tried to repair the object following https://ceph.com/geen-categorie/ceph-manually-repair-object/ but without success. I couldn't find the corresponding object as described.
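For reference, the procedure described there is roughly the following (a sketch assuming a filestore layout; the OSD id and paths are examples, not taken from my cluster):
Code:
# find the pg and its primary OSD
ceph health detail
ceph pg map 1.1f6
# on the node with the primary OSD, locate the object on disk
find /var/lib/ceph/osd/ceph-1/current/1.1f6_head/ -name '*124974b0dc51*'
# stop the OSD, flush its journal, move the bad object file away
systemctl stop ceph-osd@1
ceph-osd -i 1 --flush-journal
# mv <found-file> /root/
systemctl start ceph-osd@1
# finally, let ceph repair the pg
ceph pg repair 1.1f6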
 
Which Ceph version are you running (ceph versions)? The rbd prefix 'rbd_data.124974b0dc51' is shown by 'rbd -p <pool> info <image>'; that tells you which image is affected. My guess is that there is a defective clone lying around.

One possible cause could be this tracker issue, but that's only a hunch for now.
https://tracker.ceph.com/issues/19413
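For illustration, checking a single image would look like this (pool and image name are placeholders):
Code:
rbd -p <pool> info <image> | grep block_name_prefix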
 
Thank you, that is very helpful. Is there a command to look through all the images, or do I have to check them one by one until I find the string?

Code:
Package versions
proxmox-ve: 5.1-32 (running kernel: 4.10.17-4-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
pve-kernel-4.10.17-4-pve: 4.10.17-24
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.0-4
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1
 
Is there a command to look through all the images, or do I have to check them one by one until I find the string?
No, one by one.
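The one-by-one check can at least be scripted, though; a small loop like this would print the prefix of every image in the pool (the pool name is a placeholder):
Code:
for img in $(rbd -p <pool> ls); do
    printf '%s: ' "$img"
    rbd -p <pool> info "$img" | awk '/block_name_prefix/ {print $2}'
done | grep 124974b0dc51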
 
So, I figured out which images are involved. I deleted one of the VMs and expected one error to go away. Unfortunately, the error remains.
 
Are there any snapshots left over? You can also check with 'rados -p <pool> ls' whether there are leftover objects with that prefix. You could also let a deep scrub run again.
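A deep scrub of a single pg can be triggered manually, for example:
Code:
ceph pg deep-scrub 1.1f6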
 
Thank you. The deep scrub didn't help, but I got:
Code:
rados -p pool ls | grep 124974b0dc51
rbd_data.124974b0dc51.0000000000001023
I think that's the issue. How do I get rid of it?
 
Code:
man rados
See the man page; the commands are documented there. But you need to be absolutely sure that this object is not connected to any image or snapshot, otherwise you will destroy a VM/CT or its snapshot.
Code:
rados -p rbd listwatchers <object>
With this you can at least see if the object is in use.
 
Yeah, I read the man pages, but it didn't help. Although the object is listed, I cannot delete it. Also:
Code:
# rados -p pool listwatchers rbd_data.124974b0dc51.0000000000001023
error listing watchers pool/rbd_data.124974b0dc51.0000000000001023: (2) No such file or directory
 
Is it still there? Deletion can take some time. Check whether there are still inconsistent PGs and run a scrub.

Code:
SCRUB AND REPAIR:
   list-inconsistent-pg <pool>      list inconsistent PGs in given pool
   list-inconsistent-obj <pgid>     list inconsistent objects in given pg
   list-inconsistent-snapset <pgid> list inconsistent snapsets in the given pg
 
I got this:
Code:
# rados list-inconsistent-pg pool
["1.25","1.d0","1.1f6"]
# rados list-inconsistent-obj 1.1f6
{"epoch":13476,"inconsistents":[]}
# rados list-inconsistent-snapset 1.1f6
{"epoch":13476,"inconsistents":[]}
I deleted the VM whose data maps to pg 1.1f6, as I mentioned before. Very strange.
 
Seems it's not deep enough:
Code:
2018-03-06 16:46:37.590802 osd.1 osd.1 192.168.0.10:6812/14280 7866 : cluster [ERR] deep-scrub 1.1f6 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:head expected clone 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:412 1 missing
2018-03-06 16:46:37.590807 osd.1 osd.1 192.168.0.10:6812/14280 7867 : cluster [INF] deep-scrub 1.1f6 1:6fa0ed13:::rbd_data.124974b0dc51.0000000000001023:head 1 missing clone(s)
2018-03-06 16:47:43.483406 osd.1 osd.1 192.168.0.10:6812/14280 7868 : cluster [ERR] 1.1f6 deep-scrub 1 errors
 
  • Are the other inconsistent PGs also still there?
  • Did you upgrade the cluster from hammer, or is it a new one?
  • In the OSD logs, is there more shown than those three entries?
  • What message did you get when you tried to delete the object?
One way to be sure all missing objects are removed would be to move all VMs to a different pool and delete the old pool afterwards.
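A rough sketch of that approach (pool names, disk and VM ids are placeholders; make sure the old pool really holds no more images before deleting it):
Code:
# create a new pool and add it as RBD storage in Proxmox
pveceph createpool <newpool>
# move each VM disk to the new storage, e.g. for a scsi0 disk:
qm move_disk <vmid> scsi0 <new-storage>
# once the old pool is empty, delete it
# (requires 'mon allow pool delete = true' in ceph.conf)
ceph osd pool delete <oldpool> <oldpool> --yes-i-really-really-mean-it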
 
  • Yes, as far as I can see.
  • No, it's a new one.
  • No, only these entries.
  • "No such file or directory"
I just created a new pool and started to migrate my VMs. It will take some time, and I will come back with hopefully good news.

Thanks for your help.
 
What message did you get when you tried to delete the object?
What was the command you issued? It could be that the object was not removed from the object map but no longer exists.
 
With something like:
Code:
rados -p pool rm rbd_data.124974b0dc51.0000000000001023
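Should the removal ever go through, I would verify that the object is really gone and let the pg scrub again, for example:
Code:
rados -p pool ls | grep 124974b0dc51
ceph pg deep-scrub 1.1f6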
 
