Ceph HEALTH_ERR PG_DAMAGED: Possible data damage

dlee

Hi, after a recent upgrade to Proxmox 7.2-4, I observed the following under Datacenter > Ceph > Health:


# ceph health detail
HEALTH_ERR 4/1340831 objects unfound (0.000%); Possible data damage: 4 pgs recovery_unfound; Degraded data redundancy: 12/4022493 objects degraded (0.000%), 4 pgs degraded; 4 pgs not deep-scrubbed in time; 4 pgs not scrubbed in time
[WRN] OBJECT_UNFOUND: 4/1340831 objects unfound (0.000%)
pg 5.12a has 1 unfound objects
pg 5.18b has 1 unfound objects
pg 5.1b2 has 1 unfound objects
pg 5.1e1 has 1 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 4 pgs recovery_unfound
pg 5.12a is active+recovery_unfound+degraded, acting [9,27,47], 1 unfound
pg 5.18b is active+recovery_unfound+degraded, acting [9,47,61], 1 unfound
pg 5.1b2 is active+recovery_unfound+degraded, acting [41,59,9], 1 unfound
pg 5.1e1 is active+recovery_unfound+degraded, acting [45,31,59], 1 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 12/4022493 objects degraded (0.000%), 4 pgs degraded
pg 5.12a is active+recovery_unfound+degraded, acting [9,27,47], 1 unfound
pg 5.18b is active+recovery_unfound+degraded, acting [9,47,61], 1 unfound
pg 5.1b2 is active+recovery_unfound+degraded, acting [41,59,9], 1 unfound
pg 5.1e1 is active+recovery_unfound+degraded, acting [45,31,59], 1 unfound
[WRN] PG_NOT_DEEP_SCRUBBED: 4 pgs not deep-scrubbed in time
pg 5.12a not deep-scrubbed since 2022-04-15T02:41:18.066401+0800
pg 5.18b not deep-scrubbed since 2022-04-15T13:00:11.502126+0800
pg 5.1b2 not deep-scrubbed since 2022-04-17T21:35:10.782837+0800
pg 5.1e1 not deep-scrubbed since 2022-04-14T17:39:09.773056+0800
[WRN] PG_NOT_SCRUBBED: 4 pgs not scrubbed in time
pg 5.12a not scrubbed since 2022-04-18T13:13:00.401312+0800
pg 5.18b not scrubbed since 2022-04-17T23:09:40.972708+0800
pg 5.1b2 not scrubbed since 2022-04-17T21:35:10.782837+0800
pg 5.1e1 not scrubbed since 2022-04-18T13:18:44.533226+0800

The nodes and VMs all seem to be functioning well. Is there any cause for concern?
How can I go about troubleshooting and clearing the error, if possible?
Any pointers are appreciated.
Thanks!
 
You should start a deep-scrub on these placement groups first.

ceph pg deep-scrub $pgid

After that you should try to repair the PGs

ceph pg repair $pgid

Maybe Ceph is able to find the missing objects again. But it could be that they are lost.

More info can be found in the Ceph documentation: https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#unfound-objects
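The two steps above can be sketched as a pair of small shell functions that loop over the four PGs reported by `ceph health detail`. This is only a sketch: the function names and the `CEPH` override variable are hypothetical helpers, not part of Ceph. Note that `ceph pg deep-scrub` only queues the scrub, so the repair should be issued once the scrub has actually finished.

```shell
#!/bin/sh
# PG IDs copied from the `ceph health detail` output in this thread.
PGIDS="${PGIDS:-5.12a 5.18b 5.1b2 5.1e1}"

# CEPH is a hypothetical override for dry runs; it defaults to the real CLI.
CEPH="${CEPH:-ceph}"

# Queue a deep scrub for every affected PG (the command returns immediately).
deep_scrub_all() {
    for pgid in $PGIDS; do
        $CEPH pg deep-scrub "$pgid"
    done
}

# Ask Ceph to repair every affected PG; run this after the scrubs have completed.
repair_all() {
    for pgid in $PGIDS; do
        $CEPH pg repair "$pgid"
    done
}
```

With `CEPH="echo ceph"` set beforehand, both functions only print the commands they would run, which makes it easy to check the PG list before touching the cluster. After sourcing the file, call `deep_scrub_all`, wait for the scrubs to finish, then call `repair_all`.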
Thanks.

I did the above and then went further, inspecting the results with
ceph pg $pgid list_unfound
and
ceph pg $pgid query
The query output showed a recovery state in which the objects were reported as "Already found".
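For anyone repeating this check across several PGs, the inspection could be wrapped roughly as follows. This is a sketch under assumptions: `inspect_unfound` and the `CEPH` override are hypothetical names, and the `grep` simply pulls out the `might_have_unfound` section of the JSON that `ceph pg $pgid query` prints, without parsing it properly.

```shell
#!/bin/sh
# PG IDs copied from the `ceph health detail` output earlier in the thread.
PGIDS="${PGIDS:-5.12a 5.18b 5.1b2 5.1e1}"

# CEPH is a hypothetical override for dry runs; it defaults to the real CLI.
CEPH="${CEPH:-ceph}"

# Print the unfound-object list and the might_have_unfound section for each PG.
inspect_unfound() {
    for pgid in $PGIDS; do
        echo "== pg $pgid: list_unfound =="
        $CEPH pg "$pgid" list_unfound

        echo "== pg $pgid: might_have_unfound (from query) =="
        # grep exits non-zero when the section is absent; report that instead.
        $CEPH pg "$pgid" query | grep -A 8 '"might_have_unfound"' \
            || echo "(no might_have_unfound section in query output)"
    done
}
```

The `-A 8` context length is an arbitrary choice; increase it if the section lists more OSDs than fit in eight lines.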
 
