Ceph - Reduced data availability: 3 pgs inactive

mlanner

A while back I had an event that caused my Ceph cluster to crash. By design I had backups of everything that mattered, but I wanted to see if I could fix the cluster and maybe bring back some VMs that haven't started since the crash. I had a lot of `pgs inactive` and managed to get that down to only a few, but I can't get rid of the last 3 inactive/unknown pgs.

My current Ceph status and health details look as follows:

`ceph -s`

Code:
  cluster:
    id:     a16d5e4e-3c99-4a59-ad21-b57cc18df33f
    health: HEALTH_WARN
            Reduced data availability: 3 pgs inactive
            14 slow ops, oldest one blocked for 1483 sec, daemons [osd.22,osd.4,osd.9] have slow ops.

  services:
    mon: 3 daemons, quorum vh01,vh03,vh02 (age 104m)
    mgr: vh01(active, since 2h), standbys: vh03, vh02
    osd: 29 osds: 29 up (since 2h), 29 in (since 3h)

  data:
    pools:   2 pools, 513 pgs
    objects: 366.19k objects, 1.1 TiB
    usage:   3.5 TiB used, 6.4 TiB / 9.9 TiB avail
    pgs:     0.585% pgs unknown
             510 active+clean
             3   unknown

`ceph health detail`

Code:
HEALTH_WARN Reduced data availability: 3 pgs inactive; 14 slow ops, oldest one blocked for 1578 sec, daemons [osd.22,osd.4,osd.9] have slow ops.
[WRN] PG_AVAILABILITY: Reduced data availability: 3 pgs inactive
    pg 2.69 is stuck inactive for 2h, current state unknown, last acting []
    pg 2.f2 is stuck inactive for 2h, current state unknown, last acting []
    pg 2.13e is stuck inactive for 2h, current state unknown, last acting []
[WRN] SLOW_OPS: 14 slow ops, oldest one blocked for 1578 sec, daemons [osd.22,osd.4,osd.9] have slow ops.

`ceph pg dump_stuck inactive`

Code:
PG_STAT  STATE    UP  UP_PRIMARY  ACTING  ACTING_PRIMARY
2.f2     unknown  []          -1      []              -1
2.69     unknown  []          -1      []              -1
2.13e    unknown  []          -1      []              -1
ok

I've tried running a bunch of repair commands, but none seem to make any progress.
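
For reference, what I ran was along these lines (reconstructing from memory, using the pg IDs from the health detail above):

Code:
# tell the primary OSD of each stuck pg to repair it
ceph pg repair 2.69
ceph pg repair 2.f2
ceph pg repair 2.13e

# and tried forcing deep scrubs as well
ceph pg deep-scrub 2.69
ceph pg deep-scrub 2.f2
ceph pg deep-scrub 2.13e

I suspect these can't do anything here, since the pgs have no acting primary for the commands to reach.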

Does anyone have any ideas for what I can try? Thanks in advance.
 
Thanks @gurubert. I did look at that documentation, but wanted to see if anyone here had some additional ideas and suggestions. I'll see if deleting those PGs will somehow magically fix the remaining problems. But I'm not too hopeful. It looks as if the entire Ceph installation has gone bad. I can only partially read from it. So, I think at this time my best bet might be to simply remove it all and start from scratch.
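
If I do try clearing them first, my understanding from the troubleshooting docs is that it's something like this (destructive: each pg is recreated empty, so anything those pgs held is gone for good; that seems acceptable here since I have backups):

Code:
ceph osd force-create-pg 2.69 --yes-i-really-mean-it
ceph osd force-create-pg 2.f2 --yes-i-really-mean-it
ceph osd force-create-pg 2.13e --yes-i-really-mean-it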
 
gurubert:

The Ceph documentation has a dedicated page for pg troubleshooting: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/
This is also a good resource: https://ceph.io/geen-categorie/ceph-manually-repair-object/

Generally speaking, your inactive pgs have no shards on any existing OSD, which is why `last acting` is empty and no primary is reported. It would be useful to know what the "event" was, and whether this is an expected consequence of it. If it is, follow the instructions in the first link to clear those pgs, and delete any objects that were associated with them. Having said that, are you sure you are out of the woods? Those slow ops are concerning; see the sketch below for a way to dig into them.
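
To confirm those pgs really map to nothing, and to see what the slow ops are actually blocked on, you could try something along these lines (note: the `ceph daemon` commands have to run on the node that hosts each OSD, and the systemd unit names assume a standard ceph-osd@<id> deployment):

Code:
# where does CRUSH currently map one of the dead pgs?
ceph pg map 2.69

# inspect in-flight and recent ops on one of the OSDs reporting slow ops
ceph daemon osd.22 dump_ops_in_flight
ceph daemon osd.22 dump_historic_ops

# restarting the affected OSDs often clears ops stuck behind dead pgs
systemctl restart ceph-osd@22 ceph-osd@4 ceph-osd@9

If the slow ops come right back after a restart, that points to a problem beyond the three lost pgs.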