[SOLVED] ceph problem - Reduced data availability: 15 pgs inactive

ilia987

proxmox 7.1-8

Yesterday I executed a large delete operation on the ceph-fs pool (around 2 TB of data).
The operation finished within a few seconds, apparently successfully (without any noticeable errors).
Then the following problem occurred:
7 out of 32 OSDs went down and out.

Trying to set them in and up did not work (setting them in worked, but they didn't come up).
So I removed those OSDs and recreated them.
One of the OSD removals failed, and I could only resolve it with a full server reboot and a manual OSD removal via the GUI (Disks -> LVM -> More -> Destroy).
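
For reference, the usual commands for this kind of thing (osd.7 is just an example id here, not necessarily one of the affected OSDs) are roughly:

Code:
# check which OSDs are down/out
ceph osd tree

# mark an OSD back in and try to start its daemon
ceph osd in osd.7
systemctl start ceph-osd@7

# check why the daemon refuses to start
journalctl -u ceph-osd@7 -n 100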

All 7 recreated OSDs are up again and Ceph finished rebalancing.

But I still have the following problems.

Cluster:
The nodes are marked gray and no LXC/VM is up, because all of them are stored on ceph-fs.
The quorum is OK.

Ceph:
  1. 1 MDSs report slow metadata IOs
     mds.pve-srv3(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 8317 secs

  2. Reduced data availability: 15 pgs inactive
    pg 2.b is stuck inactive for 2h, current state unknown, last acting []
    pg 2.d is stuck inactive for 2h, current state unknown, last acting []
    pg 2.2b is stuck inactive for 2h, current state unknown, last acting []
    pg 2.2d is stuck inactive for 2h, current state unknown, last acting []
    pg 2.4b is stuck inactive for 2h, current state unknown, last acting []
    pg 2.4d is stuck inactive for 2h, current state unknown, last acting []
    pg 2.6b is stuck inactive for 2h, current state unknown, last acting []
    pg 2.6d is stuck inactive for 2h, current state unknown, last acting []
    pg 8.8d is stuck inactive for 2h, current state unknown, last acting []
    pg 8.185 is stuck inactive for 2h, current state unknown, last acting []
    pg 8.1e3 is stuck inactive for 2h, current state unknown, last acting []
    pg 14.15 is stuck inactive for 2h, current state unknown, last acting []
    pg 14.35 is stuck inactive for 2h, current state unknown, last acting []
    pg 14.55 is stuck inactive for 2h, current state unknown, last acting []
    pg 14.75 is stuck inactive for 2h, current state unknown, last acting []


  3. 296 slow ops, oldest one blocked for 7841 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.
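
(For completeness: if anyone wants to dig into the slow ops, they can be inspected through the affected OSD's admin socket on the node that hosts it, e.g. for osd.14:)

Code:
# run on the node that hosts osd.14
ceph daemon osd.14 dump_ops_in_flight
ceph daemon osd.14 dump_historic_ops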


Some commands I ran in the shell and their output:


Code:
ceph pg dump_stuck


PG_STAT  STATE    UP  UP_PRIMARY  ACTING  ACTING_PRIMARY
8.185    unknown  []          -1      []              -1
2.6d     unknown  []          -1      []              -1
14.35    unknown  []          -1      []              -1
8.8d     unknown  []          -1      []              -1
14.75    unknown  []          -1      []              -1
2.2d     unknown  []          -1      []              -1
2.6b     unknown  []          -1      []              -1
2.4d     unknown  []          -1      []              -1
8.1e3    unknown  []          -1      []              -1
14.15    unknown  []          -1      []              -1
14.55    unknown  []          -1      []              -1
2.4b     unknown  []          -1      []              -1
2.d      unknown  []          -1      []              -1
2.2b     unknown  []          -1      []              -1
2.b      unknown  []          -1      []              -1
ok


Code:
ceph pg 2.b mark_unfound_lost revert
Error ENOENT: i don't have pgid 2.b


ceph pg 2.b mark_unfound_lost delete
Error ENOENT: i don't have pgid 2.b

ceph pg map 2.b
osdmap e136516 pg 2.b (2.b) -> up [27,7,31] acting [27,7,31]
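
(Side note: as far as I understand, "ceph pg map" only prints the CRUSH-calculated mapping from the current osdmap, so it shows an up/acting set even though no OSD actually reports this PG. To check whether one of those OSDs still physically holds data for the PG, something like the following should work, with the OSD id and data path adjusted:)

Code:
systemctl stop ceph-osd@27
# list the PGs present on this OSD's store and look for 2.b
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-27 --op list-pgs | grep '^2\.b$'
systemctl start ceph-osd@27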

Code:
ceph -s

  cluster:
    id:     8ebca482-f985-4e74-9ff8-35e03a1af15e
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 15 pgs inactive
            347 slow ops, oldest one blocked for 8271 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.
 
  services:
    mon: 3 daemons, quorum pve-srv2,pve-srv3,pve-srv4 (age 42m)
    mgr: pve-srv3(active, since 44m), standbys: pve-srv4, pve-srv2
    mds: 2/2 daemons up, 1 standby
    osd: 33 osds: 32 up (since 18m), 32 in (since 37m); 1 remapped pgs
 
  data:
    volumes: 2/2 healthy
    pools:   6 pools, 1393 pgs
    objects: 15.64M objects, 39 TiB
    usage:   116 TiB used, 65 TiB / 182 TiB avail
    pgs:     1.077% pgs unknown
             1378 active+clean
             15   unknown
 
  io:
    client:   8.0 KiB/s wr, 0 op/s rd, 1 op/s wr

Any ideas what I can do?
 
You removed 7 OSDs and recreated them?

The inactive PGs tell you that the Ceph cluster is missing data. It looks like you just deleted it.
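
For next time: before destroying an OSD you can ask Ceph whether that is safe (osd.7 as an example id), for example:

Code:
# will destroying / stopping this OSD make any PGs unavailable?
ceph osd safe-to-destroy osd.7
ceph osd ok-to-stop osd.7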
Deleted and recreated them, because they failed to start.

How can I fix the PG warning? (I have a backup of everything, but I don't know what was deleted/corrupted.)

What can cause the cluster instability? All the nodes appear grayed out.
 
What is the output of "ceph health detail"?
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 15 pgs inactive; 504 slow ops, oldest one blocked for 11906 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.pve-srv3(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 12382 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 15 pgs inactive
pg 2.b is stuck inactive for 3h, current state unknown, last acting []
pg 2.d is stuck inactive for 3h, current state unknown, last acting []
pg 2.2b is stuck inactive for 3h, current state unknown, last acting []
pg 2.2d is stuck inactive for 3h, current state unknown, last acting []
pg 2.4b is stuck inactive for 3h, current state unknown, last acting []
pg 2.4d is stuck inactive for 3h, current state unknown, last acting []
pg 2.6b is stuck inactive for 3h, current state unknown, last acting []
pg 2.6d is stuck inactive for 3h, current state unknown, last acting []
pg 8.8d is stuck inactive for 3h, current state unknown, last acting []
pg 8.185 is stuck inactive for 3h, current state unknown, last acting []
pg 8.1e3 is stuck inactive for 3h, current state unknown, last acting []
pg 14.15 is stuck inactive for 3h, current state unknown, last acting []
pg 14.35 is stuck inactive for 3h, current state unknown, last acting []
pg 14.55 is stuck inactive for 3h, current state unknown, last acting []
pg 14.75 is stuck inactive for 3h, current state unknown, last acting []
[WRN] SLOW_OPS: 504 slow ops, oldest one blocked for 11906 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.
 
You have destroyed too many OSDs at the same time and lost data from this action.

The PGs are nowhere to be found in the cluster and the objects are gone. I do not even know if CephFS is able to recover from such a situation.

Either get professional help from a Ceph consultant or redo the CephFS from scratch.
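
If you only want to get the pools back to an active state and accept that the data in those 15 PGs is gone, it is in principle possible to recreate them as empty PGs (destructive; for the CephFS pools this will almost certainly leave the filesystem needing repair or a restore from backup). Roughly, per missing PG:

Code:
# recreate a lost PG as empty, accepting permanent data loss
ceph osd force-create-pg 2.b --yes-i-really-mean-it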
 
I can do that (I have backups or can regenerate everything, but it will take around a week of work).
But will it fix the cluster? Because everything is marked gray and all LXC/VMs are down.

Maybe upgrading to the latest version (7.1-8 -> 7.2) would help? But it's a minor update, so I don't think it will change anything.
 
