[SOLVED] ceph problem - Reduced data availability: 15 pgs inactive

ilia987

proxmox 7.1-8

Yesterday I executed a large delete operation on the ceph-fs pool (around 2 TB of data).
The operation finished within a few seconds, apparently successfully (without any noticeable errors).
Then the following problem occurred:
7 out of 32 OSDs went down and out.

Trying to set them in and up did not work (setting them in worked, but they didn't come up).
So I removed those OSDs and recreated them.
One of the OSD removals failed, and I could only resolve it with a full server reboot and a manual OSD removal via the GUI (Disks -> LVM -> More -> Destroy).
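
For reference, the usual commands for this kind of thing (osd.7 is just an example id here, not necessarily one of the affected OSDs) are roughly:

Code:
# check which OSDs are down/out
ceph osd tree

# mark an OSD back in and try to start its daemon
ceph osd in osd.7
systemctl start ceph-osd@7

# check why the daemon refuses to start
journalctl -u ceph-osd@7 -n 100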

All 7 recreated OSDs are up again and Ceph finished rebalancing.

But I still have the following problems.

Cluster:
The nodes are marked gray and no LXC/VM is up, because all of them are stored on ceph-fs.
The quorum is OK.

Ceph:
  1. 1 MDSs report slow metadata IOs
     mds.pve-srv3(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 8317 secs

  2. Reduced data availability: 15 pgs inactive
    pg 2.b is stuck inactive for 2h, current state unknown, last acting []
    pg 2.d is stuck inactive for 2h, current state unknown, last acting []
    pg 2.2b is stuck inactive for 2h, current state unknown, last acting []
    pg 2.2d is stuck inactive for 2h, current state unknown, last acting []
    pg 2.4b is stuck inactive for 2h, current state unknown, last acting []
    pg 2.4d is stuck inactive for 2h, current state unknown, last acting []
    pg 2.6b is stuck inactive for 2h, current state unknown, last acting []
    pg 2.6d is stuck inactive for 2h, current state unknown, last acting []
    pg 8.8d is stuck inactive for 2h, current state unknown, last acting []
    pg 8.185 is stuck inactive for 2h, current state unknown, last acting []
    pg 8.1e3 is stuck inactive for 2h, current state unknown, last acting []
    pg 14.15 is stuck inactive for 2h, current state unknown, last acting []
    pg 14.35 is stuck inactive for 2h, current state unknown, last acting []
    pg 14.55 is stuck inactive for 2h, current state unknown, last acting []
    pg 14.75 is stuck inactive for 2h, current state unknown, last acting []


  3. 296 slow ops, oldest one blocked for 7841 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.
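
(For completeness: if anyone wants to dig into the slow ops, they can be inspected through the affected OSD's admin socket on the node that hosts it, e.g. for osd.14:)

Code:
# run on the node that hosts osd.14
ceph daemon osd.14 dump_ops_in_flight
ceph daemon osd.14 dump_historic_ops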


Some commands I ran in the shell and their output:


Code:
ceph pg dump_stuck


PG_STAT  STATE    UP  UP_PRIMARY  ACTING  ACTING_PRIMARY
8.185    unknown  []          -1      []              -1
2.6d     unknown  []          -1      []              -1
14.35    unknown  []          -1      []              -1
8.8d     unknown  []          -1      []              -1
14.75    unknown  []          -1      []              -1
2.2d     unknown  []          -1      []              -1
2.6b     unknown  []          -1      []              -1
2.4d     unknown  []          -1      []              -1
8.1e3    unknown  []          -1      []              -1
14.15    unknown  []          -1      []              -1
14.55    unknown  []          -1      []              -1
2.4b     unknown  []          -1      []              -1
2.d      unknown  []          -1      []              -1
2.2b     unknown  []          -1      []              -1
2.b      unknown  []          -1      []              -1
ok


Code:
ceph pg 2.b mark_unfound_lost revert
Error ENOENT: i don't have pgid 2.b


ceph pg 2.b mark_unfound_lost delete
Error ENOENT: i don't have pgid 2.b

ceph pg map 2.b
osdmap e136516 pg 2.b (2.b) -> up [27,7,31] acting [27,7,31]
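
(Side note: as far as I understand, "ceph pg map" only prints the CRUSH-calculated mapping from the current osdmap, so it shows an up/acting set even though no OSD actually reports this PG. To check whether one of those OSDs still physically holds data for the PG, something like the following should work, with the OSD id and data path adjusted:)

Code:
systemctl stop ceph-osd@27
# list the PGs present on this OSD's store and look for 2.b
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-27 --op list-pgs | grep '^2\.b$'
systemctl start ceph-osd@27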

Code:
ceph -s

  cluster:
    id:     8ebca482-f985-4e74-9ff8-35e03a1af15e
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 15 pgs inactive
            347 slow ops, oldest one blocked for 8271 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.
 
  services:
    mon: 3 daemons, quorum pve-srv2,pve-srv3,pve-srv4 (age 42m)
    mgr: pve-srv3(active, since 44m), standbys: pve-srv4, pve-srv2
    mds: 2/2 daemons up, 1 standby
    osd: 33 osds: 32 up (since 18m), 32 in (since 37m); 1 remapped pgs
 
  data:
    volumes: 2/2 healthy
    pools:   6 pools, 1393 pgs
    objects: 15.64M objects, 39 TiB
    usage:   116 TiB used, 65 TiB / 182 TiB avail
    pgs:     1.077% pgs unknown
             1378 active+clean
             15   unknown
 
  io:
    client:   8.0 KiB/s wr, 0 op/s rd, 1 op/s wr

Any ideas what I can do?
 
You removed 7 OSDs and recreated them?

The inactive PGs tell you that the Ceph cluster is missing data. It looks like you just deleted it.
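
For next time: before destroying an OSD you can ask Ceph whether that is safe (osd.7 as an example id), for example:

Code:
# will destroying / stopping this OSD make any PGs unavailable?
ceph osd safe-to-destroy osd.7
ceph osd ok-to-stop osd.7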
Deleted and recreated them, because they failed to start.

How can I fix the PG warning? (I have a backup of everything, but I don't know what was deleted/corrupted.)

What can cause the cluster instability? All the nodes appear grayed out.
 
What is the output of "ceph health detail"?
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 15 pgs inactive; 504 slow ops, oldest one blocked for 11906 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.pve-srv3(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 12382 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 15 pgs inactive
pg 2.b is stuck inactive for 3h, current state unknown, last acting []
pg 2.d is stuck inactive for 3h, current state unknown, last acting []
pg 2.2b is stuck inactive for 3h, current state unknown, last acting []
pg 2.2d is stuck inactive for 3h, current state unknown, last acting []
pg 2.4b is stuck inactive for 3h, current state unknown, last acting []
pg 2.4d is stuck inactive for 3h, current state unknown, last acting []
pg 2.6b is stuck inactive for 3h, current state unknown, last acting []
pg 2.6d is stuck inactive for 3h, current state unknown, last acting []
pg 8.8d is stuck inactive for 3h, current state unknown, last acting []
pg 8.185 is stuck inactive for 3h, current state unknown, last acting []
pg 8.1e3 is stuck inactive for 3h, current state unknown, last acting []
pg 14.15 is stuck inactive for 3h, current state unknown, last acting []
pg 14.35 is stuck inactive for 3h, current state unknown, last acting []
pg 14.55 is stuck inactive for 3h, current state unknown, last acting []
pg 14.75 is stuck inactive for 3h, current state unknown, last acting []
[WRN] SLOW_OPS: 504 slow ops, oldest one blocked for 11906 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.
 
You have destroyed too many OSDs at the same time and lost data from this action.

The PGs are nowhere to be found in the cluster and the objects are gone. I do not even know if CephFS is able to recover from such a situation.

Either get professional help from a Ceph consultant or redo the CephFS from scratch.
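
If you only want to get the pools back to an active state and accept that the data in those 15 PGs is gone, it is in principle possible to recreate them as empty PGs (destructive; for the CephFS pools this will almost certainly leave the filesystem needing repair or a restore from backup). Roughly, per missing PG:

Code:
# recreate a lost PG as empty, accepting permanent data loss
ceph osd force-create-pg 2.b --yes-i-really-mean-it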
 
I can do that (I have backups or can regenerate everything, but it will take around a week of work).
But will it fix the cluster? Because everything is marked gray and all LXC/VMs are down.

Maybe upgrading to the latest version (7.1-8 -> 7.2) would help? But it's a minor update, so I don't think it will change anything.
 
