There seems to be a problem with pg 1.0, and with my understanding of placement groups, pools, and OSDs.
Yesterday I removed osd.0 in an attempt to get the contents of pg 1.0 moved to another OSD. But today the PG is still stuck inactive, so all my attempt achieved was resetting the reported "stuck inactive" time from 8d to 24h. I therefore moved osd.0 back into the cluster. Currently `ceph daemon osd.0 dump_ops_in_flight` reports 106 ops in the queue. Interestingly, yesterday I had more than 400 PGs, but today there are fewer than 300. I guess the autoscaler did some work.
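For reference, this is roughly what I ran; a sketch from memory, assuming "removed" is best described as marking the OSD out and back in (I did not purge it):
Code:
# mark osd.0 out so its PGs get remapped to other OSDs
ceph osd out 0
# after pg 1.0 stayed inactive, bring it back in
ceph osd in 0
# check the ops currently queued on osd.0
ceph daemon osd.0 dump_ops_in_flight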
As I am new to Ceph, I don't know what's causing these issues.
Code:
root@node1:~# ceph health detail
HEALTH_WARN Reduced data availability: 1 pg inactive; 17 slow ops, oldest one blocked for 87198 sec, osd.0 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 1.0 is stuck inactive for 24h, current state unknown, last acting []
[WRN] SLOW_OPS: 17 slow ops, oldest one blocked for 87198 sec, osd.0 has slow ops

root@node1:~# ceph status
  cluster:
    id:     8bc967d9-48a1-444c-af5a-3e875b867150
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            66 slow ops, oldest one blocked for 87209 sec, osd.0 has slow ops

  services:
    mon: 3 daemons, quorum node1,node3,node2 (age 24h)
    mgr: node1(active, since 24h), standbys: node3, node2
    mds: 1/1 daemons up, 2 standby
    osd: 24 osds: 24 up (since 2h), 23 in (since 2h); 21 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 289 pgs
    objects: 838.09k objects, 3.2 TiB
    usage:   9.8 TiB used, 82 TiB / 92 TiB avail
    pgs:     0.346% pgs unknown
             25477/2514282 objects misplaced (1.013%)
             268 active+clean
             20  active+remapped+backfilling
             1   unknown

  io:
    client:   43 KiB/s rd, 993 KiB/s wr, 2 op/s rd, 86 op/s wr
    recovery: 50 MiB/s, 12 objects/s

root@node1:~# ceph pg map 1.0
osdmap e1044 pg 1.0 (1.0) -> up [0,10,16] acting [0]
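If it helps with diagnosing pg 1.0, I can also post the output of these (just listing the commands here, no output yet):
Code:
# detailed state of the problem PG
ceph pg 1.0 query
# all PGs currently stuck inactive
ceph pg dump_stuck inactive
# per-OSD utilisation and placement in the CRUSH tree
ceph osd df tree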
Maybe someone can shed some light on the gaps in my understanding. Are PGs numbered globally or per pool? Can a PG with the ID 1.0 exist in several pools at once? When defining a pool, I also set its number of PGs; does adding another pool therefore increase the total number of PGs across all OSDs?
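To make the question concrete, this is how I have been looking at pool IDs and PG counts so far (the pool name below is only a placeholder, not one of my real pools):
Code:
# pools with their numeric IDs (the "1" in a PG id like 1.0, as I understand it)
ceph osd lspools
# PG count configured for a single pool (placeholder name)
ceph osd pool get mypool pg_num
# list the PGs that belong to that pool
ceph pg ls-by-pool mypool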
Can the autoscaler, while rescaling the PG count, interfere with a VM? Over the last few days, our Univention Corporate Server VM has shown outages that we didn't have before moving to Ceph.
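In case the autoscaler turns out to be related, I was planning to check it and, if necessary, pause it per pool roughly like this (again a placeholder pool name):
Code:
# current autoscaler state and target PG counts per pool
ceph osd pool autoscale-status
# disable autoscaling for one pool while debugging (placeholder name)
ceph osd pool set mypool pg_autoscale_mode off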