Having trouble clearing some Ceph warnings: Reduced data availability & Slow ops

May 9, 2024
Hey all,

I'm having trouble clearing some warnings from my ceph cluster.

1.)
HEALTH_WARN: Reduced data availability: 1 pg inactive
pg 1.0 is stuck inactive for 5m, current state unknown, last acting []


2.)
HEALTH_WARN: 2 slow ops, oldest one blocked for 299 sec, daemons [osd.0,osd.1] have slow ops.

___________________________________________


I have restarted the OSDs, monitors, and managers.
I have tried: ceph pg repair 1.0
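
From what I've read, these commands should show more detail about the stuck PG, though with "last acting []" I'm not sure the query will even reach an OSD. Just a sketch of what I'm planning to run next:

Code:
# overall health with per-PG detail
ceph health detail
# list PGs stuck in an inactive state
ceph pg dump_stuck inactive
# where CRUSH thinks pg 1.0 should map
ceph pg map 1.0
# ask the PG's primary OSD directly (may hang while the acting set is empty)
ceph pg 1.0 query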



___________________________________________
Here is ceph status:

Code:
ceph status
  cluster:
    id:     34d69689-567b-4dac-8b75-382b7aa38dbe
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            2 slow ops, oldest one blocked for 400 sec, daemons [osd.0,osd.1] have slow ops.

  services:
    mon: 3 daemons, quorum Lab-VMSvr03,Lab-VMSvr02,Lab-VMSvr01 (age 10m)
    mgr: Lab-VMSvr03(active, since 7m), standbys: Lab-VMSvr01, Lab-VMSvr02
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 6m), 3 in (since 9h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 169 pgs
    objects: 144.68k objects, 564 GiB
    usage:   1.6 TiB used, 9.3 TiB / 11 TiB avail
    pgs:     0.592% pgs unknown
             168 active+clean
             1   unknown

  io:
    client: 2.7 KiB/s rd, 158 KiB/s wr, 0 op/s rd, 14 op/s wr




___________________________________________

Apparently 1.0 is my .mgr pool?

Code:
ceph osd lspools
1 .mgr
3 ceph-vm-disks
14 ceph-files_data
15 ceph-files_metadata
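
If I understand the PG naming right, the number before the dot is the pool ID, so pg 1.0 belongs to pool 1 (.mgr). Two checks I'm planning next to confirm that and to see where CRUSH wants to place it (sketch only, nothing authoritative):

Code:
# pool IDs, replica counts and CRUSH rules per pool
ceph osd pool ls detail
# which OSDs pg 1.0 should map to; an empty up/acting set would point at a CRUSH rule problem
ceph pg map 1.0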
 
I have a similar issue here:

Code:
ceph status
  cluster:
    id:     4c75e468-9438-43e3-bfeb-c8257513dfb9
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            4 slow ops, oldest one blocked for 1725 sec, osd.7 has slow ops

  services:
    mon: 3 daemons, quorum wfln-pve-01,wfln-pve-02,wfln-pve-03 (age 2m)
    mgr: wfln-pve-01(active, since 2m), standbys: wfln-pve-02, wfln-pve-03
    osd: 12 osds: 12 up (since 27m), 12 in (since 3d)

  data:
    pools:   3 pools, 257 pgs
    objects: 2.05k objects, 7.3 GiB
    usage:   27 GiB used, 55 TiB / 55 TiB avail
    pgs:     0.389% pgs unknown
             256 active+clean
             1   unknown

  io:
    client:   4.4 KiB/s wr, 0 op/s rd, 0 op/s wr

Code:
ceph health detail
HEALTH_WARN Reduced data availability: 1 pg inactive; 4 slow ops, oldest one blocked for 1670 sec, osd.7 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 1.0 is stuck inactive for 94s, current state unknown, last acting []
[WRN] SLOW_OPS: 4 slow ops, oldest one blocked for 1670 sec, osd.7 has slow ops

But all my OSDs appear to be OK:

Code:
ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME             STATUS  REWEIGHT  PRI-AFF
 -1         54.58072  root default
 -3         18.19357      host wfln-pve-01
  6    hdd   7.27739          osd.6             up   1.00000  1.00000
  7    hdd   7.27739          osd.7             up   1.00000  1.00000
  0    ssd   1.81940          osd.0             up   1.00000  1.00000
  1    ssd   1.81940          osd.1             up   1.00000  1.00000
 -7         18.19357      host wfln-pve-02
  8    hdd   7.27739          osd.8             up   1.00000  1.00000
 10    hdd   7.27739          osd.10            up   1.00000  1.00000
  2    ssd   1.81940          osd.2             up   1.00000  1.00000
  3    ssd   1.81940          osd.3             up   1.00000  1.00000
-10         18.19357      host wfln-pve-03
  9    hdd   7.27739          osd.9             up   1.00000  1.00000
 11    hdd   7.27739          osd.11            up   1.00000  1.00000
  4    ssd   1.81940          osd.4             up   1.00000  1.00000
  5    ssd   1.81940          osd.5             up   1.00000  1.00000


Code:
ceph osd status
ID  HOST          USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  wfln-pve-01  36.9M  1862G      0        0       0        0   exists,up
 1  wfln-pve-01  36.9M  1862G      0        0       0        0   exists,up
 2  wfln-pve-02  36.9M  1862G      0        0       0        0   exists,up
 3  wfln-pve-02  36.9M  1862G      0        0       0        0   exists,up
 4  wfln-pve-03  36.9M  1862G      0        0       0        0   exists,up
 5  wfln-pve-03  36.9M  1862G      0        0       0        0   exists,up
 6  wfln-pve-01  6016M  7446G      0        0       0        0   exists,up
 7  wfln-pve-01  2960M  7449G      0        0       0        0   exists,up
 8  wfln-pve-02  3836M  7448G      0        0       0        0   exists,up
 9  wfln-pve-03  5084M  7447G      0        0       0        0   exists,up
10  wfln-pve-02  5265M  7446G      0        0       0        0   exists,up
11  wfln-pve-03  4178M  7447G      0        0       0        0   exists,up

I tried restarting osd.7, the monitors, and the managers. After restarting the manager on node 1, the "PG stuck inactive" message disappears for a few seconds, but then comes back.
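
Next I want to see what osd.7 is actually blocked on. From the docs, something along these lines should work (assuming the admin socket is available on the node hosting osd.7; just a sketch):

Code:
# ops currently in flight on osd.7 (run on the node that hosts osd.7)
ceph daemon osd.7 dump_ops_in_flight
# recently completed slow ops
ceph daemon osd.7 dump_historic_slow_ops
# same query routed through the cluster instead of the local admin socket
ceph tell osd.7 dump_ops_in_flight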