[SOLVED] New ceph cluster: Reduced data availability: 1 pg inactive

Oct 13, 2020
Hello Forum,

we have an issue with a new Ceph cluster running on three identical nodes. Everything seems to work fine, but the status in the web console complains:
[Screenshot: Ceph health warning in the web console]
And:
[Screenshot: second warning from the web console]

Code:
ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         87.32867         -   87 TiB   20 TiB   20 TiB  158 MiB   37 GiB   67 TiB  23.23  1.00    -          root default
-7         29.10956         -   29 TiB  6.8 TiB  6.8 TiB   53 MiB   12 GiB   22 TiB  23.23  1.00    -          host delbgpm01
 8  hdd     7.27739   1.00000  7.3 TiB  1.6 TiB  1.6 TiB   13 MiB  2.8 GiB  5.7 TiB  21.76  0.94   65      up  osd.8
 9  hdd     7.27739   1.00000  7.3 TiB  1.8 TiB  1.8 TiB   14 MiB  3.0 GiB  5.5 TiB  24.43  1.05   79      up  osd.9
10  hdd     7.27739   1.00000  7.3 TiB  1.9 TiB  1.9 TiB   14 MiB  3.6 GiB  5.4 TiB  25.87  1.11   76      up  osd.10
11  hdd     7.27739   1.00000  7.3 TiB  1.5 TiB  1.5 TiB   13 MiB  2.6 GiB  5.8 TiB  20.88  0.90   68      up  osd.11
-3         29.10956         -   29 TiB  6.8 TiB  6.8 TiB   53 MiB   12 GiB   22 TiB  23.23  1.00    -          host delbgpm02
 0  hdd     7.27739   1.00000  7.3 TiB  1.9 TiB  1.9 TiB  7.9 MiB  3.4 GiB  5.4 TiB  26.08  1.12   81      up  osd.0
 1  hdd     7.27739   1.00000  7.3 TiB  1.9 TiB  1.9 TiB   22 MiB  3.2 GiB  5.4 TiB  25.56  1.10   83      up  osd.1
 2  hdd     7.27739   1.00000  7.3 TiB  1.3 TiB  1.3 TiB  2.1 MiB  2.5 GiB  6.0 TiB  17.60  0.76   49      up  osd.2
 3  hdd     7.27739   1.00000  7.3 TiB  1.7 TiB  1.7 TiB   21 MiB  3.2 GiB  5.6 TiB  23.70  1.02   75      up  osd.3
-5         29.10956         -   29 TiB  6.8 TiB  6.8 TiB   52 MiB   12 GiB   22 TiB  23.23  1.00    -          host delbgpm03
 4  hdd     7.27739   1.00000  7.3 TiB  1.7 TiB  1.7 TiB   13 MiB  3.1 GiB  5.6 TiB  22.80  0.98   74      up  osd.4
 5  hdd     7.27739   1.00000  7.3 TiB  1.5 TiB  1.5 TiB   12 MiB  2.7 GiB  5.7 TiB  21.27  0.92   65      up  osd.5
 6  hdd     7.27739   1.00000  7.3 TiB  1.5 TiB  1.5 TiB  5.8 MiB  2.8 GiB  5.7 TiB  21.28  0.92   65      up  osd.6
 7  hdd     7.27739   1.00000  7.3 TiB  2.0 TiB  2.0 TiB   21 MiB  3.8 GiB  5.3 TiB  27.59  1.19   84      up  osd.7
                       TOTAL    87 TiB   20 TiB   20 TiB  158 MiB   37 GiB   67 TiB  23.23
MIN/MAX VAR: 0.76/1.19  STDDEV: 2.71

On one of the servers the output of ceph -s is:
Code:
ceph -s
  cluster:
    id:     73703df0-b8ae-4aca-8269-1fb68da2142d
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            108 slow ops, oldest one blocked for 96500 sec, osd.7 has slow ops

  services:
    mon: 3 daemons, quorum delbgpm02,delbgpm03,delbgpm01 (age 105m)
    mgr: delbgpm03(active, since 8d), standbys: delbgpm02, delbgpm01
    mds: cephfs:1 {0=delbgpm02=up:active} 2 up:standby
    osd: 12 osds: 12 up (since 104m), 12 in (since 104m)

  data:
    pools:   4 pools, 289 pgs
    objects: 1.82M objects, 6.8 TiB
    usage:   20 TiB used, 67 TiB / 87 TiB avail
    pgs:     0.346% pgs unknown
             288 active+clean
             1   unknown

  io:
    client:   341 B/s rd, 4.9 MiB/s wr, 0 op/s rd, 32 op/s wr

  progress:
    Rebalancing after osd.5 marked in (8d)
      [............................]
    Rebalancing after osd.1 marked in (8d)
      [............................]
    Rebalancing after osd.7 marked in (8d)
      [............................]
    Rebalancing after osd.6 marked in (8d)
      [............................]
    PG autoscaler decreasing pool 3 PGs from 128 to 32 (8d)
      [............................]
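Besides the inactive PG, the 108 slow ops on osd.7 stand out. For reference, this is what I would look at next on that OSD (only a sketch, assuming the default admin socket and the standard ceph-osd@ systemd units on Proxmox):

Bash:
# Full health detail, including which PG is inactive/unknown
ceph health detail

# Inspect the requests currently blocked on osd.7 via its admin socket
ceph daemon osd.7 dump_ops_in_flight

# If the slow ops never clear on their own, restarting the OSD usually clears them
systemctl restart ceph-osd@7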
ceph pg dump | grep unknown lists this:
Code:
ceph pg dump | grep unknown
dumped all
1.0    0  0  0  0  0  0  0  0  0  0  unknown  2021-02-16T17:05:09.104825+0100  0'0  0:0  []  -1  []  -1  0'0  2021-02-16T17:05:09.104825+0100  0'0  2021-02-16T17:05:09.104825+0100  0
From another post I learned that ceph pg 1.0 mark_unfound_lost delete might help, but:
Code:
ceph pg 1.0 mark_unfound_lost delete
Error ENOENT: i don't have pgid 1.0
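For context, the leading "1" in the PG ID is the pool ID, so it might help to see which pool that is and whether the PG can be queried at all (a sketch only; the pool names are whatever this cluster actually reports):

Bash:
# Map the pool ID prefix of PG 1.x to a pool name
ceph osd lspools

# Pool details (pg_num, autoscale status) for all pools
ceph osd pool ls detail

# Try to query the unknown PG directly; for an unknown PG this may hang or error out
ceph pg 1.0 query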

Can you please advise how to solve this issue?

Thank you and best regards,
Nico
 
Hello,

I managed to clear that error, but now there is a new one:
Error EIO: Module 'devicehealth' has experienced an error and cannot handle commands: [errno 2] RADOS object not found (Failed to operate write op for oid b'HGST_HUS728T8TALE6L4_VGJGNJ0G')
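As a next step I would probably check whether the pool the devicehealth module writes to still exists and whether the mgr left a crash report (a sketch, assuming the pool still has its default name device_health_metrics):

Bash:
# Does the devicehealth pool still exist?
ceph osd pool ls | grep device_health

# Any recorded crashes from mgr modules?
ceph crash ls

# State of the mgr modules (devicehealth is an always-on module)
ceph mgr module ls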

Do you have any idea how to resolve this?
 
Hi @Alwin Antreich,

thank you for the heads-up, but the disk seems to be perfectly fine:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       92
  3 Spin_Up_Time            0x0007   152   152   024    Pre-fail  Always       -       536 (Average 534)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       528
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       37
194 Temperature_Celsius     0x0002   142   142   000    Old_age   Always       -       42 (Min/Max 15/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       528         -

I wanted to enable devicehealth monitoring with:
Bash:
ceph device monitoring on
Error EIO: Module 'devicehealth' has experienced an error and cannot handle commands: [errno 2] RADOS object not found (Failed to operate write op for oid b'HGST_HUS728T8TALE6L4_VGJGNJ0G')

Is there a way to check whether the pool device_health_metrics is working correctly? I suspect it may be the cause of the error message above.
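For what it is worth, this is roughly how I would try to verify it myself (only a sketch; I am assuming the devicehealth module stores one RADOS object per device ID, e.g. the HGST serial from the error above):

Bash:
# Devices Ceph knows about; the oid in the error should match one of these IDs
ceph device ls

# RADOS objects the devicehealth module has written so far
rados -p device_health_metrics ls

# Recorded health metrics for the device named in the error message
ceph device get-health-metrics HGST_HUS728T8TALE6L4_VGJGNJ0G

# If the module itself is stuck, restarting the active mgr (delbgpm03 here) may help
systemctl restart ceph-mgr@delbgpm03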
 
