[SOLVED] New ceph cluster: Reduced data availability: 1 pg inactive

Oct 13, 2020
Hello Forum,

we have an issue with a new Ceph cluster running on three identical nodes. Everything seems to work fine, but the status in the web console complains:
[Screenshot: Ceph health warning in the web console]
And:
[Screenshot: second warning from the web console]

Code:
ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         87.32867         -   87 TiB   20 TiB   20 TiB  158 MiB   37 GiB   67 TiB  23.23  1.00    -          root default
-7         29.10956         -   29 TiB  6.8 TiB  6.8 TiB   53 MiB   12 GiB   22 TiB  23.23  1.00    -          host delbgpm01
 8  hdd     7.27739   1.00000  7.3 TiB  1.6 TiB  1.6 TiB   13 MiB  2.8 GiB  5.7 TiB  21.76  0.94   65      up  osd.8
 9  hdd     7.27739   1.00000  7.3 TiB  1.8 TiB  1.8 TiB   14 MiB  3.0 GiB  5.5 TiB  24.43  1.05   79      up  osd.9
10  hdd     7.27739   1.00000  7.3 TiB  1.9 TiB  1.9 TiB   14 MiB  3.6 GiB  5.4 TiB  25.87  1.11   76      up  osd.10
11  hdd     7.27739   1.00000  7.3 TiB  1.5 TiB  1.5 TiB   13 MiB  2.6 GiB  5.8 TiB  20.88  0.90   68      up  osd.11
-3         29.10956         -   29 TiB  6.8 TiB  6.8 TiB   53 MiB   12 GiB   22 TiB  23.23  1.00    -          host delbgpm02
 0  hdd     7.27739   1.00000  7.3 TiB  1.9 TiB  1.9 TiB  7.9 MiB  3.4 GiB  5.4 TiB  26.08  1.12   81      up  osd.0
 1  hdd     7.27739   1.00000  7.3 TiB  1.9 TiB  1.9 TiB   22 MiB  3.2 GiB  5.4 TiB  25.56  1.10   83      up  osd.1
 2  hdd     7.27739   1.00000  7.3 TiB  1.3 TiB  1.3 TiB  2.1 MiB  2.5 GiB  6.0 TiB  17.60  0.76   49      up  osd.2
 3  hdd     7.27739   1.00000  7.3 TiB  1.7 TiB  1.7 TiB   21 MiB  3.2 GiB  5.6 TiB  23.70  1.02   75      up  osd.3
-5         29.10956         -   29 TiB  6.8 TiB  6.8 TiB   52 MiB   12 GiB   22 TiB  23.23  1.00    -          host delbgpm03
 4  hdd     7.27739   1.00000  7.3 TiB  1.7 TiB  1.7 TiB   13 MiB  3.1 GiB  5.6 TiB  22.80  0.98   74      up  osd.4
 5  hdd     7.27739   1.00000  7.3 TiB  1.5 TiB  1.5 TiB   12 MiB  2.7 GiB  5.7 TiB  21.27  0.92   65      up  osd.5
 6  hdd     7.27739   1.00000  7.3 TiB  1.5 TiB  1.5 TiB  5.8 MiB  2.8 GiB  5.7 TiB  21.28  0.92   65      up  osd.6
 7  hdd     7.27739   1.00000  7.3 TiB  2.0 TiB  2.0 TiB   21 MiB  3.8 GiB  5.3 TiB  27.59  1.19   84      up  osd.7
                       TOTAL    87 TiB   20 TiB   20 TiB  158 MiB   37 GiB   67 TiB  23.23
MIN/MAX VAR: 0.76/1.19  STDDEV: 2.71

On one of the servers the output of ceph -s is:
Code:
ceph -s
  cluster:
    id:     73703df0-b8ae-4aca-8269-1fb68da2142d
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            108 slow ops, oldest one blocked for 96500 sec, osd.7 has slow ops

  services:
    mon: 3 daemons, quorum delbgpm02,delbgpm03,delbgpm01 (age 105m)
    mgr: delbgpm03(active, since 8d), standbys: delbgpm02, delbgpm01
    mds: cephfs:1 {0=delbgpm02=up:active} 2 up:standby
    osd: 12 osds: 12 up (since 104m), 12 in (since 104m)

  data:
    pools:   4 pools, 289 pgs
    objects: 1.82M objects, 6.8 TiB
    usage:   20 TiB used, 67 TiB / 87 TiB avail
    pgs:     0.346% pgs unknown
             288 active+clean
             1   unknown

  io:
    client:   341 B/s rd, 4.9 MiB/s wr, 0 op/s rd, 32 op/s wr

  progress:
    Rebalancing after osd.5 marked in (8d)
      [............................]
    Rebalancing after osd.1 marked in (8d)
      [............................]
    Rebalancing after osd.7 marked in (8d)
      [............................]
    Rebalancing after osd.6 marked in (8d)
      [............................]
    PG autoscaler decreasing pool 3 PGs from 128 to 32 (8d)
      [............................]
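Besides the inactive PG, the 108 slow ops on osd.7 stand out. For reference, this is what I would look at next on that OSD (only a sketch, assuming the default admin socket and the standard ceph-osd@ systemd units on Proxmox):

Bash:
# Full health detail, including which PG is inactive/unknown
ceph health detail

# Inspect the requests currently blocked on osd.7 via its admin socket
ceph daemon osd.7 dump_ops_in_flight

# If the slow ops never clear on their own, restarting the OSD usually clears them
systemctl restart ceph-osd@7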
ceph pg dump | grep unknown lists this:
Code:
ceph pg dump | grep unknown
dumped all
1.0    0  0  0  0  0  0  0  0  0  0  unknown  2021-02-16T17:05:09.104825+0100  0'0  0:0  []  -1  []  -1  0'0  2021-02-16T17:05:09.104825+0100  0'0  2021-02-16T17:05:09.104825+0100  0
From another post I learned that ceph pg 1.0 mark_unfound_lost delete might help, but:
Code:
ceph pg 1.0 mark_unfound_lost delete
Error ENOENT: i don't have pgid 1.0
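For context, the leading "1" in the PG ID is the pool ID, so it might help to see which pool that is and whether the PG can be queried at all (a sketch only; the pool names are whatever this cluster actually reports):

Bash:
# Map the pool ID prefix of PG 1.x to a pool name
ceph osd lspools

# Pool details (pg_num, autoscale status) for all pools
ceph osd pool ls detail

# Try to query the unknown PG directly; for an unknown PG this may hang or error out
ceph pg 1.0 query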

Can you please advise how to solve this issue?

Thank you and best regards,
Nico
 
Hello,

I managed to clear that error, but now there is a new one:
Error EIO: Module 'devicehealth' has experienced an error and cannot handle commands: [errno 2] RADOS object not found (Failed to operate write op for oid b'HGST_HUS728T8TALE6L4_VGJGNJ0G')
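As a next step I would probably check whether the pool the devicehealth module writes to still exists and whether the mgr left a crash report (a sketch, assuming the pool still has its default name device_health_metrics):

Bash:
# Does the devicehealth pool still exist?
ceph osd pool ls | grep device_health

# Any recorded crashes from mgr modules?
ceph crash ls

# State of the mgr modules (devicehealth is an always-on module)
ceph mgr module ls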

Do you have any idea how to resolve this?
 
Hi @Alwin Antreich,

thank you for the heads-up, but the disk seems to be perfectly fine:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       92
  3 Spin_Up_Time            0x0007   152   152   024    Pre-fail  Always       -       536 (Average 534)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       528
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       37
194 Temperature_Celsius     0x0002   142   142   000    Old_age   Always       -       42 (Min/Max 15/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       528         -

I wanted to enable devicehealth monitoring with:
Bash:
ceph device monitoring on
Error EIO: Module 'devicehealth' has experienced an error and cannot handle commands: [errno 2] RADOS object not found (Failed to operate write op for oid b'HGST_HUS728T8TALE6L4_VGJGNJ0G')

Is there a way to check whether the pool device_health_metrics is working correctly? I suspect it may be the cause of the error message above.
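For what it is worth, this is roughly how I would try to verify it myself (only a sketch; I am assuming the devicehealth module stores one RADOS object per device ID, e.g. the HGST serial from the error above):

Bash:
# Devices Ceph knows about; the oid in the error should match one of these IDs
ceph device ls

# RADOS objects the devicehealth module has written so far
rados -p device_health_metrics ls

# Recorded health metrics for the device named in the error message
ceph device get-health-metrics HGST_HUS728T8TALE6L4_VGJGNJ0G

# If the module itself is stuck, restarting the active mgr (delbgpm03 here) may help
systemctl restart ceph-mgr@delbgpm03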
 
