[SOLVED] Reduced data availability: 88 pgs inactive, 88 pgs peering

neils

Member
Mar 21, 2021
Hi, I have a 3-node cluster with Ceph. After updating all nodes one by one, I can see that Ceph is not able to peer all PGs.

root@pve01:~# ceph health detail
HEALTH_WARN mons are allowing insecure global_id reclaim; Reduced data availability: 88 pgs inactive, 88 pgs peering; 29 slow ops, oldest one blocked for 5119 sec, daemons [osd.0,osd.1,osd.4,osd.5,mon.pve01] have slow ops.
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
mon.pve01 has auth_allow_insecure_global_id_reclaim set to true
mon.pve02 has auth_allow_insecure_global_id_reclaim set to true
mon.pve03 has auth_allow_insecure_global_id_reclaim set to true
[WRN] PG_AVAILABILITY: Reduced data availability: 88 pgs inactive, 88 pgs peering
pg 5.0 is stuck peering for 86m, current state peering, last acting [4,1,2]
pg 5.3 is stuck peering for 107m, current state peering, last acting [0,3,5]
pg 5.4 is stuck peering for 86m, current state peering, last acting [4,1,3]
pg 5.5 is stuck peering for 107m, current state peering, last acting [0,3,4]
pg 5.6 is stuck inactive for 106m, current state peering, last acting [1,2,4]
pg 5.12 is stuck peering for 86m, current state peering, last acting [5,2,1]
pg 5.13 is stuck peering for 86m, current state peering, last acting [4,0,2]
pg 5.14 is stuck peering for 107m, current state peering, last acting [0,4,3]
pg 5.17 is stuck peering for 107m, current state peering, last acting [0,5,3]
pg 5.19 is stuck peering for 107m, current state peering, last acting [1,3,5]
pg 5.1a is stuck peering for 86m, current state peering, last acting [4,2,1]
pg 5.1d is stuck peering for 107m, current state peering, last acting [1,2,4]
pg 5.1e is stuck peering for 107m, current state peering, last acting [0,3,4]
pg 5.20 is stuck peering for 86m, current state peering, last acting [4,2,1]
pg 5.4a is stuck peering for 107m, current state peering, last acting [0,4,2]
pg 5.4e is stuck peering for 86m, current state peering, last acting [5,0,2]
pg 5.50 is stuck peering for 86m, current state peering, last acting [5,2,0]
pg 5.51 is stuck peering for 107m, current state peering, last acting [1,3,4]
pg 5.52 is stuck peering for 86m, current state peering, last acting [5,1,2]
pg 5.57 is stuck peering for 86m, current state peering, last acting [4,2,0]
pg 5.58 is stuck peering for 86m, current state peering, last acting [4,3,1]
pg 5.59 is stuck peering for 107m, current state peering, last acting [1,5,2]
pg 5.5a is stuck peering for 86m, current state peering, last acting [5,0,3]
pg 5.5b is stuck peering for 86m, current state peering, last acting [5,2,1]
pg 5.5c is stuck peering for 86m, current state peering, last acting [5,2,0]
pg 5.5d is stuck peering for 107m, current state peering, last acting [0,5,2]
pg 5.62 is stuck peering for 86m, current state peering, last acting [5,1,3]
pg 5.63 is stuck peering for 107m, current state peering, last acting [0,4,2]
pg 5.65 is stuck peering for 107m, current state peering, last acting [1,4,3]
pg 5.66 is stuck peering for 86m, current state peering, last acting [4,2,0]
pg 5.67 is stuck peering for 107m, current state peering, last acting [0,5,3]
pg 5.6a is stuck peering for 107m, current state peering, last acting [1,2,4]
pg 5.6b is stuck peering for 107m, current state peering, last acting [1,5,2]
pg 5.6c is stuck peering for 107m, current state peering, last acting [1,2,5]
pg 5.6e is stuck peering for 107m, current state peering, last acting [0,3,5]
pg 5.70 is stuck peering for 86m, current state peering, last acting [5,1,3]
pg 5.71 is stuck peering for 107m, current state peering, last acting [0,2,5]
pg 5.72 is stuck peering for 107m, current state peering, last acting [0,3,5]
pg 5.73 is stuck peering for 86m, current state peering, last acting [5,3,0]
pg 5.74 is stuck peering for 86m, current state peering, last acting [4,0,2]
pg 5.75 is stuck peering for 86m, current state peering, last acting [5,3,1]
pg 5.76 is stuck peering for 86m, current state peering, last acting [4,1,3]
pg 5.77 is stuck peering for 107m, current state peering, last acting [0,2,5]
pg 5.78 is stuck peering for 107m, current state peering, last acting [1,4,3]
pg 5.79 is stuck peering for 86m, current state peering, last acting [4,1,2]
pg 5.7a is stuck peering for 107m, current state peering, last acting [0,2,4]
pg 5.7b is stuck peering for 86m, current state peering, last acting [5,0,2]
pg 5.7c is stuck peering for 107m, current state peering, last acting [1,4,3]
[WRN] SLOW_OPS: 29 slow ops, oldest one blocked for 5119 sec, daemons [osd.0,osd.1,osd.4,osd.5,mon.pve01] have slow ops.
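
The same information can also be pulled in a more compact form with the stuck-PG commands (the state names below are the standard ones that ceph pg dump_stuck accepts), e.g.:

# list PGs stuck in an inactive state (covers the peering ones above)
ceph pg dump_stuck inactive
# or filter the full PG listing for the peering state
ceph pg dump pgs_brief | grep peering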

I tried several things, like taking OSDs out and back in, rebooting nodes, and setting --osd_find_best_info_ignore_history_les=1, but nothing worked.
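
For completeness, what I ran was roughly along these lines (OSD id 4 is just an example here):

# take an OSD out of the data distribution and bring it back in
ceph osd out 4
ceph osd in 4

# restart the OSD daemon on its node (Proxmox uses systemd units)
systemctl restart ceph-osd@4

# inject the peering workaround flag into a running OSD
ceph tell osd.4 injectargs '--osd_find_best_info_ignore_history_les=1'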

Can someone point me in the right direction?

Thanks!
 
E.g. querying pg 5.0 gives the following output.

root@pve01:~# ceph pg 5.0 query
{
"snap_trimq": "[]",
"snap_trimq_len": 0,
"state": "peering",
"epoch": 2471,
"up": [
4,
1,
2
],
"acting": [
4,
1,
2
],
"info": {
"pgid": "5.0",
"last_update": "2257'1060397",
"last_complete": "2257'1060397",
"log_tail": "2218'1055299",
"last_user_version": 1168516,
"last_backfill": "MAX",
"purged_snaps": [],
"history": {
"epoch_created": 86,
"epoch_pool_created": 86,
"last_epoch_started": 2461,
"last_interval_started": 2459,
"last_epoch_clean": 2239,
"last_interval_clean": 2238,
"last_epoch_split": 1413,
"last_epoch_marked_full": 0,
"same_up_since": 2463,
"same_interval_since": 2463,
"same_primary_since": 2431,
"last_scrub": "2218'1054903",
"last_scrub_stamp": "2021-05-03T11:59:13.491721+0200",
"last_deep_scrub": "2200'967752",
"last_deep_scrub_stamp": "2021-04-29T19:57:52.275685+0200",
"last_clean_scrub_stamp": "2021-05-03T11:59:13.491721+0200",
"prior_readable_until_ub": 0
},
"stats": {
"version": "2257'1060397",
"reported_seq": "2156280",
"reported_epoch": "2471",
"state": "peering",
"last_fresh": "2021-05-03T23:07:16.914347+0200",
"last_change": "2021-05-03T22:24:27.655941+0200",
"last_active": "2021-05-03T22:24:27.655470+0200",
"last_peered": "2021-05-03T22:24:13.489229+0200",
"last_clean": "2021-05-03T19:55:21.415761+0200",
"last_became_active": "2021-05-03T22:24:12.572505+0200",
"last_became_peered": "2021-05-03T22:24:12.572505+0200",
"last_unstale": "2021-05-03T23:07:16.914347+0200",
"last_undegraded": "2021-05-03T23:07:16.914347+0200",
"last_fullsized": "2021-05-03T23:07:16.914347+0200",
"mapping_epoch": 2463,
"log_start": "2218'1055299",
"ondisk_log_start": "2218'1055299",
"created": 86,
"last_epoch_clean": 2239,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "2218'1054903",
"last_scrub_stamp": "2021-05-03T11:59:13.491721+0200",
"last_deep_scrub": "2200'967752",
"last_deep_scrub_stamp": "2021-04-29T19:57:52.275685+0200",
"last_clean_scrub_stamp": "2021-05-03T11:59:13.491721+0200",
"log_size": 5098,
"ondisk_log_size": 5098,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 7673537718,
"num_objects": 2525,
"num_object_clones": 871,
"num_object_copies": 7575,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 2525,
"num_whiteouts": 45,
"num_read": 649072,
"num_read_kb": 60498790,
"num_write": 703069,
"num_write_kb": 21124370,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 59,
"num_bytes_recovered": 37598208,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
"num_omap_bytes": 0,
"num_omap_keys": 0,
"num_objects_repaired": 0
},
"up": [
4,
1,
2
],
"acting": [
4,
1,
2
],
"avail_no_missing": [],
"object_location_counts": [],
"blocked_by": [
0,
1
],
"up_primary": 4,
"acting_primary": 4,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 2461,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
"peer_info": [
{
"peer": "2",
"pgid": "5.0",
"last_update": "2257'1060397",
"last_complete": "2257'1060397",
"log_tail": "2218'1055299",
"last_user_version": 1168516,
"last_backfill": "MAX",
"purged_snaps": [],
"history": {
"epoch_created": 86,
"epoch_pool_created": 86,
"last_epoch_started": 2461,
"last_interval_started": 2459,
"last_epoch_clean": 2239,
"last_interval_clean": 2238,
"last_epoch_split": 1413,
"last_epoch_marked_full": 0,
"same_up_since": 2463,
"same_interval_since": 2463,
"same_primary_since": 2431,
"last_scrub": "2218'1054903",
"last_scrub_stamp": "2021-05-03T11:59:13.491721+0200",
"last_deep_scrub": "2200'967752",
"last_deep_scrub_stamp": "2021-04-29T19:57:52.275685+0200",
"last_clean_scrub_stamp": "2021-05-03T11:59:13.491721+0200",
"prior_readable_until_ub": 8.9221200189999994
},
"stats": {
"version": "2257'1060397",
"reported_seq": "2156253",
"reported_epoch": "2428",
"state": "undersized+degraded+peered",
"last_fresh": "2021-05-03T22:04:23.661406+0200",
"last_change": "2021-05-03T22:04:23.661223+0200",
"last_active": "2021-05-03T21:38:06.841747+0200",
"last_peered": "2021-05-03T22:04:23.661406+0200",
"last_clean": "2021-05-03T19:55:21.415761+0200",
"last_became_active": "2021-05-03T21:07:52.856934+0200",
"last_became_peered": "2021-05-03T22:04:23.661223+0200",
"last_unstale": "2021-05-03T22:04:23.661406+0200",
"last_undegraded": "2021-05-03T22:04:23.655675+0200",
"last_fullsized": "2021-05-03T22:04:23.655388+0200",
"mapping_epoch": 2463,
"log_start": "2218'1055299",
"ondisk_log_start": "2218'1055299",
"created": 86,
"last_epoch_clean": 2239,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "2218'1054903",
"last_scrub_stamp": "2021-05-03T11:59:13.491721+0200",
"last_deep_scrub": "2200'967752",
"last_deep_scrub_stamp": "2021-04-29T19:57:52.275685+0200",
"last_clean_scrub_stamp": "2021-05-03T11:59:13.491721+0200",
"log_size": 5098,
"ondisk_log_size": 5098,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 7673537718,
"num_objects": 2525,
"num_object_clones": 871,
"num_object_copies": 7575,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 5050,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 2525,
"num_whiteouts": 45,
"num_read": 649072,
"num_read_kb": 60498790,
"num_write": 703069,
"num_write_kb": 21124370,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 59,
"num_bytes_recovered": 37598208,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
"num_omap_bytes": 0,
"num_omap_keys": 0,
"num_objects_repaired": 0
},
"up": [
4,
1,
2
],
"acting": [
4,
1,
2
],
"avail_no_missing": [
"2"
],
"object_location_counts": [
{
"shards": "2",
"objects": 2525
}
],
"blocked_by": [],
"up_primary": 4,
"acting_primary": 4,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 2461,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
],
"recovery_state": [
{
"name": "Started/Primary/Peering/GetInfo",
"enter_time": "2021-05-03T23:07:15.911376+0200",
"requested_info_from": [
{
"osd": "0"
},
{
"osd": "1"
}
]
},
{
"name": "Started/Primary/Peering",
"enter_time": "2021-05-03T23:07:15.911350+0200",
"past_intervals": [
{
"first": "2238",
"last": "2462",
"all_participants": [
{
"osd": 0
},
{
"osd": 1
},
{
"osd": 2
},
{
"osd": 4
}
],
"intervals": [
{
"first": "2420",
"last": "2422",
"acting": "1,2"
},
{
"first": "2444",
"last": "2446",
"acting": "1,4"
},
{
"first": "2459",
"last": "2462",
"acting": "2,4"
}
]
}
],
"probing_osds": [
"0",
"1",
"2",
"4"
],
"down_osds_we_would_probe": [],
"peering_blocked_by": []
},
{
"name": "Started",
"enter_time": "2021-05-03T23:07:15.911224+0200"
}
],
"agent_state": {}
}
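
The interesting parts of that query are the blocked_by list and the GetInfo recovery state, which show the PG waiting for info from osd.0 and osd.1. With jq installed, those fields can be pulled out directly, e.g.:

# show which OSDs block peering and the current recovery state for pg 5.0
ceph pg 5.0 query | jq '{blocked_by: .info.stats.blocked_by, recovery_state: .recovery_state}'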
 
root@pve01:~# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 5.23975 - 5.2 TiB 2.4 TiB 2.4 TiB 174 MiB 7.2 GiB 2.8 TiB 46.40 1.00 - root default
-5 1.74658 - 1.7 TiB 829 GiB 827 GiB 61 MiB 2.4 GiB 959 GiB 46.36 1.00 - host pve01
2 ssd 0.87329 1.00000 894 GiB 439 GiB 438 GiB 30 MiB 1.2 GiB 455 GiB 49.12 1.06 68 up osd.2
3 ssd 0.87329 1.00000 894 GiB 390 GiB 389 GiB 32 MiB 1.2 GiB 504 GiB 43.61 0.94 61 up osd.3
-3 1.74658 - 1.7 TiB 831 GiB 829 GiB 61 MiB 2.4 GiB 957 GiB 46.46 1.00 - host pve02
0 ssd 0.87329 1.00000 894 GiB 468 GiB 466 GiB 33 MiB 1.3 GiB 427 GiB 52.30 1.13 73 up osd.0
1 ssd 0.87329 1.00000 894 GiB 363 GiB 362 GiB 28 MiB 1.1 GiB 531 GiB 40.63 0.88 56 up osd.1
-7 1.74658 - 1.7 TiB 829 GiB 827 GiB 51 MiB 2.4 GiB 959 GiB 46.36 1.00 - host pve03
4 ssd 0.87329 1.00000 894 GiB 408 GiB 407 GiB 26 MiB 1.2 GiB 486 GiB 45.64 0.98 63 up osd.4
5 ssd 0.87329 1.00000 894 GiB 421 GiB 420 GiB 25 MiB 1.3 GiB 473 GiB 47.09 1.01 66 up osd.5
TOTAL 5.2 TiB 2.4 TiB 2.4 TiB 174 MiB 7.2 GiB 2.8 TiB 46.40
MIN/MAX VAR: 0.88/1.13 STDDEV: 3.75



root@pve01:~# ceph -s
cluster:
id: 04fbdd5d-9bad-4f0d-a9dc-c2b0f478bc8c
health: HEALTH_WARN
mons are allowing insecure global_id reclaim
Reduced data availability: 88 pgs inactive, 88 pgs peering
36 slow ops, oldest one blocked for 5874 sec, daemons [osd.0,osd.1,osd.4,osd.5,mon.pve01] have slow ops.

services:
mon: 3 daemons, quorum pve01,pve02,pve03 (age 97m)
mgr: pve01(active, since 98m), standbys: pve02, pve03
osd: 6 osds: 6 up (since 56m), 6 in (since 2h)

data:
pools: 2 pools, 129 pgs
objects: 312.29k objects, 885 GiB
usage: 2.4 TiB used, 2.8 TiB / 5.2 TiB avail
pgs: 68.217% pgs not active
88 peering
41 active+clean
 
Problem found!

After rebooting node 3, the network interfaces were renamed. As a result, communication on the Ceph cluster network was broken in such a way that node 3 could not reach node 2 and node 1 could not reach node 3. After a second reboot of node 3, the interface names were correct again and Ceph is healthy.
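
In case it helps someone else: interface renames after a reboot can be avoided by pinning the NIC name, for example with a systemd .link file that matches on the MAC address. The file name, MAC address and interface name below are just placeholders:

# /etc/systemd/network/10-ceph-cluster.link  (example path and name)
[Match]
# MAC address of the NIC used for the Ceph cluster network (placeholder)
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
# stable name, must match what /etc/network/interfaces references (placeholder)
Name=enceph0

After fixing the names, a quick ping of every node's cluster-network address from the other nodes is an easy way to confirm the Ceph cluster network is fully reachable again before expecting the PGs to peer.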