Hey Folks,
Stashing this here as it's the only solution that worked for me and I will undoubtedly need it again.
Given,
Code:
$> ceph health detail
...
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg incomplete
pg 7.188 is incomplete, acting [5,10,43] (reducing pool ceph_pool min_size from 2 may help; search ceph.com/docs for 'incomplete')
...
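In my case it was just the one PG, but if several are incomplete you can list them all with the state filter on pg ls:
Code:
ceph pg ls incomplete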
Code:
$> ceph pg 7.188 query
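The query dumps a lot of JSON; a plain grep over it (adjust the -B/-A context to taste) narrows things down to the blocked-by detail described next:
Code:
ceph pg 7.188 query | grep -B2 -A4 peering_blocked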
Check the recovery_info stanza of the query output for "peering_blocked_by_history_les_bound". If that is what's blocking peering, also check whether this OSD option is currently false:
Code:
ceph config get osd osd_find_best_info_ignore_history_les
If it is, you can potentially get the incomplete PG to rebuild.
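ceph config get shows the value from the mon config database. If you want to confirm what a specific running daemon is actually using, you can also ask it over its admin socket (osd.5 is just the example primary from the health output above; run this on the host carrying that OSD):
Code:
ceph daemon osd.5 config get osd_find_best_info_ignore_history_les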
First, kick off a repair of the PG,
Code:
ceph pg 7.188 repair
and make a note of the OSD it names in its output.
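If you want to double-check which OSD that is, the PG mapping shows the up and acting sets; the first OSD in the acting set is the primary (here acting is [5,10,43], so osd.5):
Code:
ceph pg map 7.188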
Next, flip the OSD option to true,
Code:
ceph config set osd osd_find_best_info_ignore_history_les true
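That sets the option for the whole osd class. If you'd rather limit the blast radius, the same option can be scoped to a single daemon (osd.5 again being only the example primary from above):
Code:
ceph config set osd.5 osd_find_best_info_ignore_history_les true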
then restart the OSD named in the repair output. I actually restarted the whole host (after migrating, as I had another maintenance task anyway). On power-up Ceph started rebuilding correctly, and within 15 minutes the PG was in good shape again. The flag won't affect OSDs that don't have that recovery blocked-by condition set, but it's best to toggle it back to false once the cluster has rebuilt the PG. TBH there was speculation that this might not restore data up to the exact last write that failed, so perhaps you should export/back up the data first (never a bad idea).
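For reference, roughly what those last steps look like on a systemd-based install. osd.5 and the file paths are just examples, and the optional PG export has to run on the OSD's host with that OSD stopped:
Code:
# optional safety net first: export the PG from a stopped OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 --pgid 7.188 --op export --file /root/pg-7.188.export

# restart the affected OSD (or reboot the host, as I did)
systemctl restart ceph-osd@5

# once the PG is healthy again, put the option back
ceph config set osd osd_find_best_info_ignore_history_les false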
Thx for being my notepad.