Ceph OSDs crash on peering(?)

ccp

I have a somewhat weird problem with Ceph: OSDs crash as soon as I mark them in.

Background: I run an erasure-coded pool and recently had some OSDs crash. A few PGs were stuck in the incomplete state after recovery. The affected data is not critically important, but I want to avoid having to delete the pool if possible.

During my recovery attempts I apparently screwed something up: the PGs got corrupted and now cause the OSDs to crash (this seems to be a known bug: https://tracker.ceph.com/issues/49689).

When I delete all affected PG shards from an OSD via ceph-objectstore-tool, I can start it again. However, as soon as I mark more than one OSD in, the corrupted PG shards get created again and I get to do the cleanup exercise all over again.

Sample error: 2021-09-15T18:35:49.796+0200 7fe10ddc9700 -1 log_channel(cluster) log [ERR] : 23.19s6 past_intervals [5225,8147) start interval does not contain the required bound [3949,8147) start

As soon as I delete PG 23.19s6 (and in some cases a few additional corrupted PGs), I can start the OSD again, but the exact same shards are created over and over. Any suggestions on how to proceed? The cleanup loop I'm running per OSD is sketched below.
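For reference, the per-OSD cleanup loop looks roughly like this (the OSD id 12 and the export path are placeholders from my setup; 23.19s6 is the shard from the error above). I use export-remove instead of remove so a copy of the shard is kept on disk before it is deleted:

systemctl stop ceph-osd@12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list-pgs | grep 23.19
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 23.19s6 --op export-remove --file /root/23.19s6.export
systemctl start ceph-osd@12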

Update: truncating the MDS journal let me get all but 4 OSDs up and in.
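For the record, the truncation was roughly the following (<fsname> is a placeholder for the file system name; I exported the journal first, so the reset is at least in principle reversible via journal import):

ceph fs fail <fsname>
cephfs-journal-tool --rank <fsname>:0 journal export /root/mds-journal.bin
cephfs-journal-tool --rank <fsname>:0 journal reset
ceph fs set <fsname> joinable true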

Update 2: further cleanup loops with ceph-objectstore-tool allowed me to get all but two OSDs up and in. Those two are disks that were wrongfully replaced before the dead OSDs were marked as lost. I will wait for recovery to finish and then probably zap the two OSDs.
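Assuming those two really are unrecoverable, the plan once recovery settles is something like this (the OSD id 7 and the device path are placeholders):

ceph osd lost 7 --yes-i-really-mean-it
ceph osd purge 7 --yes-i-really-mean-it
ceph-volume lvm zap --destroy /dev/sdX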
 