I have a somewhat weird problem with Ceph: OSDs crash as soon as I mark them in.
Background: I run an erasure-coded pool and recently had some OSDs crash. A few PGs were stuck incomplete after recovery. The affected data is not critically important, but I want to avoid having to delete the pool if possible.
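For reference, this is roughly how I identify the stuck PGs (the PG ID below is just an example from my cluster):

    # Show cluster health and the PGs it complains about
    ceph health detail

    # List all PGs currently in the "incomplete" state
    ceph pg ls incomplete

    # Detailed peering info for one affected PG (23.19 is an example PG ID)
    ceph pg 23.19 query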
During my recovery attempts I apparently messed something up; the PGs got corrupted and caused the OSDs to crash (this seems to be a known bug: https://tracker.ceph.com/issues/49689).
When I delete all affected PG shards from the OSD via ceph-objectstore-tool, I can start it again. However, as soon as I mark more than one OSD in, the corrupted PG shards get created again and I get to do the cleanup exercise all over again.
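Roughly what one cleanup pass looks like (OSD ID 12 and the data path are placeholders for whichever OSD is affected; I export each shard first so nothing is lost for good):

    # Stop the affected OSD so ceph-objectstore-tool can open its store
    systemctl stop ceph-osd@12

    # Export the corrupted shard to a file before removing it, just in case
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 23.19s6 --op export --file /root/23.19s6-osd12.export

    # Remove the corrupted shard from the OSD's store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 23.19s6 --op remove --force

    # Start the OSD again
    systemctl start ceph-osd@12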
Sample error:
2021-09-15T18:35:49.796+0200 7fe10ddc9700 -1 log_channel(cluster) log [ERR] : 23.19s6 past_intervals [5225,8147) start interval does not contain the required bound [3949,8147) start
As soon as I delete PG 23.19s6 (and in some cases some additional corrupted PGs), I can start the OSD again, but the exact same shards are created over and over again. Any suggestions on how to proceed?
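In case it helps, this is roughly how I check which shards currently live on a down OSD before each cleanup pass (again, OSD 12 is just an example ID):

    # With the OSD stopped, list every PG shard stored on it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list-pgs

    # Print metadata (including past_intervals) for one suspect shard
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 23.19s6 --op info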
Update: truncating the MDS journal let me get all but 4 OSDs up and in.
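For the record, this is roughly what I ran to back up and reset the MDS journal (the filesystem name "cephfs" and rank 0 are placeholders for my setup):

    # Back up the MDS journal before touching it
    cephfs-journal-tool --rank cephfs:0 journal export /root/mds-journal-backup.bin

    # Write any recoverable dentries from the journal back into the metadata pool
    cephfs-journal-tool --rank cephfs:0 event recover_dentries summary

    # Truncate/reset the journal
    cephfs-journal-tool --rank cephfs:0 journal reset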
Update 2: further cleanup loops with ceph-objectstore-tool allowed me to get all but two OSDs up and in. The two remaining OSDs are disks that were wrongfully replaced before marking the dead OSDs as lost. I will wait for recovery to finish and probably zap the two OSDs then.
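The plan for those two leftover OSDs, once recovery finishes, is roughly this (osd ID 7 and /dev/sdX are placeholders and would need to match the actual OSD and disk):

    # Mark the dead OSD as lost and remove it from the cluster
    ceph osd lost 7 --yes-i-really-mean-it
    ceph osd purge 7 --yes-i-really-mean-it

    # Wipe the disk so it can be re-added as a fresh OSD
    ceph-volume lvm zap --destroy /dev/sdX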