Ceph OSDs crash on peering(?)

ccp

I have a somewhat weird problem with Ceph: OSDs crash as soon as I mark them in.

Background: I run an erasure-coded pool and recently had some OSDs crash. A few PGs were stuck in the incomplete state after recovery. The affected data is not critically important, but I want to avoid having to delete the pool if possible.

During my recovery attempts I apparently screwed something up: the PGs got corrupted and now cause the OSDs to crash (this seems to be a known bug: https://tracker.ceph.com/issues/49689).

When I delete all affected PG shards from an OSD via ceph-objectstore-tool, I can start it again. However, as soon as I mark more than one OSD in, the corrupted PG shards get created again and I get to do the cleanup exercise all over again.

Sample error: 2021-09-15T18:35:49.796+0200 7fe10ddc9700 -1 log_channel(cluster) log [ERR] : 23.19s6 past_intervals [5225,8147) start interval does not contain the required bound [3949,8147) start

As soon as I delete PG 23.19s6 (and in some cases a few additional corrupted PGs), I can start the OSD again, but the exact same shards are created over and over. Any suggestions on how to proceed? The cleanup loop I'm running per OSD is sketched below.
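For reference, the per-OSD cleanup loop looks roughly like this (the OSD id 12 and the export path are placeholders from my setup; 23.19s6 is the shard from the error above). I use export-remove instead of remove so a copy of the shard is kept on disk before it is deleted:

systemctl stop ceph-osd@12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list-pgs | grep 23.19
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 23.19s6 --op export-remove --file /root/23.19s6.export
systemctl start ceph-osd@12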

Update: truncating the MDS journal let me get all but 4 OSDs up and in.
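For the record, the truncation was roughly the following (<fsname> is a placeholder for the file system name; I exported the journal first, so the reset is at least in principle reversible via journal import):

ceph fs fail <fsname>
cephfs-journal-tool --rank <fsname>:0 journal export /root/mds-journal.bin
cephfs-journal-tool --rank <fsname>:0 journal reset
ceph fs set <fsname> joinable true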

Update 2: further cleanup loops with ceph-objectstore-tool allowed me to get all but two OSDs up and in. Those two are disks that were wrongfully replaced before the dead OSDs were marked as lost. I will wait for recovery to finish and then probably zap the two OSDs.
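Assuming those two really are unrecoverable, the plan once recovery settles is something like this (the OSD id 7 and the device path are placeholders):

ceph osd lost 7 --yes-i-really-mean-it
ceph osd purge 7 --yes-i-really-mean-it
ceph-volume lvm zap --destroy /dev/sdX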
 