Ceph OSDs crash while backfilling

viplanghe · Jan 4, 2023

Ceph version: 14.2.11
One PG causes the OSDs in its acting set to crash whenever it enters the backfilling state. I have set nobackfill for now so that the OSDs don't flap.

Here is the OSD log:

Code:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))

 ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x5564805f80e5]
 2: (()+0x4d72ad) [0x5564805f82ad]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x5564809120e2]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x28c) [0x556480843aac]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xf65) [0x556480872985]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x114c) [0x5564808767ac]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x2ff) [0x5564806d74ef]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x556480966529]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x5564806f2d3f]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x556480ca6c46]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556480ca9760]
 12: (()+0x7dd5) [0x7f6d9c6a1dd5]
 13: (clone()+0x6d) [0x7f6d9b567ead]

2023-01-04 08:10:36.559 7f6d79c37700 -1 *** Caught signal (Aborted) **
 in thread 7f6d79c37700 thread_name:tp_osd_tp

 ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
 1: (()+0xf5d0) [0x7f6d9c6a95d0]
 2: (gsignal()+0x37) [0x7f6d9b4a0207]
 3: (abort()+0x148) [0x7f6d9b4a18f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5564805f8134]
 5: (()+0x4d72ad) [0x5564805f82ad]
 6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x5564809120e2]
 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x28c) [0x556480843aac]
 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xf65) [0x556480872985]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x114c) [0x5564808767ac]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x2ff) [0x5564806d74ef]
 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x556480966529]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x5564806f2d3f]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x556480ca6c46]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556480ca9760]
 15: (()+0x7dd5) [0x7f6d9c6a1dd5]
 16: (clone()+0x6d) [0x7f6d9b567ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The PG query output includes this line (full output below):

"last_backfill_started": "6:33f11f71:::rbd_data.4ce87586136379.0000000000004e07:head",

I thought this object was causing the issue, so I deleted the volume that contains it. But that didn't help :-(
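For anyone who wants to repeat the lookup, here is a rough sketch of how the `last_backfill_started` value can be pulled out of saved PG query JSON and split into its parts. The helper names (`find_key`, `split_object_id`, `split_rbd_data_name`) are made up for this post, the nesting in the sample JSON is only a placeholder, and the field layout (pool, hash, namespace, key, name, snap) and the `rbd_data.<image id>.<hex object number>` pattern are my reading of the string above, not something taken from the Ceph docs. Names containing escaped colons or extra dots are not handled.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def split_object_id(text):
    """Split 'pool:hash:namespace:key:name:snap' (assumed layout) into a dict."""
    pool, obj_hash, namespace, key, rest = text.split(":", 4)
    name, snap = rest.rsplit(":", 1)
    return {
        "pool": int(pool),
        "hash": obj_hash,
        "namespace": namespace,
        "key": key,
        "name": name,
        "snap": snap,
    }


def split_rbd_data_name(name):
    """Split 'rbd_data.<image id>.<hex object number>' (assumed pattern)."""
    parts = name.split(".")
    if len(parts) != 3 or parts[0] != "rbd_data":
        raise ValueError(f"not a plain rbd_data object name: {name!r}")
    return parts[1], int(parts[2], 16)


# Placeholder document: only the field value is real, the nesting is assumed.
sample = json.loads("""
{"recovery_state": [{"recovery_progress": {
  "last_backfill_started":
    "6:33f11f71:::rbd_data.4ce87586136379.0000000000004e07:head"
}}]}
""")

for value in find_key(sample, "last_backfill_started"):
    obj = split_object_id(value)
    image_id, object_no = split_rbd_data_name(obj["name"])
    print(obj["pool"], image_id, object_no, obj["snap"])
```

The recursive search is used so the script does not depend on where exactly the field sits in the query output. The image id it prints is what I would compare against the image ids in the pool to find the owning volume.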

shanreich · Jan 4, 2023

This looks more like an issue with Ceph itself; you might have more luck asking on the ceph-users mailing list.
