HELP required on incomplete PG

kifeo

Hi Everyone,

Thank you first for taking the time to read this post.
I'm looking for help, as I haven't been able to get around this situation.

This blocks the two main pools of my cluster from becoming active :(

I've seen this post https://forum.proxmox.com/threads/ceph-osd-crashed.114137/#post-493297, and in fact my issue is that I'm hitting Ceph bug #57940 (https://tracker.ceph.com/issues/57940), a duplicate of #56772 (https://tracker.ceph.com/issues/56772).

No progress has been made on the Ceph side so far.

I've tried to export/import the PG and mark it complete on the new OSD, but without luck: I always hit this bug.
So my understanding is that something is wrong with the PG metadata.
I would like help finding which piece of metadata triggers the bug, so that I can work around it by modifying that metadata by hand.
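
For reference, the export/import/mark-complete sequence I tried looked roughly like this (the export file path and the destination OSD id below are only placeholders, and both OSDs were stopped first):

# on proxmox5, with osd.1 stopped
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 8.6b --op export --file /root/pg-8.6b.export
# on the destination host, with that OSD stopped as well (osd.4 is only an example)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --op import --file /root/pg-8.6b.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --pgid 8.6b --op mark-complete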

Attached are some debug files I collected; if you ask, I'll provide anything else you need.
info-8.6b.txt was generated with "ceph-objectstore-tool --op info --data-path /var/lib/ceph/osd/ceph-1 --pgid 8.6b > info-8.6b.txt"
query-8.6b.txt contains the output of "ceph pg 8.76 query"

Note that even if I remove the copies, backfilling from osd.1 also triggers the bug.
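
To try to narrow down which object's metadata triggers the assert, my plan is to poke at the PG contents offline with something like this (the object spec in the second command is just a placeholder, to be copied from the list output):

# with osd.1 stopped, list the objects in the PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 8.6b --op list
# then dump the metadata (including the SnapSet) of a suspect object
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 '<object-json-from-list-output>' dump

If that shows a SnapSet whose clone_overlap is missing an entry for one of its clones, I assume remove-clone-metadata on that clone would be the kind of by-hand fix I'm after, but I'd like confirmation before touching anything.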


The OSD that contains the PGs is osd.1:

root@proxmox5:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         21.83234  root default
 -7          4.54839      host proxmox1
  9   hdd    3.63869          osd.9          up   1.00000  1.00000
  2   ssd    0.90970          osd.2          up   1.00000  1.00000
 -3          6.36778      host proxmox3
  4   hdd    1.81940          osd.4          up   1.00000  1.00000
  8   hdd    3.63869          osd.8          up   1.00000  1.00000
  5   ssd    0.90970          osd.5          up         0  1.00000
-13          0.90970      host proxmox4
  0   ssd    0.90970          osd.0          up   1.00000  1.00000
-10         10.00647      host proxmox5
  1   hdd    3.63869          osd.1        down         0  1.00000
  3   hdd    3.63869          osd.3        down   1.00000  1.00000
  6   hdd    1.81940          osd.6        down         0  1.00000
  7   ssd    0.90970          osd.7          up   1.00000  1.00000

### The crash looks like this:
./src/osd/osd_types.cc: 5888: FAILED ceph_assert(clone_overlap.count(clone))

ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x558e30758f70]
2: /usr/bin/ceph-osd(+0xc2310e) [0x558e3075910e]
3: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x558e30a995f3]
4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x558e3094394e]
5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f3) [0x558e309ae543]
6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x558e309b442a]
7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x558e30823175]
8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x558e30add879]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xad0) [0x558e30843bc0]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x558e30f26c1a]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558e30f291f0]
12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff2ea833ea7]
13: clone()


Thanks a lot !
 

Attachments

  • info-8.6b.txt (4.8 KB)
  • query-8.6b.txt (18.5 KB)
Another option would be to delete these PGs entirely and lose the data.
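
If I go that way, I assume the command involved would be to force-recreate the PGs empty, something like this (please correct me if that's the wrong way to give up on a PG):

ceph osd force-create-pg 8.6b --yes-i-really-mean-it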

What would happen to an RBD image if some of its PGs are removed? Does it crash or something?
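
In the meantime, to at least see which RBD images have objects in the affected PGs, I suppose something like this could work (pool and image names are placeholders, and the listing may hang while the PG is inactive):

# list the objects that live in the PG
rados --pgid 8.6b ls
# match the rbd_data.<id> prefixes against each image's block_name_prefix
rbd info <pool>/<image> | grep block_name_prefix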

Thanks for sharing any ideas or experience.
 
