HELP required on incomplete PG

kifeo

Hi Everyone,

Thank you first for taking the time to read this post.
I'm looking for help, as I haven't been able to get around this situation.

This blocks the two main pools of my cluster from becoming active :(

I've seen this post https://forum.proxmox.com/threads/ceph-osd-crashed.114137/#post-493297, and in fact my issue is that I'm hitting Ceph bug #57940 (https://tracker.ceph.com/issues/57940), a duplicate of #56772 (https://tracker.ceph.com/issues/56772).

No progress has been made on the Ceph side so far.

I've tried to export/import the PG and mark it complete on the new OSD, but without luck: I always hit this bug.
So my understanding is that something is wrong with the PG metadata.
I would like help finding which piece of metadata triggers the bug, so that I can work around it by modifying that metadata by hand.
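
For reference, the export/import/mark-complete sequence I tried looked roughly like this (the export file path and the destination OSD id below are only placeholders, and both OSDs were stopped first):

# on proxmox5, with osd.1 stopped
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 8.6b --op export --file /root/pg-8.6b.export
# on the destination host, with that OSD stopped as well (osd.4 is only an example)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --op import --file /root/pg-8.6b.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --pgid 8.6b --op mark-complete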

Attached are some debug files I collected; if you ask, I'll provide anything else you need.
info-8.6b.txt was generated with "ceph-objectstore-tool --op info --data-path /var/lib/ceph/osd/ceph-1 --pgid 8.6b > info-8.6b.txt"
query-8.6b.txt contains the output of "ceph pg 8.76 query"

Note that even if I remove the copies, backfilling from osd.1 also triggers the bug.
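
To try to narrow down which object's metadata triggers the assert, my plan is to poke at the PG contents offline with something like this (the object spec in the second command is just a placeholder, to be copied from the list output):

# with osd.1 stopped, list the objects in the PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 8.6b --op list
# then dump the metadata (including the SnapSet) of a suspect object
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 '<object-json-from-list-output>' dump

If that shows a SnapSet whose clone_overlap is missing an entry for one of its clones, I assume remove-clone-metadata on that clone would be the kind of by-hand fix I'm after, but I'd like confirmation before touching anything.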


The OSD that contains the PGs is osd.1:

root@proxmox5:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         21.83234  root default
 -7          4.54839      host proxmox1
  9   hdd    3.63869          osd.9          up   1.00000  1.00000
  2   ssd    0.90970          osd.2          up   1.00000  1.00000
 -3          6.36778      host proxmox3
  4   hdd    1.81940          osd.4          up   1.00000  1.00000
  8   hdd    3.63869          osd.8          up   1.00000  1.00000
  5   ssd    0.90970          osd.5          up         0  1.00000
-13          0.90970      host proxmox4
  0   ssd    0.90970          osd.0          up   1.00000  1.00000
-10         10.00647      host proxmox5
  1   hdd    3.63869          osd.1        down         0  1.00000
  3   hdd    3.63869          osd.3        down   1.00000  1.00000
  6   hdd    1.81940          osd.6        down         0  1.00000
  7   ssd    0.90970          osd.7          up   1.00000  1.00000

### The crash looks like this:
./src/osd/osd_types.cc: 5888: FAILED ceph_assert(clone_overlap.count(clone))

ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x558e30758f70]
2: /usr/bin/ceph-osd(+0xc2310e) [0x558e3075910e]
3: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x558e30a995f3]
4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x558e3094394e]
5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f3) [0x558e309ae543]
6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x558e309b442a]
7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x558e30823175]
8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x558e30add879]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xad0) [0x558e30843bc0]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x558e30f26c1a]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558e30f291f0]
12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff2ea833ea7]
13: clone()


Thanks a lot !
 

Attachments

  • info-8.6b.txt (4.8 KB)
  • query-8.6b.txt (18.5 KB)
Another option would be to delete these PGs entirely and lose the data.
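
If I go that way, I assume the command involved would be to force-recreate the PGs empty, something like this (please correct me if that's the wrong way to give up on a PG):

ceph osd force-create-pg 8.6b --yes-i-really-mean-it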

What would happen to an RBD image if some of its PGs are removed? Does it crash or something?
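
In the meantime, to at least see which RBD images have objects in the affected PGs, I suppose something like this could work (pool and image names are placeholders, and the listing may hang while the PG is inactive):

# list the objects that live in the PG
rados --pgid 8.6b ls
# match the rbd_data.<id> prefixes against each image's block_name_prefix
rbd info <pool>/<image> | grep block_name_prefix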

Thanks for sharing any ideas or experience.
 
