Can't Reshard OSDs under Ceph Pacific 16.2.4

jasonsansone

Original Post

Ceph Pacific introduced new RocksDB sharding. Attempting to reshard an OSD under Ceph Pacific on Proxmox 7.0-5 Beta corrupts the OSD, which then has to be deleted and backfilled. The OSD can't be restarted or repaired after the failed reshard.

I first stopped the OSD and then used the command from the Ceph documentation:

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-27 --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard

The cluster was 100% healthy before triggering the reshard. I have 3x identical nodes. Each node has 2x Intel P3605 NVMe drives for metadata, RBD, and metrics, plus a single dedicated Intel P3605 NVMe drive for the DB/WAL of the spinning disks. There are 10x HDDs for CephFS data. Everything is BlueStore running 16.2.4. The cluster originally started on Ceph Nautilus, was upgraded to Octopus, and is now on Pacific. Upgrades were always done following the official Proxmox guides.
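For reference, the rough sequence was: check the OSD's current sharding, stop it, then reshard. The `show-sharding` call and the `noout` flag below aren't copied from my shell history, just a sketch of what I'd consider the safe order (OSD 27 is simply the one I picked):

Code:
# check what sharding the OSD currently has; OSDs created before Pacific
# typically still report the old (non-sharded) layout
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-27 show-sharding

# stop the OSD before resharding; noout keeps the cluster from starting
# to backfill while the OSD is briefly down
ceph osd set noout
systemctl stop ceph-osd@27.service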

Error Log:
Code:
root@viper:~# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-27 --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
2021-06-29T07:39:40.949-0500 7f54703dd240 -1 rocksdb: prepare_for_reshard failure parsing column options: block_cache={type=binned_lru}
ceph-bluestore-tool: /build/ceph/ceph-16.2.4/src/rocksdb/db/column_family.cc:1387: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
*** Caught signal (Aborted) **
 in thread 7f54703dd240 thread_name:ceph-bluestore-
 ceph version 16.2.4 (a912ff2c95b1f9a8e2e48509e602ee008d5c9434) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f5470aa5140]
 2: gsignal()
 3: abort()
 4: /lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7f54705be40f]
 5: /lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7f54705cd662]
 6: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x82) [0x55ec0217fb36]
 7: (std::default_delete<rocksdb::ColumnFamilySet>::operator()(rocksdb::ColumnFamilySet*) const+0x22) [0x55ec01fd699c]
 8: (std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x5b) [0x55ec01fd6de5]
 9: (std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x2f) [0x55ec01fd08f5]
 10: (rocksdb::VersionSet::~VersionSet()+0x4f) [0x55ec01fb6ff9]
 11: (rocksdb::VersionSet::~VersionSet()+0x18) [0x55ec01fb7170]
 12: (std::default_delete<rocksdb::VersionSet>::operator()(rocksdb::VersionSet*) const+0x28) [0x55ec01e68d64]
 13: (std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x5b) [0x55ec01e6ac81]
 14: (std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x2f) [0x55ec01e5bef5]
 15: (rocksdb::DBImpl::CloseHelper()+0xa12) [0x55ec01e27414]
 16: (rocksdb::DBImpl::~DBImpl()+0x4e) [0x55ec01e2784a]
 17: (rocksdb::DBImpl::~DBImpl()+0x18) [0x55ec01e27bfa]
 18: (RocksDBStore::close()+0x355) [0x55ec01dfc9a5]
 19: (RocksDBStore::reshard(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RocksDBStore::resharding_ctrl const*)+0x231) [0x55ec01e03ec1]
 20: main()
 21: __libc_start_main()
 22: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

[... the same backtrace is repeated twice more in the recent-events log dump; truncated ...]
Aborted


Bug Report - https://bugzilla.proxmox.com/show_bug.cgi?id=3499
 
Sadly, despite reading the docs, I've just been hit by this. I now have a number of OSDs that start and then crash with the following error. Is there a way to create a non-sharded, Octopus-style OSD in Pacific? I'm thinking that might be a way of getting my pool happy again. Can anyone help?

Code:
NOTIFY mbc={}] 2.14 past_intervals [16575,17527) start interval does not contain the required bound [16034,17527) start
2021-07-14T13:05:13.171+0000 7f2a7d0d8700 -1 log_channel(cluster) log [ERR] : 2.f past_intervals [16580,17527) start interval does not contain the required bound [16031,17527) start
2021-07-14T13:05:13.171+0000 7f2a7d0d8700 -1 osd.8 pg_epoch: 17527 pg[2.f( empty local-lis/les=0/0 n=0 ec=16580/16580 lis/c=16074/16030 les/c/f=16075/16031/0 sis=17527) [0] r=-1 lpr=17527 pi=[16580,17527)/4 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] 2.f past_intervals [16580,17527) start interval does not contain the required bound [16031,17527) start
2021-07-14T13:05:13.195+0000 7f2a7d0d8700 -1 ./src/osd/PeeringState.cc: In function 'void PeeringState::check_past_interval_bounds() const' thread 7f2a7d0d8700 time 2021-07-14T13:05:13.175028+0000
./src/osd/PeeringState.cc: 981: ceph_abort_msg("past_interval start interval mismatch")


 ceph version 16.2.4 (a912ff2c95b1f9a8e2e48509e602ee008d5c9434) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x56051d647c0b]
 2: (PeeringState::check_past_interval_bounds() const+0x67c) [0x56051d9a98bc]
 3: (PeeringState::Reset::react(PeeringState::AdvMap const&)+0x292) [0x56051d9bb5c2]
 4: (boost::statechart::simple_state<PeeringState::Reset, PeeringState::PeeringMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x206) [0x56051da086e6]
 5: (PeeringState::advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0x297) [0x56051d9a56f7]
 6: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0xf9) [0x56051d7bf669]
 7: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x3a4) [0x56051d7175c4]
 8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x16e) [0x56051d71993e]
 9: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x56051d97de05]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f8) [0x56051d72d518]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x56051ddc678a]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56051ddc8af0]
 13: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f2a9ba59ea7]
 14: clone()
 
In my limited testing and experience with Ceph Pacific, it will create new OSDs just fine, and those OSDs will use the new DB sharding parameters. The problem is resharding a RocksDB created prior to Pacific in order to enable the new Pacific features; that is where the error occurs. I can't begin to understand your error log, but you should be able to remove and rebuild OSDs (assuming you didn't attempt to reshard so many that there is irreparable data loss).
 
I've tried removing and recreating the affected OSDs, and they seem to create OK, but then fall over with the error above. As I understand it, the sharding is an OSD thing, not a pool thing, so I don't understand why this would be the case. Hence asking whether I can create an OSD that's not sharded in Pacific.
 
Please sticky this thread.

Also, I would suggest moving the RocksDB resharding notice from the bottom of the instructions to the top of the https://pve.proxmox.com/wiki/Ceph_Octopus_to_Pacific document. It's only anecdotal, but I've come across a number of posts about crashes which suggest that upgrading Ceph on PVE 7 is not a good idea at this point.
 
Thankfully I've managed to get everything running again from backups and a NAS sharing over NFS, and I've only lost a small amount of data which is simple to recreate. I'm going to try completely recreating the Ceph pools and OSDs and migrating the data back onto them (after testing, of course!).
 
Glad to hear it! I'm in the same boat, with a FUBAR'ed cluster that I need help with. I was hoping my post would be approved.
 
A quick update from me. A complete uninstall (following these instructions) and reinstall and rebuild of Ceph fixed my issue, but it also appears the problem was exacerbated by the fact that my Ceph nodes regularly had a clock skew of over 200ms. This was caused by having my Ceph nodes sync NTP against a router running as a VM in the Proxmox cluster. Once I switched to the default chrony install, and therefore the standard Debian NTP servers, the clock skew is now under 30ms and Ceph is much happier. Lesson learnt: clock skew is super important for Ceph clusters, and don't sync time from VM-based NTP servers!
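If anyone wants to check their own skew before and after, this is a quick sketch of the commands I'd use (the ceph one reports what the monitors see of each other's clocks, chronyc what the local daemon thinks):

Code:
# clock offsets as seen by the monitor quorum
ceph time-sync-status

# local chrony view on each node
chronyc tracking
chronyc sources -v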
 
Also hit with this bug. The timing was god awful too, but I've managed to mostly recover.

I also attempted to manually reshard OSDs with:
ceph-bluestore-tool --path <data path> --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
per ceph's docs.

Reading the bug report details:
"it seems that the rocksdb option parser does not understand the 'block_cache={type=binned_lru}' option despite it being the default for new osds on pacific"

I am currently attempting to recover these OSDs by skipping the "block_cache" portion of the reshard command:
ceph-bluestore-tool --path <data path> --sharding="m(3) p(3,0-12) O(3,0-13) L P" reshard
and will report back how it goes.

I recognize that removing a default cache can't be good for performance, so if this works, each OSD will be marked out, allowed to drain, and then removed and recreated to be safe.

@t.lamprecht, please see if the warning can be moved to the top of the documentation page (maybe in red). I also read the docs, but clearly stopped once I had the information required to upgrade successfully. This warning should be more prominent given the apparent risk of data loss. While it's my mistake for not reading the doc completely, this is the whole intent of docs: no one reads the docs cover-to-cover like a novel. They're intended to allow you to quickly get the specific information you are looking for. Hiding a warning which prevents actual data loss in the footer of a page is asking for more cases like mine.

This issue also appears to impact new OSDs in some cases. I removed all OSDs on a single node, and some time after the rebuild two of the recently created OSDs started exhibiting the same "past_interval start interval mismatch" error, preventing them from booting.


Also, @rupertbenbrook, with regard to NTP/Chrony:
I just had a ton of fun the other day setting up gpsd for GPS-based clock sync via chrony. While it's overkill and probably not needed, it was a great learning opportunity and might be worth looking into if you want a fun project. Plus, my Ceph cluster is time-synced to satellites, so that's cool.
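In case anyone fancies the same project, the chrony side boils down to refclock lines pointing at gpsd's shared-memory segments; the exact offsets below are placeholders to tune for your receiver, not a drop-in config:

Code:
# /etc/chrony/chrony.conf additions (sketch)
# gpsd publishes NMEA time on SHM 0, and PPS (if wired up) on SHM 1
refclock SHM 0 refid GPS precision 1e-1 offset 0.128 delay 0.2
refclock SHM 1 refid PPS precision 1e-7 prefer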
 
Confirmed: the command does indeed reshard successfully, and from there the OSD starts again.
Again, I recommend marking out, allowing data to transfer, then removing and recreating any OSD that gets this fix, as I am unsure of the performance implications of dropping the binned LRU block cache, but am pretty sure they won't be an improvement.

Workaround, until the PR goes through:
Code:
ceph-bluestore-tool --path <data path> --sharding="m(3) p(3,0-12) O(3,0-13) L P" reshard
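And the mark-out / recreate cycle I keep recommending looks roughly like this; the OSD id and device paths are placeholders, and I'm assuming the Proxmox pveceph wrappers (plain ceph-volume works too):

Code:
# push data off the OSD and wait until all PGs are active+clean again
ceph osd out 27
ceph -s

# then destroy and recreate it
pveceph osd destroy 27 --cleanup
pveceph osd create /dev/sdX --db_dev /dev/nvme2n1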
 
In unrelated news, don't be a smarty-pants like me and attempt to implement the PR noted above yourself.
If you're not careful, you'll upgrade yourself to Ceph 17.0.0.xx (Quincy) and be unable to downgrade due to changes in the mon disk structure.
 
While it's my mistake for not reading the doc completely, this is the whole intent of docs. No one reads the docs cover-to-cover like a novel
There's an explicit "Known Issues" section which includes this one, and that section is also linked in the table of contents at the top. A known-issues section should always be looked for and read; this is not some book/novel or reference documentation after all. It is a clear and short how-to where every step, and the order in which the steps are executed, is very important.

They're intended to allow you to quickly get the specific information you are looking for.
No, an upgrade how-to is meant to be followed closely and as a whole; this isn't configuration documentation where one can pick out a few pieces and it won't matter much because one is only setting up the service for testing anyway.

Hiding a warning which prevents actual data loss in the footer of a page is asking for more cases like mine.
We dedicated a whole, prominently linked (in the TOC) section to such "Known Issues" warnings; I wouldn't exactly call that hiding them ;)

I did add a bullet point to the "Assumption" section at the start to point once more at the "Known Issues" section, and I also added a more explicit hint that the article is meant to be read completely and executed in order.
 
FYI: 16.2.6 has been on our todo list since last week. We need to sort out some release issues with upstream first, but if nothing bigger comes up I'd expect that version to be available on the test repo later this week.
 
  • Like
Reactions: jasonsansone
Thanks for your feedback. Our tests here worked out OK too, so the Bugzilla entry and the upgrade wiki got updated.

Note though that the required version is only available on the ceph-pacific test repo at the time of writing; it may take a few days until we're sure there's no bigger regression in it before we move it into the main repo.
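For reference, enabling the test repo on PVE 7 / Bullseye is the usual one-liner (switch back to the main component once the package has been moved over):

Code:
# /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-pacific bullseye test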
 
