Original Post
Ceph Pacific introduced RocksDB sharding. Attempting to reshard an OSD with Ceph Pacific on the Proxmox 7.0-5 beta corrupts the OSD, which then has to be deleted and backfilled. The OSD can't be restarted or repaired after the failed reshard.
I first stopped the OSD and then used the command from the Ceph documentation:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-27 --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
The cluster was 100% healthy before triggering the reshard. I have 3x identical nodes. Each node has 2x Intel P3605 NVMe drives for metadata, RBD, and metrics. There is a single dedicated Intel P3605 NVMe drive for the DB/WAL of the spinning HDDs. There are 10x HDDs for CephFS data. Everything is BlueStore running 16.2.4. The cluster was originally started on Ceph Nautilus, upgraded to Octopus, and now to Pacific. Upgrades were always done following the official Proxmox guide.
Error Log:
root@viper:~# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-27 --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
2021-06-29T07:39:40.949-0500 7f54703dd240 -1 rocksdb: prepare_for_reshard failure parsing column options: block_cache={type=binned_lru}
ceph-bluestore-tool: /build/ceph/ceph-16.2.4/src/rocksdb/db/column_family.cc:1387: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
*** Caught signal (Aborted) **
in thread 7f54703dd240 thread_name:ceph-bluestore-
ceph version 16.2.4 (a912ff2c95b1f9a8e2e48509e602ee008d5c9434) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f5470aa5140]
2: gsignal()
3: abort()
4: /lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7f54705be40f]
5: /lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7f54705cd662]
6: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x82) [0x55ec0217fb36]
7: (std::default_delete<rocksdb::ColumnFamilySet>::operator()(rocksdb::ColumnFamilySet*) const+0x22) [0x55ec01fd699c]
8: (std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x5b) [0x55ec01fd6de5]
9: (std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x2f) [0x55ec01fd08f5]
10: (rocksdb::VersionSet::~VersionSet()+0x4f) [0x55ec01fb6ff9]
11: (rocksdb::VersionSet::~VersionSet()+0x18) [0x55ec01fb7170]
12: (std::default_delete<rocksdb::VersionSet>::operator()(rocksdb::VersionSet*) const+0x28) [0x55ec01e68d64]
13: (std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x5b) [0x55ec01e6ac81]
14: (std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x2f) [0x55ec01e5bef5]
15: (rocksdb::DBImpl::CloseHelper()+0xa12) [0x55ec01e27414]
16: (rocksdb::DBImpl::~DBImpl()+0x4e) [0x55ec01e2784a]
17: (rocksdb::DBImpl::~DBImpl()+0x18) [0x55ec01e27bfa]
18: (RocksDBStore::close()+0x355) [0x55ec01dfc9a5]
19: (RocksDBStore::reshard(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RocksDBStore::resharding_ctrl const*)+0x231) [0x55ec01e03ec1]
20: main()
21: __libc_start_main()
22: _start()
2021-06-29T07:39:40.965-0500 7f54703dd240 -1 *** Caught signal (Aborted) **
in thread 7f54703dd240 thread_name:ceph-bluestore-
ceph version 16.2.4 (a912ff2c95b1f9a8e2e48509e602ee008d5c9434) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f5470aa5140]
2: gsignal()
3: abort()
4: /lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7f54705be40f]
5: /lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7f54705cd662]
6: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x82) [0x55ec0217fb36]
7: (std::default_delete<rocksdb::ColumnFamilySet>::operator()(rocksdb::ColumnFamilySet*) const+0x22) [0x55ec01fd699c]
8: (std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x5b) [0x55ec01fd6de5]
9: (std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x2f) [0x55ec01fd08f5]
10: (rocksdb::VersionSet::~VersionSet()+0x4f) [0x55ec01fb6ff9]
11: (rocksdb::VersionSet::~VersionSet()+0x18) [0x55ec01fb7170]
12: (std::default_delete<rocksdb::VersionSet>::operator()(rocksdb::VersionSet*) const+0x28) [0x55ec01e68d64]
13: (std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x5b) [0x55ec01e6ac81]
14: (std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x2f) [0x55ec01e5bef5]
15: (rocksdb::DBImpl::CloseHelper()+0xa12) [0x55ec01e27414]
16: (rocksdb::DBImpl::~DBImpl()+0x4e) [0x55ec01e2784a]
17: (rocksdb::DBImpl::~DBImpl()+0x18) [0x55ec01e27bfa]
18: (RocksDBStore::close()+0x355) [0x55ec01dfc9a5]
19: (RocksDBStore::reshard(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RocksDBStore::resharding_ctrl const*)+0x231) [0x55ec01e03ec1]
20: main()
21: __libc_start_main()
22: _start()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-980> 2021-06-29T07:39:40.949-0500 7f54703dd240 -1 rocksdb: prepare_for_reshard failure parsing column options: block_cache={type=binned_lru}
-979> 2021-06-29T07:39:40.965-0500 7f54703dd240 -1 *** Caught signal (Aborted) **
in thread 7f54703dd240 thread_name:ceph-bluestore-
ceph version 16.2.4 (a912ff2c95b1f9a8e2e48509e602ee008d5c9434) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f5470aa5140]
2: gsignal()
3: abort()
4: /lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7f54705be40f]
5: /lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7f54705cd662]
6: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x82) [0x55ec0217fb36]
7: (std::default_delete<rocksdb::ColumnFamilySet>::operator()(rocksdb::ColumnFamilySet*) const+0x22) [0x55ec01fd699c]
8: (std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x5b) [0x55ec01fd6de5]
9: (std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x2f) [0x55ec01fd08f5]
10: (rocksdb::VersionSet::~VersionSet()+0x4f) [0x55ec01fb6ff9]
11: (rocksdb::VersionSet::~VersionSet()+0x18) [0x55ec01fb7170]
12: (std::default_delete<rocksdb::VersionSet>::operator()(rocksdb::VersionSet*) const+0x28) [0x55ec01e68d64]
13: (std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x5b) [0x55ec01e6ac81]
14: (std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x2f) [0x55ec01e5bef5]
15: (rocksdb::DBImpl::CloseHelper()+0xa12) [0x55ec01e27414]
16: (rocksdb::DBImpl::~DBImpl()+0x4e) [0x55ec01e2784a]
17: (rocksdb::DBImpl::~DBImpl()+0x18) [0x55ec01e27bfa]
18: (RocksDBStore::close()+0x355) [0x55ec01dfc9a5]
19: (RocksDBStore::reshard(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RocksDBStore::resharding_ctrl const*)+0x231) [0x55ec01e03ec1]
20: main()
21: __libc_start_main()
22: _start()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-3> 2021-06-29T07:39:40.949-0500 7f54703dd240 -1 rocksdb: prepare_for_reshard failure parsing column options: block_cache={type=binned_lru}
0> 2021-06-29T07:39:40.965-0500 7f54703dd240 -1 *** Caught signal (Aborted) **
in thread 7f54703dd240 thread_name:ceph-bluestore-
ceph version 16.2.4 (a912ff2c95b1f9a8e2e48509e602ee008d5c9434) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f5470aa5140]
2: gsignal()
3: abort()
4: /lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7f54705be40f]
5: /lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7f54705cd662]
6: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x82) [0x55ec0217fb36]
7: (std::default_delete<rocksdb::ColumnFamilySet>::operator()(rocksdb::ColumnFamilySet*) const+0x22) [0x55ec01fd699c]
8: (std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x5b) [0x55ec01fd6de5]
9: (std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*)+0x2f) [0x55ec01fd08f5]
10: (rocksdb::VersionSet::~VersionSet()+0x4f) [0x55ec01fb6ff9]
11: (rocksdb::VersionSet::~VersionSet()+0x18) [0x55ec01fb7170]
12: (std::default_delete<rocksdb::VersionSet>::operator()(rocksdb::VersionSet*) const+0x28) [0x55ec01e68d64]
13: (std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x5b) [0x55ec01e6ac81]
14: (std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset(rocksdb::VersionSet*)+0x2f) [0x55ec01e5bef5]
15: (rocksdb::DBImpl::CloseHelper()+0xa12) [0x55ec01e27414]
16: (rocksdb::DBImpl::~DBImpl()+0x4e) [0x55ec01e2784a]
17: (rocksdb::DBImpl::~DBImpl()+0x18) [0x55ec01e27bfa]
18: (RocksDBStore::close()+0x355) [0x55ec01dfc9a5]
19: (RocksDBStore::reshard(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RocksDBStore::resharding_ctrl const*)+0x231) [0x55ec01e03ec1]
20: main()
21: __libc_start_main()
22: _start()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted
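The first line of the log says the tool failed while parsing the column options `block_cache={type=binned_lru}`, so it may help to see which part of the `--sharding` string that is. Below is a minimal reading aid in Python, not Ceph's actual parser: it splits the spec on whitespace and breaks each column definition into a prefix, an optional `(shards,lo-hi)` group, and an optional `=options` suffix. The names `Column` and `parse_sharding` are hypothetical. Reading the numbers in parentheses as a shard count and a hashed key byte range is my understanding of the Ceph documentation and should be treated as an assumption.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# One column definition, e.g. "O(3,0-13)=block_cache={type=binned_lru}".
# Assumed layout: prefix, optional "(shards[,lo-hi])", optional "=options".
_COLUMN = re.compile(
    r"(?P<prefix>[^\s(=]+)"
    r"(?:\((?P<shards>\d+)(?:,(?P<lo>\d+)-(?P<hi>\d+))?\))?"
    r"(?:=(?P<options>\S+))?"
)


class Column(NamedTuple):
    prefix: str                            # key prefix, e.g. "O"
    shards: Optional[int]                  # assumed: number of shards
    hash_range: Optional[Tuple[int, int]]  # assumed: key bytes used for hashing
    options: Optional[str]                 # raw per-column option string


def parse_sharding(spec: str) -> List[Column]:
    """Split a sharding spec into its column definitions.

    Simplification: definitions are assumed to be separated by whitespace
    and the option string is assumed to contain no spaces.
    """
    columns = []
    for token in spec.split():
        match = _COLUMN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised column definition: {token!r}")
        lo, hi = match["lo"], match["hi"]
        columns.append(
            Column(
                prefix=match["prefix"],
                shards=int(match["shards"]) if match["shards"] else None,
                hash_range=(int(lo), int(hi)) if lo is not None else None,
                options=match["options"],
            )
        )
    return columns


if __name__ == "__main__":
    # The exact spec from the failing command in this post.
    spec = "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P"
    for column in parse_sharding(spec):
        print(column)
```

Run against the spec from the command above, only the `O` column carries an option string, and it is the same string quoted in the `prepare_for_reshard` error.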
Bug Report - https://bugzilla.proxmox.com/show_bug.cgi?id=3499