PVE 8.4.14 + Ceph 19.2.3, 3-node cluster. All disks are PCIe NVMe. Different pools, some with zstd compression enabled.
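For context, compression on those pools is set per pool, along these lines (pool name and the "aggressive" mode are just examples, not necessarily my exact settings):
Code:
ceph osd pool set mypool compression_algorithm zstd
ceph osd pool set mypool compression_mode aggressive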
I've been seeing OSDs crash lately, always with the same failure. The journal shows they are unable to open RocksDB, aborting on an assert. There are a few entries like these every time the OSD service tries to start:
Code:
Feb 20 09:44:34 PVE06 systemd[1]: Started ceph-osd@20.service - Ceph object storage daemon osd.20.
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: ./src/os/bluestore/BlueFS.cc: In function 'int BlueFS::truncate(FileWriter*, uint64_t)' thread 7d15d26de940 time 2026-02-20T09:45:46.959670+0100
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: ./src/os/bluestore/BlueFS.cc: 3871: FAILED ceph_assert(cut_off == p->length)
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: ceph version 19.2.3 (116fa4d1a2c5227d907163f1d05a062467c99f57) squid (stable)
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11e) [0x617eb68ba84b]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 2: /usr/bin/ceph-osd(+0x67a9e8) [0x617eb68ba9e8]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 3: (BlueFS::truncate(BlueFS::FileWriter*, unsigned long)+0x852) [0x617eb7021a92]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 4: (BlueRocksWritableFile::Close()+0x2d) [0x617eb7040d2d]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 5: /usr/bin/ceph-osd(+0x14e8aa6) [0x617eb7728aa6]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 6: (rocksdb::WritableFileWriter::Close()+0xc1a) [0x617eb776279a]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 7: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, std::vector<rocksdb::BlobFileAddition, std::allocator<rocksdb::BlobFileAddition> >*, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::SeqnoToTimeMapping const&, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__cxx11::basic_string<char, std::ch>
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0x1047) [0x617eb7609557]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 9: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*, rocksdb::DBImpl::RecoveryContext*)+0x1ec4) [0x617eb760c0e4]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 10: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*)+0x1fb9) [0x617eb760ef09]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x7a0) [0x617eb7605b30]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x24) [0x617eb7607c74]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 13: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x776) [0x617eb754e5f6]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 14: (BlueStore::_open_db(bool, bool, bool)+0x9d1) [0x617eb6f63231]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 15: (BlueStore::_open_db_and_around(bool, bool)+0x37f) [0x617eb6fa99df]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 16: (BlueStore::_mount()+0x242) [0x617eb6face42]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 17: (OSD::init()+0x4e9) [0x617eb6a1eee9]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 18: main()
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 19: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7d15d32c724a]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 20: __libc_start_main()
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: 21: _start()
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: *** Caught signal (Aborted) **
The failed OSDs are on different hosts of the cluster (initially I suspected some hardware issue with the motherboard, PCIe risers, etc.). The disks are fine: they pass all tests, and there are no errors in dmesg or the journal pointing to any kind of disk failure. In fact, removing a failed OSD and creating a new one works fine, and no recreated OSD has failed so far. The OSDs are quite full, around 75%.
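For reference, recreating a failed OSD went roughly like this (assuming the usual pveceph workflow; osd.20 and the device path are just examples):
Code:
ceph osd out 20
# wait for rebalancing / HEALTH_OK
systemctl stop ceph-osd@20
pveceph osd destroy 20 --cleanup   # --cleanup also wipes the disk
pveceph osd create /dev/nvme0n1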
I tried to repair RocksDB (ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-20 repair), but that failed too, with the same error in the log. Searching the internet pointed me to a bug report [1] that mentions "flapping" OSDs on Ceph Reef (although the last comment mentions a backport to Squid too). In my case, once the OSDs fail, they never come back.
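For completeness, I can also run a plain fsck on one of the failed OSDs and post the output, if that would help (osd.20 as the example again):
Code:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-20 fsck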
Questions:
- IIUC, this is the same issue as in the bug report (./src/os/bluestore/BlueFS.cc: 3871: FAILED ceph_assert(cut_off == p->length)). Can anyone confirm?
- Would it be convenient to recreate every OSD "just in case" to circumvent this bug?
- Is this bug more likely to show up the fuller an OSD is?
Many thanks in advance
[1] https://bugzilla.proxmox.com/show_bug.cgi?id=7211