Ceph Squid OSD crash related to RocksDB ceph_assert(cut_off == p->length)

VictorSTS

Distinguished Member
Oct 7, 2019
1,110
622
158
Spain
PVE 8.4.14 + Ceph 19.2.3, 3-node cluster. All disks are PCIe NVMe. Different pools, some with zstd compression enabled.

I'm seeing OSDs crashing lately, all with the same failure. The journal shows that RocksDB cannot be opened and the OSD aborts on an assert. There are a few entries like these every time the OSD service tries to start:

Code:
Feb 20 09:44:34 PVE06 systemd[1]: Started ceph-osd@20.service - Ceph object storage daemon osd.20.
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: ./src/os/bluestore/BlueFS.cc: In function 'int BlueFS::truncate(FileWriter*, uint64_t)' thread 7d15d26de940 time 2026-02-20T09:45:46.959670+0100
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: ./src/os/bluestore/BlueFS.cc: 3871: FAILED ceph_assert(cut_off == p->length)
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  ceph version 19.2.3 (116fa4d1a2c5227d907163f1d05a062467c99f57) squid (stable)
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11e) [0x617eb68ba84b]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  2: /usr/bin/ceph-osd(+0x67a9e8) [0x617eb68ba9e8]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  3: (BlueFS::truncate(BlueFS::FileWriter*, unsigned long)+0x852) [0x617eb7021a92]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  4: (BlueRocksWritableFile::Close()+0x2d) [0x617eb7040d2d]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  5: /usr/bin/ceph-osd(+0x14e8aa6) [0x617eb7728aa6]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  6: (rocksdb::WritableFileWriter::Close()+0xc1a) [0x617eb776279a]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  7: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, std::vector<rocksdb::BlobFileAddition, std::allocator<rocksdb::BlobFileAddition> >*, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::SeqnoToTimeMapping const&, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__cxx11::basic_string<char, std::ch>
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0x1047) [0x617eb7609557]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  9: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*, rocksdb::DBImpl::RecoveryContext*)+0x1ec4) [0x617eb760c0e4]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  10: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*)+0x1fb9) [0x617eb760ef09]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x7a0) [0x617eb7605b30]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x24) [0x617eb7607c74]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  13: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x776) [0x617eb754e5f6]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  14: (BlueStore::_open_db(bool, bool, bool)+0x9d1) [0x617eb6f63231]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  15: (BlueStore::_open_db_and_around(bool, bool)+0x37f) [0x617eb6fa99df]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  16: (BlueStore::_mount()+0x242) [0x617eb6face42]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  17: (OSD::init()+0x4e9) [0x617eb6a1eee9]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  18: main()
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  19: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7d15d32c724a]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  20: __libc_start_main()
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  21: _start()
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: *** Caught signal (Aborted) **

The failed OSDs are on different hosts of the cluster (initially I suspected some hardware issue with the motherboard, PCIe risers, etc.). The disks are OK, pass all tests, and there are no errors in dmesg or the journal related to any kind of disk failure. In fact, removing the OSD and creating a new one works fine; no recreated OSD has failed so far. The OSDs are quite full, around 75%.
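For reference, this is roughly the procedure I use to recreate one (Proxmox tooling; the OSD id and device name below are just placeholders, and since the failed OSD is down anyway the data backfills from the remaining replicas):

Code:
# mark the dead OSD out and let the cluster backfill from the other replicas
ceph osd out 20
# once all PGs are active+clean again, stop and destroy it
systemctl stop ceph-osd@20.service
pveceph osd destroy 20 --cleanup
# recreate it on the same NVMe device
pveceph osd create /dev/nvme2n1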

I also tried to repair RocksDB, but that failed with the same assert in the log (ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-20 repair).

Searching the internet pointed me to a bug report [1] that mentions "flapping" OSDs on Ceph Reef (although the last comment mentions a backport for Squid too). In my case, once the OSDs fail, they never come back.

Questions:
  • IIUC, is this the same issue as the one in the bug report (./src/os/bluestore/BlueFS.cc: 3871: FAILED ceph_assert(cut_off == p->length))?
  • Would it be convenient to recreate every OSD "just in case" to circumvent this bug?
  • Is this bug more prone to show up the fuller an OSD is?
I need to recreate the OSDs ASAP, but I've gathered all the logs in case they are needed.
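For reference, this is roughly how I collected them (the date and crash id are placeholders):

Code:
# full journal of the failing OSD service
journalctl -u ceph-osd@20.service --since "2026-02-19" > osd20-journal.log
# crash reports collected by Ceph's crash module, if any
ceph crash ls
ceph crash info <crash-id>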

Many thanks in advance

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=7211
 
First of all, I'd like to raise concerns about the amount of available storage already being in use. By default Ceph doesn't allow more than 80%, so you should take precautions really soon with that in mind.

I'd highly suspect that your error is caused by your OSDs being too full. Depending on the amount of data and other factors like fragmentation, this can lead to all kinds of different errors. I had a client a few months back with similar problems due to a lack of free space.

The fact that a re-init of the OSD helps is likely because fragmentation is circumvented for a short period of time, but the error will return sooner rather than later.

Can you please provide

Code:
ceph daemon osd.123 bluestore allocator score block

for your still "full" OSDs?
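It would also help to see the overall per-OSD usage, nothing fancy, something like:

Code:
ceph osd df tree
ceph df detail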
 
In short: if Ceph warns you about something, do something about it.

I read the full bug report and found this comment [1]: "This issue seems to mostly affect disks which were heavily fragmented." Mine are, and in fact I have some warnings related to this, although the web UI doesn't show them properly due to this other bug [2].

I've already recreated some other OSDs that had this BLUESTORE_FREE_FRAGMENTATION warning in the past to fix them, so I'm going to recreate those affected, just in case.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=7211#c3
[2] https://bugzilla.proxmox.com/show_bug.cgi?id=6972
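In case it helps someone else hitting this: I'm checking which OSDs currently raise the fragmentation warning with plain health commands (the grep is just a convenience):

Code:
ceph health detail | grep -i -A 2 fragmentation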
 
@fstrankowski I'm fully aware of the risks of an OSD being full and I know how to deal with that, but in no case should an OSD break because of that ;)
Fragmentation definitely has an impact on this and I will watch it more closely from now on. Anyway, I'm expecting new servers for this cluster with an ETA of one week; I'm just hoping to keep it stable until then.
 
so I'm going to recreate those affected, just in case.
This will only fix your problem in the short term; fragmentation will come back relatively quickly. You'd better add more OSDs or wipe some data off your pools :-)
 
I'm just hoping to make it stable until then.
Best of luck to you, fingers crossed. In my client's case I had to rebuild the whole cluster and fix Ceph by manually restoring placement groups, which was a pain.
 
As others have already pointed out, this hits OSDs that are fairly full (~75%+) with heavy disk fragmentation.
v19.2.3 already ships a race condition fix (https://ceph.io/en/news/blog/2025/v19-2-3-squid-released/) that prevents new corruption, but it can't fix what's already broken on disk.
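If you want to double-check that every daemon is actually running the patched build, standard commands are enough, e.g.:

ceph versions
# or per OSD:
ceph tell osd.* version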

First, try the non-destructive repair; it sometimes works:

ceph-bluestore-tool fsck \
    --path /var/lib/ceph/osd/ceph-20 \
    --bluefs_replay_recovery=true \
    --bluefs_replay_recovery_disable_compact=true
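If the fsck gets through, a repair pass with the same tool would be my next step (it may of course still die on the same assert, as it did for you):

ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-20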

The hybrid allocator (the default since Pacific) was designed to improve on the bitmap allocator's fragmentation handling, so check if it's active:

# Check which allocator type is active (on a running OSD)
ceph daemon osd.20 config get bluestore_allocator

# If the OSD is down (like your osd.20), you can check offline:
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-20

# Quick fragmentation overview (ceph daemon talks to the local admin socket,
# so run this on each node; OSDs on other nodes will just show "down"):
for osd in $(ceph osd ls); do
    echo -n "osd.$osd: "
    ceph daemon osd.$osd bluestore allocator score block 2>/dev/null || echo "down"
done
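As far as I know, the score is a value between 0 (no fragmentation) and 1 (completely fragmented free space), so a high score on an already full OSD is the combination to watch out for.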

I had a similar case on my 8-node cluster.
 

Thanks for the heads up. Pretty sure most were created with Ceph Reef, except a few that got recreated recently with Squid 19.2.3. I'm aware of that bug, but given that I don't use EC pools (the Ceph bug report mentions it seems to only happen on OSDs that hold EC pools), I never really paid attention to it. Also, the assertions in the Ceph bug report aren't related to my issue.

@bitranox Thanks for the suggestions. As mentioned above, I already tried ceph-bluestore-tool, but it failed with the same assert error. Most of my not-yet-recreated OSDs are fragmented over 0.75, so I'm in the process of rebuilding them all to dodge this very bug. The cluster isn't too big, the disks are fast and the network is 25G, so it won't take too long.
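In case it's useful to someone else doing the same: while rebuilding I temporarily switch the mClock profile to favour recovery (Squid uses mClock by default; double-check the option name and values against your version before applying):

Code:
ceph config set osd osd_mclock_profile high_recovery_ops
# and back to the default once the rebuild is done
ceph config set osd osd_mclock_profile balanced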
 