Ceph squid OSD crash related to RocksDB ceph_assert(cut_off == p->length)

VictorSTS · Feb 20, 2026

PVE8.4.14 + Ceph 19.2.3, 3 node cluster. All disks are PCIe NVMe. Different pools, some with zstd compression enabled.

I'm seeing OSD crashing lately with the same failure. Journal shows that it is unable to properly run RocksDB with an assert message. There are a few entries like these every time the OSD service tries to start:

Code:

Feb 20 09:44:34 PVE06 systemd[1]: Started ceph-osd@20.service - Ceph object storage daemon osd.20.
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: ./src/os/bluestore/BlueFS.cc: In function 'int BlueFS::truncate(FileWriter*, uint64_t)' thread 7d15d26de940 time 2026-02-20T09:45:46.959670+0100
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: ./src/os/bluestore/BlueFS.cc: 3871: FAILED ceph_assert(cut_off == p->length)
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  ceph version 19.2.3 (116fa4d1a2c5227d907163f1d05a062467c99f57) squid (stable)
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11e) [0x617eb68ba84b]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  2: /usr/bin/ceph-osd(+0x67a9e8) [0x617eb68ba9e8]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  3: (BlueFS::truncate(BlueFS::FileWriter*, unsigned long)+0x852) [0x617eb7021a92]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  4: (BlueRocksWritableFile::Close()+0x2d) [0x617eb7040d2d]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  5: /usr/bin/ceph-osd(+0x14e8aa6) [0x617eb7728aa6]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  6: (rocksdb::WritableFileWriter::Close()+0xc1a) [0x617eb776279a]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  7: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, std::vector<rocksdb::BlobFileAddition, std::allocator<rocksdb::BlobFileAddition> >*, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::SeqnoToTimeMapping const&, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__cxx11::basic_string<char, std::ch>
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0x1047) [0x617eb7609557]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  9: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*, rocksdb::DBImpl::RecoveryContext*)+0x1ec4) [0x617eb760c0e4]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  10: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*)+0x1fb9) [0x617eb760ef09]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x7a0) [0x617eb7605b30]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x24) [0x617eb7607c74]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  13: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x776) [0x617eb754e5f6]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  14: (BlueStore::_open_db(bool, bool, bool)+0x9d1) [0x617eb6f63231]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  15: (BlueStore::_open_db_and_around(bool, bool)+0x37f) [0x617eb6fa99df]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  16: (BlueStore::_mount()+0x242) [0x617eb6face42]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  17: (OSD::init()+0x4e9) [0x617eb6a1eee9]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  18: main()
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  19: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7d15d32c724a]
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  20: __libc_start_main()
Feb 20 09:45:46 PVE06 ceph-osd[3405113]:  21: _start()
Feb 20 09:45:46 PVE06 ceph-osd[3405113]: *** Caught signal (Aborted) **

The failed OSD are on different hosts of the cluster (initially I suspected some hardware issue with motherboard, PCI risers, etc). The disks are ok, pass all tests, no errors in dmesg or journal related to any kind of disk failure. In fact, removing the OSD and creating a new one works fine. No "recreated" OSD has failed so far. The OSDs are quite full, around 75%.

Tried to repair RocksDB, but failed too with the same error log (ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-20 repair).

Searching the internet has pointed to a bug report [1] that mentions "flapping" OSDs on Ceph Reef (although last comment mentions a backport for Squid too). In my case, once the OSD fail, they never come back.

Questions:

IIUC, this the same issue as in the bug report (./src/os/bluestore/BlueFS.cc: 3871: FAILED ceph_assert(cut_off == p->length)
Would it be convinient recreate every OSD "just in case" to circunvent this bug?
Is this bug more prone to show up the fuller an OSD is?

Need to recreate the OSD ASAP, but I've gathered all logs in case they are needed.

Many thanks in advance

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=7211

fstrankowski · Feb 20, 2026

Initially i'd like to raise concerns about the amount of available storage already beeing in use. By default CEPH doesnt allow more then 80% so you'd have to take precautions really soon while taking these concerns into consideration.

I'd highly suspect that your error is caused by your OSD beeing too full. Based on the amount of data and other factors like fragmentation this could lead to all kind of different errors. I've had a client a few months back with similiar problems due to lack of free space.

The fact that a re-init of the OSD helps might be the reason because fragmentation will be circumvented for a short period of time but the error will return sooner then later.

Can you please provide

ceph daemon osd.123 bluestore allocator score block

for your still "full" OSDs.

VictorSTS · Feb 20, 2026

In short: if Ceph warns you about something, do something about it.

Read the full bugreport and found this comment [1]: "This issue seems to mostly affect disks which were heavily fragmented.". Mine are and in fact I have some warnings related to this, although the webUI doesn't show them properly due to this other bug [2].

I've already recreated some other OSDs that had this BLUESTORE_FREE_FRAGMENTATION warning in the past to fix them, so I'm going to recreate those affected, just in case.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=7211#c3
[2] https://bugzilla.proxmox.com/show_bug.cgi?id=6972

VictorSTS · Feb 20, 2026

@fstrankowski I'm fully aware about the risks of OSD being full and know how to deal with that, but in any case an OSD should break because of that

Definitely fragmentation has an impact on this and will watch it more closely from now on. Anyway, I'm expecting new servers for this cluster with ETA 1 week, I'm just hoping to make it stable until then.

fstrankowski · Feb 20, 2026

VictorSTS said:
so I'm going to recreate those affected, just in case.

This will only fix your problem in the short term. Fragmentation will come back relativly quick. You better add more OSDs or wipe some data off your pools

fstrankowski · Feb 20, 2026

VictorSTS said:
I'm just hoping to make it stable until then.

Best of luck to you * fingers crossed *. I had to rebuild the whole cluster in my clients case and fix ceph by manually restoring placement groups - which was a pain.

gurubert · Feb 20, 2026

Have these OSDs been deployed with 19.2?
You may be seeing this bug: https://docs.clyso.com/blog/critical-bugs-ceph-reef-squid/#squid-deployed-osds-are-crashing

SteveITS · Feb 20, 2026

For reference, PVE doc on that: https://pve.proxmox.com/wiki/Roadmap#8.4-known-issues

bitranox · Feb 20, 2026

as others pointed out already this hits OSDs that are fairly full (~75%+) with heavy disk fragmentation.
v19.2.3 already ships a race condition fix (https://ceph.io/en/news/blog/2025/v19-2-3-squid-released/) that prevents new corruption, but it can't fix what's already broken on disk.

First, try the non-destructive repair it sometimes works:

ceph-bluestore-tool fsck \
--path /var/lib/ceph/osd/ceph-20 \
--bluefs_replay_recovery=true \
--bluefs_replay_recovery_disable_compact=true

The hybrid allocator (default since Pacific) was designed to improve on bitmap's fragmentation handling, so check if its active :

# Check which allocator type is active (on active OSD)
ceph daemon osd.20 config get bluestore_allocator

# If the OSD is down (like your OSD.20), you can check offline:
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-20

# For a quick cluster-wide fragmentation overview:
for osd in $(ceph osd ls); do
echo -n "osd.$osd: "
ceph daemon osd.$osd bluestore allocator score block 2>/dev/null || echo "down"
done

I had similar case on my 8 Node Cluster.

VictorSTS · Feb 20, 2026

gurubert said:
Have these OSDs been deployed with 19.2?
You may be seeing this bug: https://docs.clyso.com/blog/critical-bugs-ceph-reef-squid/#squid-deployed-osds-are-crashing

SteveITS said:
For reference, PVE doc on that: https://pve.proxmox.com/wiki/Roadmap#8.4-known-issues

Thanks for the heads up. Pretty sure most where created with Ceph Reef except a few that got recreated recently with Squid 19.2.3. I'm aware of that bug, but given that I don't use EC pools (Ceph bugreport mentions it seems to only happen on OSD that hold EC pools) never really payed attention to it. Also, the assertions in the Ceph bug report aren't related to my issue.

@bitranox Thanks for the suggestions. As mentioned above, I already tried ceph-bluestore-tool but it failed with the same assert error. Most of my not-yet-recreated OSDs are fragmented over 0'75, so I'm in the process of rebuilding them all to dodge this very bug. This one isn't too big, disks are fast and network is 25G, so it won't take too long.

Deerom · Feb 23, 2026

VictorSTS said:
Thanks for the heads up. Pretty sure most where created with Ceph Reef except a few that got recreated recently with Squid 19.2.3. I'm aware of that bug, but given that I don't use EC pools (Ceph bugreport mentions it seems to only happen on OSD that hold EC pools) never really payed attention to it. Also, the assertions in the Ceph bug report aren't related to my issue.

@bitranox Thanks for the suggestions. As mentioned above, I already tried ceph-bluestore-tool but it failed with the same assert error. Most of my not-yet-recreated OSDs are fragmented over 0'75, so I'm in the process of rebuilding them all to dodge this very bug. This one isn't too big, disks are fast and network is 25G, so it won't take too long.

Hi Victor,

One other 'temporary' thing that you may configure if there is a critical need for all OSDs to be up is to change the allocation_size for each OSD from 64k to 4k using the 'bluestore_shared_alloc_size' parameter [0], which you can then set to '4096'. This atleast allows the OSDs to start and not crash on-boot due to the lack of free 64k aligned sectors.

I did see that this backport fix is already present on no-sub repositories, we're hoping that the patch is live in enterprise soon.

[0] https://tracker.ceph.com/issues/71235

SteveITS · Mar 3, 2026

I see several Ceph updates today:
ceph (19.2.3-1~bpo12+2) bookworm; urgency=medium
* bluefs: fix OSD crash caused by incorrect allocation unit size alignment assertion.

Search

Search

Ceph squid OSD crash related to RocksDB ceph_assert(cut_off == p->length)

VictorSTS

Distinguished Member

fstrankowski

Famous Member

VictorSTS

Distinguished Member

VictorSTS

Distinguished Member

fstrankowski

Famous Member

fstrankowski

Famous Member

gurubert

Distinguished Member

SteveITS

Renowned Member

bitranox

Member

VictorSTS

Distinguished Member

Deerom

New Member

SteveITS

Renowned Member

We value your privacy