[SOLVED] Possible BlueStore bug in Ceph 17.2.8

Jan 17, 2025
Hi all,

Recently the latest version of PVE-Ceph Quincy (17.2.8) was released in the Enterprise repository.
However, on another (non-PVE) Ceph storage cluster we ran into quite a nasty bug related to BlueFS/BlueStore (https://tracker.ceph.com/issues/69764), which caused OSDs to crash and recover repeatedly, putting heavy load on the cluster.
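For anyone wanting to check whether a cluster is on the affected release before upgrading, a minimal sketch (the version string is hard-coded here for illustration; on a live cluster it would come from `ceph --version`):

```shell
# Minimal sketch: flag the affected release before upgrading.
# The version is hard-coded for illustration; on a live cluster it
# would come from: ver=$(ceph --version | awk '{print $3}')
ver="17.2.8"
affected="17.2.8"

if [ "$ver" = "$affected" ]; then
    status="affected"
    echo "WARNING: $ver carries the BlueFS regression (tracker #69764)"
else
    status="ok"
    echo "$ver looks unaffected by this particular bug"
fi
```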

Update day is coming up soon, and I'd like to make sure we don't have to roll back quite a few Ceph clusters. Looking at the tracker, the fix is targeted for the next Ceph Reef release (18.2.5).
So my question is: is this an active issue in the PVE-Ceph packages, or has it been resolved in this release?

Thanks in advance!
Best regards
 
AFAICT, 18.2.4 is the most current release. Once 18.2.5 is released (shouldn't be far away from what I can tell), we will package it for Proxmox VE and will slowly push it out through the repository chain, test->no-subscription->enterprise.

But I cannot make any guarantees when the 18.2.5 packages will be available in the respective repositories.
 
Hi Aaron,

We have a couple of clusters still running Ceph Quincy; my concern is specifically the minor-release upgrade from 17.2.7 to 17.2.8, where this bug is present.
It seems that Ceph 17.2.8 is exposed to this bug in the BlueFS.cc file:
https://git.proxmox.com/?p=ceph.git...0;hb=b009440314ae417689ea1c7d5d9e5874e7e3812b
Line 3116: _log_advance_seq();

The bug was introduced by this pull request: https://github.com/ceph/ceph/pull/57241

We had to roll back an entire cluster to 17.2.7 for it to become stable again, hence the concern.

We'd like to verify that this issue would not have any impact before upgrading.
FYI, I have opened a support ticket; I don't mind if we continue there.

Best Regards,
- Demian
 
Hi all,

Thanks to all the amazing staff at Proxmox!
I just received an answer regarding this concern: they have already fixed this in their PVE-Ceph build!

(https://git.proxmox.com/?p=ceph.git...d;hb=71ce71edd912bf31ab1ce63723f43911814c6e3a)

For anyone who encounters this issue on non-PVE Ceph clusters: my solution was to roll back to 17.2.7 and restart all daemons.
That seems to have done the trick!
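For reference, the rollback roughly followed the steps below. This is a dry-run sketch only (it prints the commands instead of executing them), and it assumes a Debian-style install; the exact package names and version pins are assumptions that depend on your distribution and repositories:

```shell
# Dry-run sketch of the rollback: nothing is executed, each step is
# only printed. The package pin below is an assumption for a
# Debian-style install; adapt it to your environment.
steps=""
run() { steps="$steps $1"; echo "+ $*"; }

run ceph osd set noout                                      # avoid rebalancing while daemons restart
run apt-get install --allow-downgrades 'ceph-osd=17.2.7*'   # pin the previous release
run systemctl restart ceph-osd.target                       # restart the OSD daemons on this node
run ceph osd unset noout                                    # once every node is back on 17.2.7
```

Repeat the downgrade and restart node by node, keeping `noout` set until the whole cluster is back on 17.2.7.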

With kind regards,
- Demian
 