Ceph 19.2.1: 2 OSD(s) experiencing slow operations in BlueStore

I upgraded to 19.2.2 before the weekend as well, but no luck:

HEALTH_WARN: 2 OSD(s) experiencing slow operations in BlueStore
osd.9 observed slow operation indications in BlueStore
osd.15 observed slow operation indications in BlueStore
 
I'm seeing all of the mentioned BlueStore warnings... Additionally, taking snapshots now takes forever.
This used to be a matter of seconds; now it takes minutes, and the snaptrim process loads the CPU for a very long time.

I feel this was introduced in Ceph 19.2.2. I will try to gather proper data for this; just sharing for now, in case others are experiencing it as well.
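
A rough way to see how much snaptrim work is actually queued, plus the throttle that governs it. This is just a generic sketch using standard Ceph options, not something taken from this thread:
Bash:
# count PGs currently in a snaptrim/snaptrim_wait state
ceph pg dump pgs_brief 2>/dev/null | grep -c snaptrim
# current trim throttle for SSD-backed OSDs (seconds of sleep between trims)
ceph config get osd osd_snap_trim_sleep_ssd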
 
After data recovery and upgrading to 19.2.2, I am now getting this on one of my pure SSD-class pools (no separate WAL or DB, basic replication only):


Code:
[WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore
     osd.9 observed slow operation indications in BlueStore
[WRN] DB_DEVICE_STALLED_READ_ALERT: 1 OSD(s) experiencing stalled read in db device of BlueFS
     osd.9 observed stalled read indications in DB device

As best I can tell, this disk is perfectly healthy. :/
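
In case it helps anyone comparing notes, these are the kinds of checks I'd base "the disk looks healthy" on. osd.9 and /dev/sdX are placeholders, and the ceph daemon commands have to run on the host that carries the OSD:
Bash:
# SMART status of the underlying device
smartctl -a /dev/sdX
# ops currently blocked inside the OSD (admin socket, run on the OSD's host)
ceph daemon osd.9 dump_blocked_ops
# ops currently in flight, with how long they have been waiting
ceph daemon osd.9 dump_ops_in_flight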
What is your SSD model?
 
The problem still exists, and it is not solved. I can only look at the error message and pray that it will not crash. Haha.
 
A little info from here:

After changing a Crucial CT240BX500SSD1 to a WD Blue, the problem is gone.
 
Hi there!

As far as I can tell from the docs, it's not a disk failure; it's not even an error condition.

This feature was introduced in Reef with 18.2.5 and Squid with 19.2.1.

You can find the documentation here.
German-language blog post here.

You can adjust the two variables to your needs:
Code:
ceph config set global bluestore_slow_ops_warn_lifetime 21600
ceph config set global bluestore_slow_ops_warn_threshold 5
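
If it helps, the result can be verified afterwards. A small sketch, with osd.9 simply reused as the example OSD from earlier in the thread:
Bash:
# value an individual OSD resolves after the global change
ceph config get osd.9 bluestore_slow_ops_warn_threshold
# all slow-ops related overrides currently stored in the mon config db
ceph config dump | grep bluestore_slow_ops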

I would be very careful changing them both or setting thresholds too low, but you're the expert in your environment.
For my staging cluster, it silences the unnecessary noise, but I'm also not dealing with performance issues there, so who knows...
Now I can go forward with prod.

Regards and happy hacking,
Marianne
 
I have to say that I also found this documentation and set bluestore_slow_ops_warn_threshold per problematic OSD, and the warning is gone!

So it really does seem to be a feature...
 
Yeah, I still think something is up... I have three SSD OSDs across hosts, with different types, brands and controllers, all reporting this. I mean, MAYBE all three are being bogged down enough to delay I/O for more than a second, but it's not that busy... Maybe this is some sort of round-trip time that includes processing outside of the actual reads/writes...

I wish the docs said what these values are measured in. bluestore_slow_ops_warn_threshold seems to default to 1, so I assume 1 second. It looks like the default for bluestore_slow_ops_warn_lifetime is only 600 (10 minutes? Hours?).
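
Until then, the built-in config help should print the description, type and default straight from the running cluster, which beats guessing:
Bash:
# shows description, type and default value for each option
ceph config help bluestore_slow_ops_warn_threshold
ceph config help bluestore_slow_ops_warn_lifetime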

Will experiment here.
 
I've observed this for quite some time; it only happens on SSD OSDs, never on NVMe or HDD OSDs.
I'm still using Ceph 17.2.8-pve2.

When it happens, I run in the CLI:

ceph config set osd.x bluestore_slow_ops_warn_threshold 120

and the error goes away.

However, it comes back again randomly on a different SSD OSD, even after plugging in a new one.
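
For what it's worth, the override can also be dropped again later (e.g. after swapping the drive), with osd.x being the same placeholder as above:
Bash:
# remove the per-OSD override so the default/global value applies again
ceph config rm osd.x bluestore_slow_ops_warn_threshold
# check what the OSD resolves afterwards
ceph config get osd.x bluestore_slow_ops_warn_threshold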
 
Yes, I'm using 7200 RPM HDDs and had to set the following as a workaround:
Bash:
ceph config set class:hdd bluestore_slow_ops_warn_lifetime 21600
ceph config set class:hdd bluestore_slow_ops_warn_threshold 320
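
To double-check that the class:hdd mask is really what the OSDs pick up, something like this should do; osd.12 is only a made-up example ID:
Bash:
# ask a running HDD OSD which threshold it actually uses
ceph tell osd.12 config get bluestore_slow_ops_warn_threshold
# list the class-level overrides stored in the mon config db
ceph config dump | grep slow_ops_warn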
 
I see that 19.2.3 has been released. I don't know when it will be available. Will it solve this problem?
 
I think I was able to trigger these warnings with minor network disruptions to the cluster network (<<5 s; an LACP bond renegotiated), and in my case I don't think it has anything to do with the drives. It seems like the performance monitoring is just on a hair trigger, and any network hiccup will raise this warning too? As noted above, restarting the OSD does clear the warning until something else happens.

I see multiple references to this being a new Ceph feature (i.e. we got more observability; nothing new is broken?). It does sound useful for flagging drives with the wrong firmware for array use, the kind that go offline for multiple seconds retrying a failed read.
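
In case anyone needs it, the restart that clears the counter looks slightly different depending on how the OSDs are deployed; osd.9 is again just the example ID from this thread:
Bash:
# cephadm / orchestrator managed clusters
ceph orch daemon restart osd.9
# classic systemd-managed OSDs (e.g. Proxmox), run on the OSD's host
systemctl restart ceph-osd@9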
 
On my server it has disappeared... Strange.
Usually it disappears on Monday and Tuesday, and comes back on Wednesday, Thursday, and Friday. This is positively correlated with how busy your system is.