Ceph 19.2.1: 2 OSD(s) experiencing slow operations in BlueStore

I upgraded to 19.2.2 before the weekend, but no luck:

HEALTH_WARN: 2 OSD(s) experiencing slow operations in BlueStore
osd.9 observed slow operation indications in BlueStore
osd.15 observed slow operation indications in BlueStore
 
I'm seeing all of the mentioned BlueStore warnings... Additionally, taking snapshots now takes forever.
This used to be a matter of seconds; now it takes minutes, and the snaptrim process loads the CPU for a very long time.

I feel this was introduced in Ceph 19.2.2. I will try to gather proper data on this. Just sharing for now; maybe others are experiencing this as well.
 
After data recovery and upgrading to 19.2.2, I am now getting this on one of my pure SSD-class pools (no separate WAL or DB, basic replication only):


Code:
[WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore
     osd.9 observed slow operation indications in BlueStore
[WRN] DB_DEVICE_STALLED_READ_ALERT: 1 OSD(s) experiencing stalled read in db device of BlueFS
     osd.9 observed stalled read indications in DB device

As best as I can tell, this disk is perfectly healthy. :/
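
For anyone who wants to double-check the drive behind an OSD, something along these lines should work (assuming smartmontools is installed on the OSD host; /dev/sdX is a placeholder for the device reported by the first command):
Code:
# Map the OSD to its physical device (the device ID includes model and serial)
ceph device ls-by-daemon osd.9
# Then, on that OSD's host, check SMART health for the reported device
smartctl -a /dev/sdX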
What is your SSD model?
 
The problem still exists and is not solved. I can only look at the error message and pray that it won't crash. Haha.
 
A little info from here:

After changing a Crucial CT240BX500SSD1 to a WD Blue, the problem is gone.
 
Hi there!

As far as I can tell from the docs, it's not a disk failure; it's not even an error condition.

This feature was introduced in Reef with 18.2.5 and Squid with 19.2.1.

You can find the documentation here.
There is also a German-language blog post here.

You can adjust the two variables to your needs:
Code:
ceph config set global bluestore_slow_ops_warn_lifetime 21600
ceph config set global bluestore_slow_ops_warn_threshold 5
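
To double-check what an OSD has actually picked up afterwards, something like this should work (osd.9 is just the example OSD from the output above):
Code:
ceph config get osd.9 bluestore_slow_ops_warn_lifetime
ceph config get osd.9 bluestore_slow_ops_warn_threshold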

I would be very careful about changing both of them or setting the thresholds too low, but you're the expert in your environment.
For my staging cluster it gets rid of the unnecessary noise, but I'm also not dealing with performance issues there, so who knows...
Now I can move forward with prod.

Regards and happy hacking,
Marianne
 
I have to say that I also found this documentation, and after setting bluestore_slow_ops_warn_threshold for the problematic OSD, the warning is gone!

So it really does seem to be a feature...
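
For reference, the per-OSD variant would look something like this (osd.9 just as an example; pick a threshold that fits your environment):
Code:
ceph config set osd.9 bluestore_slow_ops_warn_threshold 5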
 
Ya, I still think something is up... I have three SSD OSDs across hosts, with different types, brands and controller types, all reporting this. I mean MAYBE all three are being bogged down enough to delay I/O for more than a second, but it's not that busy... Maybe it's some sort of round-trip time that includes processing outside of the actual reads/writes...

I wish the docs said what these values are measured in. "bluestore_slow_ops_warn_threshold" seems to default to 1, so I assume 1 second. It looks like the default for "bluestore_slow_ops_warn_lifetime" is only 600 (10 minutes? hours?).

Will experiment here.
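
One way to get the type, default and description straight from the cluster itself, instead of guessing at the units, should be ceph config help, e.g.:
Code:
ceph config help bluestore_slow_ops_warn_threshold
ceph config help bluestore_slow_ops_warn_lifetime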
 
I've observed this for quite some time; it only happens on SSD OSDs, never on NVMe or HDD OSDs.
I'm still using ceph 17.2.8-pve2.

When it happens, I run this in the CLI:

ceph config set osd.x bluestore_slow_ops_warn_threshold 120

and the error goes away.

However, it comes back again randomly on different SSD OSDs, even after plugging in a new one.
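
If you later want to drop the per-OSD override again so the OSD falls back to the global default, this should do it (osd.x as the placeholder above):
Code:
ceph config rm osd.x bluestore_slow_ops_warn_threshold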