Ceph 19.2.1: 2 OSD(s) experiencing slow operations in BlueStore

Hello,

I did the same, but for me the "ceph config set ..." commands cause I/O errors on Samsung and Intel SSDs; I do not see I/O errors on Crucial SSDs.


[172265.244864] critical target error, dev sdd, sector 34601544 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
[172265.320785] sd 0:0:8:0: [sda] tag#109 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[172265.320792] sd 0:0:8:0: [sda] tag#109 Sense Key : Illegal Request [current]
[172265.320795] sd 0:0:8:0: [sda] tag#109 Add. Sense: Invalid field in parameter list
[172265.320798] sd 0:0:8:0: [sda] tag#109 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
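
For anyone seeing the same DISCARD errors: before enabling bdev_enable_discard it may be worth checking whether the drive actually advertises discard/unmap support. A minimal check, assuming standard util-linux and sg3_utils tools and using /dev/sdX as a placeholder for the affected disk (zeros in the DISC-GRAN/DISC-MAX columns mean discard is not supported):

# Discard capabilities as reported by the kernel (0 = not supported)
lsblk --discard /dev/sdX

# Optional: query the SCSI logical block provisioning VPD page directly
sg_vpd --page=lbpv /dev/sdX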

Best regards.
Francis
 
Hello
Facing the same issue, I think the problem is well described here:
https://www.spinics.net/lists/ceph-users/msg86138.html
In 19.2.1, code was added that watches for OSD slow ops and stalled reads and raises health alarms.
I changed this: ceph config set class:hdd bdev_stalled_read_warn_lifetime 3600
The warning is triggered by the backup process, and an hour later the warning disappears.
But I would like to increase bdev_stalled_read_warn_threshold to avoid alerts.
How can I know which value to choose?
I don't want to mask real problems with a value that is too high.
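
For reference, what I am looking at so far, with osd.8 and <count> as placeholders:

# Effective values on one of the affected OSDs (the daemon must be running)
ceph config show osd.8 | grep bdev_stalled_read_warn

# Raise the threshold for HDDs only, keeping the lifetime change from above
ceph config set class:hdd bdev_stalled_read_warn_threshold <count>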
 
I am doing a test. Steps:

1. First use ceph config dump to check the current settings.

2. Then enter the following commands:
ceph config set global bdev_async_discard_threads 1
ceph config set global bdev_enable_discard true

3. Use ceph config dump again to check that the settings took effect (see the check below).
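
For example, a quick way to confirm both options are stored and picked up by a running OSD (osd.8 is just an example id; depending on the release, an OSD restart may still be needed for the discard settings to take effect):

# What is stored in the cluster configuration database
ceph config dump | grep -E 'bdev_enable_discard|bdev_async_discard_threads'

# What a running OSD is actually using
ceph config show osd.8 | grep -E 'bdev_enable_discard|bdev_async_discard_threads'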

I have waited 30 minutes now and the warning has not cleared on its own.

I am now restarting a node, which temporarily clears the problem; usually the error message comes back the next day. I will need some time before I can report back to you.


If this does not solve the problem, you can use the following commands to remove the two added settings and restore the original state.

ceph config rm global bdev_async_discard_threads
ceph config rm global bdev_enable_discard

Happy Monday. I can report that after the weekend the error message has disappeared. These commands were effective on my small cluster, and I did not install any patches over the weekend.
A healthy Ceph cluster is back again.
 
We upgraded to 19.2.1 on Friday night and rebooted all servers. Saturday morning two out of three HDD OSDs (with DB on SSD) had this warning. Without doing anything, when I looked on Sunday (early and again late in the day) the error was gone. No SSD OSDs had the warning.
 
After 2 days I came back and the error had appeared again. Strangely, first there were 7 OSD errors; this morning it was down to 2. It seems the problem still exists. I look forward to the next update patch to solve it.


2 OSD(s) experiencing slow operations in BlueStore
osd.8 observed slow operation indications in BlueStore
osd.17 observed slow operation indications in BlueStore
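
When the warning is active it can also help to query the flagged OSDs directly before restarting anything. Two admin-socket commands that may show what the slow operations actually are, run on the node hosting the OSD (osd.8 taken from the output above):

# In-flight and recently recorded slow operations as seen by the OSD itself
ceph daemon osd.8 dump_ops_in_flight
ceph daemon osd.8 dump_historic_slow_ops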
 
Same issue here with:

ceph: 17.2.8-pve2
proxmox-ve: 8.4.0
pve-manager: 8.4.1

The slow warning appears at least 2-3 times a day on random SSD OSDs.
Restarting the OSD temporarily removes the warning.
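
For reference, the restart is per OSD rather than the whole node, assuming systemd-managed OSDs as on a stock Proxmox/Ceph install (replace <id> with the affected OSD id):

# Restart just the affected OSD, then watch the warning clear
systemctl restart ceph-osd@<id>.service
ceph -s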
 
Yes, usually after 1 day new warnings slowly appear again. I just installed a kernel update and was reminded that I need to reboot for the new kernel to take effect. I have just finished that, so there are currently 0 errors. I will check again tomorrow and the day after to see if it stays normal.
 
The changelog has one entry, and that's got nothing to do with this warning.
Hi Yaga,

Are you sure? In the 19.2.2 changelog there is only one (critical) change, and it is not related to the slow warning:
  • squid: rgw: keep the tails when copying object to itself (pr#62711, cbodley)
 
@aychprox I thought the warning text was new in 19.2.1? Or did 17 get it also?
Also interesting you saw it on SSDs.
Yes, no plan to upgrade to 19.2.1 yet.
I have only seen this on SSDs so far. Initially I thought it was caused by bluestore_cache_size and bluestore_cache_kv_ratio, but no luck even after adjusting them.
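
For what it's worth, the adjustments were along these lines; the values here are placeholders only, and as said above they did not make the warning go away:

# Example only: pin the BlueStore cache size (bytes) and the share given to the KV store
ceph config set class:ssd bluestore_cache_size 4294967296
ceph config set class:ssd bluestore_cache_kv_ratio 0.2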
 

4 OSD(s) experiencing slow operations in BlueStore
osd.7 observed slow operation indications in BlueStore
osd.8 observed slow operation indications in BlueStore
osd.9 observed slow operation indications in BlueStore
osd.15 observed slow operation indications in BlueStore

I continue to follow this topic. Yesterday I installed all the latest patches and restarted all the nodes in the cluster, and naturally all the OSD errors were cleared. This morning when I checked, 4 OSD errors had appeared again, so the problem clearly still exists. I don't know whether it affects data safety. If you haven't upgraded yet, you may want to hold off on 19.2.1 and wait and see.