Ceph 19.2.1 2 OSD(s) experiencing slow operations in BlueStore

Hello,

I did the same, but for me, after the "ceph config set ..." commands I get I/O errors with Samsung and Intel SSDs; I do not see I/O errors with the Crucial SSDs.


[172265.244864] critical target error, dev sdd, sector 34601544 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
[172265.320785] sd 0:0:8:0: [sda] tag#109 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[172265.320792] sd 0:0:8:0: [sda] tag#109 Sense Key : Illegal Request [current]
[172265.320795] sd 0:0:8:0: [sda] tag#109 Add. Sense: Invalid field in parameter list
[172265.320798] sd 0:0:8:0: [sda] tag#109 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
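
Those errors look like the drives rejecting the DISCARD (UNMAP) commands that BlueStore starts sending once bdev_enable_discard is on. As a rough check (just a sketch; /dev/sdd and osd.12 are placeholders for your own device and OSD id), I would first see whether the device advertises discard support at all, and turn the option back off for just the OSDs on drives that misbehave:

lsblk --discard /dev/sdd
(all-zero DISC-GRAN/DISC-MAX columns mean the device does not advertise discard support)
ceph config set osd.12 bdev_enable_discard false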

Best regards.
Francis
 
Hello
Facing the same issue, I think the problem is well described here:
https://www.spinics.net/lists/ceph-users/msg86138.html
In 19.2.1, code was added that watches for OSD slow ops and stalled OSD reads and raises health alarms.
I changed this: ceph config set class:hdd bdev_stalled_read_warn_lifetime 3600
The warning is triggered by the backup process, and an hour later the warning disappears.
But I would like to increase bdev_stalled_read_warn_threshold to avoid the alerts.
How can I know what value to choose?
I don't want to mask real problems with a value that is too high.
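
For picking a value, one possible approach (only a sketch; the value 10 is not a recommendation, and the class:hdd scoping just mirrors the form above): check the current default, see from ceph health detail which OSDs trip the alert during the backup window, and set the threshold slightly above what you consider normal for that workload:

ceph config get osd bdev_stalled_read_warn_threshold
ceph health detail
ceph config set class:hdd bdev_stalled_read_warn_threshold 10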
 
I am doing a test. Steps:

1. Use ceph config dump first to check the current settings.

2. Enter the two commands:
ceph config set global bdev_async_discard_threads 1
ceph config set global bdev_enable_discard true

3. Use ceph config dump again to check that the settings took effect (see the check below).
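
For step 3, this is roughly what I check (osd.0 is just a placeholder, and some bdev_* options may only be picked up when the OSD reopens its block device, i.e. after a restart):

ceph config dump | grep bdev
ceph config show osd.0 bdev_enable_discard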

I have waited 30 minutes now, and the warning has not cleared on its own.

Now I am going to restart a node, which temporarily clears the problem, but the error usually reappears the next day. I need some time before I can report back to you.


If these commands do not solve the problem, you can use the following commands to remove the two added settings and restore the original state.

ceph config rm global bdev_async_discard_threads
ceph config rm global bdev_enable_discard
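To confirm the two settings are really gone after the rm (a quick optional check), the entries should no longer appear and the option should fall back to its compiled-in default:

ceph config dump | grep bdev
ceph config get osd bdev_enable_discard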

Happy Monday. I can report that after the weekend the error message has disappeared. These commands were effective on my small cluster, and I did not apply any patches over the weekend.
A healthy Ceph cluster is back again.
 
We upgraded to 19.2.1 Friday night and rebooted all servers. Saturday morning, two out of three HDD OSDs (with DB on SSD) had this warning. Without my doing anything, the warning was gone when I looked on Sunday (early and again late in the day). No SSD OSDs had the warning.
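
If the warning keeps coming and going like that while you watch the disks, one stopgap is to mute the health check for a while instead of raising thresholds. Take the exact code from ceph health detail; BLUESTORE_SLOW_OP_ALERT below is an assumption about what it will show:

ceph health detail
ceph health mute BLUESTORE_SLOW_OP_ALERT 1w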