Ceph 19.2.1 2 OSD(s) experiencing slow operations in BlueStore

Hello,

I did the same, but for me, after the "ceph config set ..." commands I get I/O errors with Samsung and Intel SSDs; I do not see I/O errors with the Crucial SSDs.


[172265.244864] critical target error, dev sdd, sector 34601544 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
[172265.320785] sd 0:0:8:0: [sda] tag#109 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[172265.320792] sd 0:0:8:0: [sda] tag#109 Sense Key : Illegal Request [current]
[172265.320795] sd 0:0:8:0: [sda] tag#109 Add. Sense: Invalid field in parameter list
[172265.320798] sd 0:0:8:0: [sda] tag#109 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
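
Those errors look like the drives rejecting the DISCARD (UNMAP) commands that BlueStore starts sending once bdev_enable_discard is on. As a rough check (just a sketch; /dev/sdd and osd.12 are placeholders for your own device and OSD id), I would first see whether the device advertises discard support at all, and turn the option back off for just the OSDs on drives that misbehave:

lsblk --discard /dev/sdd
(all-zero DISC-GRAN/DISC-MAX columns mean the device does not advertise discard support)
ceph config set osd.12 bdev_enable_discard false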

Best regards.
Francis
 
Hello
Facing the same issue, I think the problem is well described here:
https://www.spinics.net/lists/ceph-users/msg86138.html
In 19.2.1, code was added that watches for OSD slow ops and stalled OSD reads and raises health alarms.
I changed this: ceph config set class:hdd bdev_stalled_read_warn_lifetime 3600
The warning is triggered by the backup process, and an hour later the warning disappears.
But I would like to increase bdev_stalled_read_warn_threshold to avoid the alerts.
How can I know what value to choose?
I don't want to mask real problems with a value that is too high.
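
For picking a value, one possible approach (only a sketch; the value 10 is not a recommendation, and the class:hdd scoping just mirrors the form above): check the current default, see from ceph health detail which OSDs trip the alert during the backup window, and set the threshold slightly above what you consider normal for that workload:

ceph config get osd bdev_stalled_read_warn_threshold
ceph health detail
ceph config set class:hdd bdev_stalled_read_warn_threshold 10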
 
I am doing a test. Steps:

1. Use ceph config dump first to check the current settings.

2. Enter the two commands:
ceph config set global bdev_async_discard_threads 1
ceph config set global bdev_enable_discard true

3. Use ceph config dump again to check that the settings took effect (see the check below).
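
For step 3, this is roughly what I check (osd.0 is just a placeholder, and some bdev_* options may only be picked up when the OSD reopens its block device, i.e. after a restart):

ceph config dump | grep bdev
ceph config show osd.0 bdev_enable_discard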

I have waited 30 minutes now, and the warning has not cleared on its own.

Now I am going to restart a node, which temporarily clears the problem, but the error usually reappears the next day. I need some time before I can report back to you.


If these commands do not solve the problem, you can use the following commands to remove the two added settings and restore the original state.

ceph config rm global bdev_async_discard_threads
ceph config rm global bdev_enable_discard
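To confirm the two settings are really gone after the rm (a quick optional check), the entries should no longer appear and the option should fall back to its compiled-in default:

ceph config dump | grep bdev
ceph config get osd bdev_enable_discard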

Happy Monday. I can report that after the weekend the error message has disappeared. These commands were effective on my small cluster, and I did not apply any patches over the weekend.
A healthy Ceph cluster is back again.
 
We upgraded to 19.2.1 Friday night and rebooted all servers. Saturday morning, two out of three HDD OSDs (with DB on SSD) had this warning. Without my doing anything, the warning was gone when I looked on Sunday (early and again late in the day). No SSD OSDs had the warning.
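
If the warning keeps coming and going like that while you watch the disks, one stopgap is to mute the health check for a while instead of raising thresholds. Take the exact code from ceph health detail; BLUESTORE_SLOW_OP_ALERT below is an assumption about what it will show:

ceph health detail
ceph health mute BLUESTORE_SLOW_OP_ALERT 1w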