Ceph 19.2.1 2 OSD(s) experiencing slow operations in BlueStore

There have been several updates, but none of the upgrades fixes this problem. It may be caused by the Linux kernel upgrade or the Ceph upgrade. I hope it can be fixed in the next update.
 
Just to chime in, I am also seeing this on my homelab. Everything was fine until I upgraded to 19 (from 18). And I think there is an issue, not just alerting, as when it gets bad enough the MDS servers start getting "cranky" (slow ops) and won't become healthy again until I restart the OSDs that are running slow.
 
I also encountered this problem. I would like to ask: if I restart the OSD after it occurs, will the alarm disappear after a few days? If this problem occurs on an OSD in the future, restarting that OSD should at least work around it temporarily.
 
Sure enough, the alarm disappeared after I restarted the OSD.
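For reference, restarting a single OSD on the node that hosts it looks something like this on a systemd-managed (e.g. Proxmox) install; osd.9 is just an example id:

Code:
# restart one OSD on its host node (example id; use the slow OSD's id)
systemctl restart ceph-osd@9.service
# then check whether the warning clears
ceph health detail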
 
What does

ceph daemon osd.<id> dump_historic_ops

tell you?

This error occurs when your OSDs cannot write out their queue within 30 s. That typically means serious hardware or network issues. E.g. after a full cluster reboot/recovery, this may happen if OSDs are still starting, are slow to start, or just one drive is lagging and keeping the others from starting.
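For anyone following along, pulling those ops from the admin socket on the OSD's host node looks roughly like this (osd.9 is just an example id):

Code:
# slowest recent operations first
ceph daemon osd.9 dump_historic_ops_by_duration
# or the plain list mentioned above
ceph daemon osd.9 dump_historic_ops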
 
Looks like you may have a bad drive. These ops complete in 120 ms, which is long even for spinning hard drives; with a spinning drive you would expect <20 ms plus network latency of ~1-2 ms. Are you using SMR drives? These seem to occur mostly around the time you are rebuilding an OSD, which can indeed put very high load on both drives and network.

Again, either you are severely overloading your network, leading to packet drops at that time, or your drive is failing, causing some operations to take very long. You won't notice in most cases, as Ceph will redirect operations that don't complete in time, but rebuilding does require the drive to be functional.

Given this is generally around commit time to disk, I would suspect the disk.
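If you want to narrow that down yourself, here is a quick sketch (counter names can vary a bit between releases; osd.9 is just an example id):

Code:
# cluster-wide commit/apply latency per OSD, in milliseconds
ceph osd perf
# detailed BlueStore latency counters for one suspect OSD
ceph daemon osd.9 perf dump | grep -A 3 '"commit_lat"'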
 
Hello,

As I said in the first post, there was no problem with Ceph 19.2.0; I get the messages with Ceph 19.2.1.

I have 6 clusters of 6 nodes: 3 clusters on one site (production) and 3 clusters on the other site (rescue).

No messages on the rescue site, only on the production site (where I have more I/O).

All the clusters have the "same" configuration.

System storage: 2 SCSI 10k rpm disks (MD)
Ceph storage: N SATA SSD disks
Ceph network: 4 clusters with 2x40Gb (Mellanox, shared with VMs), 2 clusters with a dedicated 2x10Gb Ceph network.

At this moment, on one production 40Gb cluster with 3 SSDs per node for Ceph, I have 7 OSDs with the "slow..." message.
(I also have "slow..." messages on the other production 40Gb and 10Gb clusters.)

Code:
# ceph -s
  cluster:
    id:     XXXX
    health: HEALTH_WARN
            7 OSD(s) experiencing slow operations in BlueStore

  services:
    mon: 3 daemons, quorum XXX1,XXX3,XXX5 (age 9d)
    mgr: XXX2(active, since 6w), standbys: XXX4, XXX6
    osd: 18 osds: 18 up (since 9h), 18 in (since 8w)

  data:
    pools:   2 pools, 513 pgs
    objects: 1.15M objects, 4.2 TiB
    usage:   12 TiB used, 7.8 TiB / 20 TiB avail
    pgs:     513 active+clean

  io:
    client:   807 MiB/s rd, 6.6 MiB/s wr, 18.18k op/s rd, 623 op/s wr

Best regards.
Francis
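For readers hitting the same HEALTH_WARN, the detailed health output names the affected OSDs, which makes it easier to map them back to hosts and physical drives (osd.9 is just an example id):

Code:
# list exactly which OSDs are flagged
ceph health detail
# map an OSD back to its host and device
ceph osd metadata 9 | grep -E '"hostname"|"devices"'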
 
Ohmmm... I know you don't want to hear it, but this is an SSD drive, and before 18.2.6 it worked without issue.
And nope, no SMR; I think that's only a thing for HDDs.
 
What brand and model? 120 ms is a really long time; for an SSD you would expect <2 ms. I don't think it was ever 'without issue': either you just never noticed, the new versions have a slightly different load pattern that triggers it, or you added more load. You can upgrade to Squid and see if it improves anything, but the logs are pretty clear.
 
Hello,

I have the "slow..." messages also with Squid 19.2.1 (no messages with Squid 19.2.0).

For my 3 clusters with the messages, at the moment all the disks with the OSD slow message are Crucial "CT1000MX500SSD1"...

For the 3 other clusters with no messages I have only Intel and Samsung disks (but lessssss I/Os).

Best regards.

Francis
 
This is a common issue with the (nearly decade-old) MX500s. They are not very good in general: even for desktop use they have major firmware issues and glitch out. You can see if updating the firmware resolves the issue, but also check your SMART values (smartctl -a /dev/xxx); I will guess you have tons of pending sectors, and they are probably 'worn out' to some extent as well. One recommendation is to power the machine on but not start using the drives for about a minute so the firmware can boot up properly.
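For the SMART check, the counters that matter most here are the pending/reallocated sectors and the wear indicators; something like this pulls them out (device path is just an example):

Code:
# full SMART report for the drive
smartctl -a /dev/sda
# just the attributes relevant to the above
smartctl -A /dev/sda | grep -Ei 'pending|realloc|wear|lifetime'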
 
Thank you, we have planned to update the firmware on most of the disks from M3CR043 to M3CR046.
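In case it helps anyone else with these drives, the installed firmware revision shows up in the SMART identity block (example device path):

Code:
# prints a "Firmware Version:" line (e.g. M3CR043 / M3CR046 on the MX500)
smartctl -i /dev/sda | grep -i firmware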
 
The issue was with a Crucial CT240BX500SSD1.

I have changed it to a WD Blue...

I'm watching.
 
An update for my "Also seeing this." post.

I do not have the Crucial drives mentioned, but I do have a bunch of Samsung MZ6ER400HAGL-003 disks. All of the groups reporting this were HDD OSDs with their DB on these disks. Since then, THREE of these disks have failed! Seems like a firmware bug, and they make it VERY hard to update them. They fail with a generic:

Code:
FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x73 [asc=5d, ascq=73]

But they have 100% endurance left... Not ALL of the disks reporting this have failed, but all ARE reporting issues in dmesg. Look for "Buffer I/O error" or "Media impending failure endurance limit met".

This COULD just be a doomsday firmware bug (I might lose all my data, but it's healing), it COULD be a firmware bug being triggered by something in Ceph 19, or maybe Ceph 19 just "caught" the issue just before failure...

If you are seeing this, I recommend looking for drive health issues. No idea if this is a "caught by" or "caused by" situation. But I will be phasing these disks out (assuming my data recovers; I still have a few that I can't "out" just yet...).
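For anyone checking their own drives for the kernel-level symptoms mentioned above, grepping the kernel log for those two strings is a quick first pass (a sketch, not exhaustive):

Code:
# look for the errors called out above
dmesg -T | grep -Ei 'buffer i/o error|impending failure'
# on systemd hosts, also search the persistent journal
journalctl -k | grep -Ei 'buffer i/o error|impending failure'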
 
Just updated. After today's push of Ceph 19.2.2, it will take about 2 days to verify whether this problem is solved. I hope it will be good news.
 
After data recovery and upgrading to 19.2.2, I am now getting this on one of my pure SSD-class pools (no separate WAL or DB, basic replication only):


Code:
[WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore
     osd.9 observed slow operation indications in BlueStore
[WRN] DB_DEVICE_STALLED_READ_ALERT: 1 OSD(s) experiencing stalled read in db device of BlueFS
     osd.9 observed stalled read indications in DB device

As best as I can tell, this disk is perfectly healthy. :/
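If the drive really does check out healthy, one interim option is to mute the two health codes from the output above while investigating; this is just a sketch and it hides the symptom rather than fixing anything:

Code:
# mute the warnings for a few hours while digging into the root cause
ceph health mute BLUESTORE_SLOW_OP_ALERT 4h
ceph health mute DB_DEVICE_STALLED_READ_ALERT 4h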