Ceph 19.2.1 2 OSD(s) experiencing slow operations in BlueStore

Just to ask, are we sure this is a problem? They added a warning for slowness, but has something actually gotten slower, or are they just alerting on the same behavior now?
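If anyone wants to check whether ops are actually stalling rather than the new warning just being chattier, something like this should show which OSDs are flagged and what the slow ops were (osd.<id> being whatever the health output names; run the daemon command on the node hosting that OSD):

Code:
ceph health detail                           # expands the warning and lists the affected OSDs
ceph daemon osd.<id> dump_historic_slow_ops  # recent slow ops and where they spent their time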
 
There have been 8 OSD errors today, but I haven't found anything to be particularly slow.
 
Hi,

Same issue here after updating from 8.3.5 to 8.4.1 and Ceph from 19.2.0 to 19.2.1.

Any help?

Thanks
 
I don't think we've seen this alert for a few weeks now.

FWIW I saw this post, but did not change any of our (19.2.1) settings:
 
One possibly related note, especially for those with multiple OSD classes: we set our few remaining HDDs to primary-affinity 0, so the primary read would always be from an SSD.

View:
Code:
ceph osd tree

Set:
Code:
ceph osd primary-affinity osd.12 0
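In case it saves someone a bit of typing, here is a minimal sketch that applies the same setting to every spinner in one go, assuming they are all in the hdd device class:

Code:
# zero the primary affinity of every OSD in the hdd device class
for id in $(ceph osd crush class ls-osd hdd); do
    ceph osd primary-affinity osd.$id 0
done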
 
> I don't think we've seen this alert for a few weeks now.
>
> FWIW I saw this post, but did not change any of our (19.2.1) settings:

It is helpful for me.
 
I had this slow operations in BlueStore issue as well and failed to resolve it with the fixes discussed here, just like the rest of you. I think I've found out the power supply in that machine was failing. I've just replaced it and have had no immediate issues (which I was having with BlueStore, and with disk I/O in general, on the old PSU). I highly doubt this helps anyone else; just sharing my observation in case it does help someone.
 
We have some remaining SAS 10k drives. On the prior platform they had a read/write cache SSD, which we're using for DB/WAL. They'll get replaced eventually.
For this specific case, I think it's normal to have random slow-op errors, as your PGs and replicas can be on storage of different speeds (so a primary write on a fast SSD will always wait for the replica on a slow HDD), and for reads it's really Russian roulette.
(Personally, I'd create two different pools, to be sure not to have that random behaviour.)
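For anyone wanting to try the two-pool route, a rough sketch, assuming the OSDs already carry the right ssd/hdd device classes; the rule and pool names below are just examples:

Code:
# CRUSH rules restricted to one device class each
ceph osd crush rule create-replicated rule_ssd default host ssd
ceph osd crush rule create-replicated rule_hdd default host hdd
# point each pool at the matching rule
ceph osd pool set pool_fast crush_rule rule_ssd
ceph osd pool set pool_slow crush_rule rule_hdd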
 
> for reads it's really Russian roulette

For reads, by default it's random, but affinity is set to 0 on the HDDs per my post above, so now all reads are from SSD.

I’m not saying you’re wrong about the rest, just that I haven’t seen this thread’s warning in a few weeks (both before and after that change, which was a week or so ago). So the warning is not really a factor for us, I guess. (And my point/question above was: since there wasn’t a visible warning before, was there actually a problem that people weren’t aware of? Or is there just new warning text people are concerned about, and nothing has changed performance-wise?)

It’s early, so I’m not at higher-math levels, but a “slow write” chance would depend on the number of hosts that have HDDs at all, and also on the relative number of HDDs vs SSDs in them? (A host with all SSDs would have zero chance, of course; a host with, say, 2 HDDs out of 6 drives has a 33% chance.)

Not sure I follow how a separate HDD pool helps with the random behavior…doesn’t matter anyway for us, as the HDDs are only in two servers, so not enough for 3/2 replication by themselves.

I was pointing out the affinity setting because if there are fewer random reads on a drive, I’d expect it to be a bit “faster” for writes due to less head seeking.
 
I see the same message and I only have SSDs. The strange thing is that the message goes away after 4-6 days and comes back after 7 days, without rebooting or anything else.

My cluster includes 5 devices and has been running for 5 years. The last Reef version had the problem, but not the ones before it.
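For what it's worth, the coming-and-going could just be the warning's tracking window rather than anything newly going wrong. If I'm reading the Squid changes right, this health warning is driven by a pair of OSD options (bluestore_slow_ops_warn_threshold and bluestore_slow_ops_warn_lifetime), which you could inspect or loosen, e.g.:

Code:
ceph config get osd bluestore_slow_ops_warn_threshold   # how many slow ops trip the warning
ceph config get osd bluestore_slow_ops_warn_lifetime    # how long (in seconds) a slow op keeps counting
ceph config set osd bluestore_slow_ops_warn_threshold 5 # example: only warn once 5 slow ops accumulate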
 