Ceph uses wrong osd_mclock_max_capacity_iops_ssd value

Noah0302

Hello everyone,

I recently installed Ceph into my 3 node cluster, which worked out great at first.
But after a while I noticed that the Ceph pool would sometimes hang and stutter. That's when I looked into the configuration and saw this:
[screenshot: 1667321193992.png]
I use 3 identical SSDs and checked that every node uses SATA 6G and so on. Everything should be working fine, but it seems Ceph thinks OSD 1 is on SATA 3 or something.
There is probably a way to adjust the value manually, or to have Ceph recalculate it, but I have not found it anywhere I looked.

This also happens sometimes. Just now I tried to re-add the OSD to see if that helps, but I think I nuked my Ceph pool:
[screenshot: 1667321867982.png]
Could this SSD be dead, even though its SMART values are OK?

If anyone could help me here, I would be very grateful!
Thanks!
 

See the Ceph manual. To set a custom mClock IOPS value, use the following command:

Code:
ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value>
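
For example, assuming osd.1 is the affected SSD OSD (the OSD ID and the 21500 figure below are just placeholders, not recommendations), setting, verifying, and later removing the override could look like this:

Code:
# override the IOPS capacity Ceph assumes for osd.1
ceph config set osd.1 osd_mclock_max_capacity_iops_ssd 21500

# confirm the value now stored in the configuration database
ceph config get osd.1 osd_mclock_max_capacity_iops_ssd

# remove the override again to fall back to the benchmarked/default value
ceph config rm osd.1 osd_mclock_max_capacity_iops_ssd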



What type of drives are these?
Thank you, that worked! Let's see if it improves performance...

They are cheap 480 GB PNY CS900 SSDs, rated at 89k/83k 4K read/write IOPS, but with barely any cache.

I know that if I want to be serious about Ceph, I should use better SSDs like Samsung Pros or the like, but this is more about testing in my homelab than anything else. I will probably upgrade down the line once I have tested Ceph a bit more.
I was just wondering why this one was not recognized like the others, since they are all exactly the same.
 
I benchmark all of my drives multiple times and then set a consistent value for all of the same type across the cluster. The variance between them is easily explained by differences at the time of benchmarking (which happens automatically when you upgrade or install Ceph).
 
Interesting.
Did you leave some headroom or just set the max value? The docs describe a reservation based on the max IOPS.
 
Same exact problem here. I never ran into this before, but it's been a few months since I had a drive fail; somewhere along the line these values came in as defaults. I've been pulling my hair out trying to figure out what was going on. I stumbled across the values in the "Configuration Database" section, spent a couple of hours on Google, and figured it out.

[screenshot: 1672015645673.png]

I benchmarked the actual drives and got very different results from the "automatic" values.

So the HDD drives absolutely can't take 600-800 IOPS:

Code:
{
    "bytes_written": 12288000,
    "blocksize": 4096,
    "elapsed_sec": 33.020831205999997,
    "bytes_per_sec": 372128.73059861764,
    "iops": 90.851740868803134
}
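
For context, the reported iops figure follows directly from the other fields: 12288000 bytes written / 4096-byte blocks = 3000 writes, and 3000 writes / 33.02 s ≈ 91 IOPS. So the ~90 IOPS result is internally consistent, and nowhere near 600-800.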

I wrote a script to walk the configuration database and set the values, and the cluster is responding much better. (Sharing it in case it helps others.)

I would suggest shipping lower defaults, or adding the ability to tune this in the web UI.


Code:
#!/bin/bash
# intel S3500 120G = 11,500 write IOPS (device class "msd")
# intel S3700 200G = 32,000 write IOPS (device class "ssd")
# seagate 8TB drives = 120 write IOPS (device class "hdd")

ceph osd df > /tmp/osd.txt
grep hdd /tmp/osd.txt | awk '{ print $1}' > /tmp/hdd.list
grep msd /tmp/osd.txt | awk '{ print $1}' > /tmp/msd.list
grep ssd /tmp/osd.txt | awk '{ print $1}' > /tmp/ssd.list

HDDIOP=100
SSDIOP=16000
MSDIOP=5500

echo "Setting $(cat /tmp/hdd.list | wc -l) HDD to ${HDDIOP}"
while read OSD; do
        echo "setting ${HDDIOP} on OSD.${OSD}"
        ceph config set osd.${OSD} osd_mclock_max_capacity_iops_hdd ${HDDIOP}
done </tmp/hdd.list

echo "Setting $(cat /tmp/ssd.list | wc -l) SSD to ${SSDIOP}"
while read OSD; do
        echo "setting ${SSDIOP} on OSD.${OSD}"
        ceph config set osd.${OSD} osd_mclock_max_capacity_iops_ssd ${SSDIOP}
done </tmp/ssd.list

echo "Setting $(cat /tmp/msd.list | wc -l) MSD to ${MSDIOP}"
while read OSD; do
        echo "setting ${MSDIOP} on OSD.${OSD}"
        ceph config set osd.${OSD} osd_mclock_max_capacity_iops_ssd ${MSDIOP}
done </tmp/msd.list
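
After running something like the above, the overrides can be sanity-checked against the configuration database (osd.0 below is just a placeholder):

Code:
# list every mclock capacity override currently stored
ceph config dump | grep osd_mclock_max_capacity_iops

# or check a single OSD
ceph config get osd.0 osd_mclock_max_capacity_iops_hdd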

REFS:
** https://docs.ceph.com/en/quincy/rad...ref/#confval-osd_mclock_max_capacity_iops_hdd
** https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/
 
@dlasher thanks for sharing that. How did you benchmark your drives? Like this?


Code:
root@pve1:~# ceph tell osd.5 cache drop
root@pve1:~# ceph tell osd.5 bench 12288000 4096 4194304 100
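
For reference, the positional arguments to that bench command are total bytes to write, block size, object size, and number of objects, which matches the bytes_written and blocksize fields in the output above. A minimal sketch for running the same 4K benchmark across several OSDs (the OSD IDs here are placeholders):

Code:
# run the same 4K write benchmark on a few OSDs (adjust the IDs to your cluster)
for OSD in 3 4 5; do
    ceph tell osd.${OSD} cache drop
    ceph tell osd.${OSD} bench 12288000 4096 4194304 100
done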