Whilst our cluster can sustain backfills and deep scrubs during normal operation, deep scrubs do affect storage I/O when they run during production hours (7am-7pm, Monday through Friday). This particularly affects legacy Linux VMs running kernels prior to 2.6.32, which don't send 'flush' requests, so RBD never transitions to writeback caching for them.
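Before changing anything you can confirm the overlap yourself; the commands below are just a quick check (the column index assumes the Pacific 'pg dump pgs' layout used further down) showing whether any deep scrubs are in flight and when each PG was last deep scrubbed:
Code:
# Any deep scrubs currently running?
ceph -s | grep -i scrub
# Per-PG deep scrub timestamps (column 23 in Pacific's 'pg dump pgs' output)
ceph pg dump pgs 2>/dev/null | grep active | awk '{ print $1, $23 }' | head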
We made the following adjustments to Ceph:
Code:
/etc/ceph/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.254.1.0/24
debug ms = 0/0
filestore xattr use omap = true
fsid = a3f1c21f-f883-48e0-9bd2-4f869c72b17d
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd deep scrub interval = 1209600
osd scrub begin hour = 19
osd scrub end hour = 6
osd scrub sleep = 0.1
public network = 10.254.1.0/24
Essentially:
- Disabled debug messages
- Set the deep scrub interval to 2 weeks (1209600 seconds)
- Restricted scrub hours to begin at 7pm and end at 6am
- Set the OSD scrub sleep to 0.1 seconds between chunks
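These values are read from ceph.conf when the OSDs start; if you would rather apply them to a running cluster without restarts, something along these lines should work (a sketch assuming the centralized config store available since Mimic, mirroring the scrub settings above):
Code:
ceph config set osd osd_deep_scrub_interval 1209600
ceph config set osd osd_scrub_begin_hour 19
ceph config set osd osd_scrub_end_hour 6
ceph config set osd osd_scrub_sleep 0.1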
Finally, we schedule deep scrubs manually via cron: 25% of PGs on Sunday and Monday nights and 10% on each remaining night, so every PG is deep scrubbed roughly once a week, well within the 2 week interval:
Code:
/etc/cron.d/ceph-scrub-deep
0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10
The '/root/ceph-deep-scrub-pg-ratio' script:
Code:
#!/bin/bash
# /etc/cron.d/ceph-scrub-deep
# 0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
# 0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10
# Scrub 25% of placement groups that were last scrubbed the longest time ago, starting at 1am Sunday and Monday
# Scrub 10% of placement groups that were last scrubbed the longest time ago, starting at 1am Tuesday through Saturday
set -o nounset
set -o errexit
CEPH=/usr/bin/ceph
AWK=/usr/bin/awk
SORT=/usr/bin/sort
HEAD=/usr/bin/head
DATE=/bin/date
SED=/bin/sed
GREP=/bin/grep
PYTHON=/usr/bin/python3
DEEPMARK="scrubbing+deep"; # What string marks a deep scrubbing state in ceph pg's output?
MAXSCRUBS=2; # Max concurrent deep scrubs operations
workratio=${1:-7}; # Set work ratio from first arg; fall back to '7' (avoids an unbound $1 under 'nounset').
function isNewerThan() {
# Args: [PG] [TIMESTAMP]
# Output: None
# Returns: 0 if changed; 1 otherwise
# Desc: Check if a placement group "PG" deep scrub stamp has changed
# (i.e != "TIMESTAMP")
pg=$1;
ots=$2;
ndate=$($CEPH pg $pg query -f json-pretty | \
$PYTHON -c 'import json;import sys; print(json.load(sys.stdin)["info"]["stats"]["last_deep_scrub_stamp"])');
nts=$($DATE -d "$ndate" +%s);
[ $ots -ne $nts ] && return 0;
return 1;
}
function scrubbingCount() {
# Args: None
# Output: int
# Returns: 0
# Desc: Outputs the number of concurrent deep scrubbing tasks.
cnt=$($CEPH -s | $GREP $DEEPMARK | $AWK '{ print $1; }');
[ "x$cnt" == x ] && cnt=0;
echo $cnt;
return 0;
}
function waitForScrubSlot() {
# Args: None
# Output: Informative text
# Returns: true
# Desc: Idle loop waiting for a free deepscrub slot.
while [ $(scrubbingCount) -ge $MAXSCRUBS ]; do
sleep 1;
done
return 0;
}
function deepScrubPg() {
# Args: [PG]
# Output: Informative text
# Return: 0 when PG is effectively deep scrubbing
# Desc: Start a PG "PG" deep-scrub
$CEPH pg deep-scrub $1 >& /dev/null;
# Must sleep as ceph does not immediately start scrubbing
# So we wait until wanted PG effectively goes into deep scrubbing state...
local emergencyCounter=0;
while ! $CEPH pg $1 query | $GREP state | $GREP -q $DEEPMARK; do
isNewerThan $1 $2 && break;
test $emergencyCounter -gt 150 && break;
sleep 1;
emergencyCounter=$((emergencyCounter + 1));
done
sleep 2;
return 0;
}
function getOldestScrubs() {
# Args: [num_res]
# Output: [num_res] PG ids
# Return: 0
# Desc: Get the "num_res" oldest deep-scrubbed PGs
numres=$1;
[ x$numres == x ] && numres=20;
$CEPH pg dump pgs 2>/dev/null | \
$AWK '/^[0-9]+\.[0-9a-z]+/ { if($12 == "active+clean") { print $1,$23 ; }; }' | \
while read line; do set $line; echo $1 $($DATE -d "$2" +%s); done | \
$SORT -n -k2 | \
$HEAD -n $numres;
return 0;
}
function getPgCount() {
# Args:
# Output: number of total PGs
# Desc: Output the total number of "active+clean" PGs
$CEPH pg stat | $SED 's/^.* \([0-9]\+\) active+clean[^+].*/\1/g';
}
# Get PG count
pgcnt=$(getPgCount);
# Get the number of PGs we'll be working on
pgwork=$((pgcnt / workratio + 1));
# Actual work starts here, quite self-explanatory.
logger -t ceph_scrub "Ceph deep scrub - Start on $((100 / workratio))% of $pgcnt PGs = $pgwork PGs";
getOldestScrubs $pgwork | while read line; do
set $line;
waitForScrubSlot;
deepScrubPg $1 $2;
done
logger -t ceph_scrub "Ceph deep scrub - End";
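To test the schedule outside of cron, the script can be run by hand; start/end messages go to syslog via logger (use journalctl -t ceph_scrub on journal-only systems):
Code:
chmod +x /root/ceph-deep-scrub-pg-ratio
/root/ceph-deep-scrub-pg-ratio 10
grep ceph_scrub /var/log/syslog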
Credit to Johannes Formanns, whose script I barely altered:
https://www.formann.de/2015/05/cronjob-to-enable-timed-deep-scrubbing-in-a-ceph-cluster/
Once the script has been running for a while, you can view the deep scrub distribution with the following commands:
Code:
[root@kvm5a ~]# ceph pg dump | grep active | awk '{print $23}' | cut -dT -f1 | sort | uniq -c
dumped all
51 2017-12-02
267 2017-12-03
183 2017-12-04
91 2017-12-05
[root@kvm5a ~]# for date in `ceph pg dump | grep active | awk '{print $23}' | cut -dT -f1`; do date +%A -d $date; done | sort | uniq -c;
dumped all
183 Monday
51 Saturday
267 Sunday
91 Tuesday
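As an additional check, recent Ceph releases (Nautilus and later) raise a health warning for PGs that miss the deep scrub interval, so no output here means the schedule is keeping up:
Code:
ceph health detail | grep -i 'not deep-scrubbed'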
NB: This script has been updated to work with Ceph Pacific (16), previous edits of this forum post provide support for Octopus (15), Nautilus (14) and Luminous (12).