[SOLVED] Ceph - Schedule deep scrubs to prevent service degradation

Whilst our cluster can sustain backfills and deep scrubs during normal operation, storage I/O does degrade when deep scrubs run during production hours (7am-7pm, Monday to Friday). This particularly affects legacy Linux VMs running kernels prior to 2.6.32, which never send 'flush' requests, so RBD never transitions their caches to writeback mode.

We made the following adjustments to Ceph:
Code:
/etc/ceph/ceph.conf
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 10.254.1.0/24
         debug ms = 0/0
         filestore xattr use omap = true
         fsid = a3f1c21f-f883-48e0-9bd2-4f869c72b17d
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd deep scrub interval = 1209600
         osd scrub begin hour = 19
         osd scrub end hour = 6
         osd scrub sleep = 0.1
         public network = 10.254.1.0/24
Essentially:
  • Disabled debug messages
  • Set the deep scrub interval to 2 weeks (1209600 seconds; see the quick check below)
  • Restricted scrubs to the window starting at 7pm and ending at 6am
  • Set the OSD scrub sleep to 0.1 seconds between chunks
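A quick check of the interval arithmetic, and of what a running OSD actually uses (a minimal sketch; osd.0 is just an example OSD id, run the second command on the node hosting that OSD):
Code:
# two weeks expressed in seconds = the osd_deep_scrub_interval value above
echo $(( 14 * 24 * 3600 ))    # prints 1209600

# query the live value via the OSD's admin socket
ceph daemon osd.0 config get osd_deep_scrub_interval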

Finally, we schedule deep scrubs ourselves via cron:
Code:
/etc/cron.d/ceph-scrub-deep
0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10
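The argument is a work ratio: each run deep scrubs roughly 100/ratio percent of the placement groups, oldest deep scrub stamp first. A rough worked example, assuming a hypothetical cluster with 1024 PGs:
Code:
# workratio 4  (Sunday and Monday, 1am): 1024 / 4  + 1 = 257 PGs, roughly 25%
echo $(( 1024 / 4 + 1 ))
# workratio 10 (Tuesday to Saturday, 1am): 1024 / 10 + 1 = 103 PGs, roughly 10%
echo $(( 1024 / 10 + 1 ))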

The '/root/ceph-deep-scrub-pg-ratio' script:
Code:
#!/bin/bash

# /etc/cron.d/ceph-scrub-deep
#  0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
#  0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10
#    Scrub 25% of placement groups that were last scrubbed the longest time ago, starting at 1am Sunday and Monday
#    Scrub 10% of placement groups that were last scrubbed the longest time ago, starting at 1am Tuesday through Saturday

set -o nounset
set -o errexit

CEPH=/usr/bin/ceph
AWK=/usr/bin/awk
SORT=/usr/bin/sort
HEAD=/usr/bin/head
DATE=/bin/date
SED=/bin/sed
GREP=/bin/grep
PYTHON=/usr/bin/python


DEEPMARK="scrubbing+deep";              # String marking a deep scrubbing state in ceph's output
MAXSCRUBS=2;                            # Max concurrent deep scrub operations

workratio=$1;
[ "x$workratio" == x ] && workratio=7;  # Set work ratio from first arg; fall back to '7'.


function isNewerThan() {
    # Args: [PG] [TIMESTAMP]
    # Output: None
    # Returns: 0 if changed; 1 otherwise
    # Desc: Check if a placement group "PG" deep scrub stamp has changed
    # (i.e != "TIMESTAMP")
    pg=$1;
    ots=$2;
    ndate=$($CEPH pg $pg query -f json-pretty | \
        $PYTHON -c 'import json,sys; print(json.loads(sys.stdin.read())["info"]["stats"]["last_deep_scrub_stamp"])');
    nts=$($DATE -d "$ndate" +%s);
    [ $ots -ne $nts ] && return 0;
    return 1;
}

function scrubbingCount() {
    # Args: None
    # Output: int
    # Returns: 0
    # Desc: Outputs the number of concurrent deep scrubbing tasks.
    cnt=$($CEPH -s | $GREP $DEEPMARK | $AWK '{ print $1; }');
    [ "x$cnt" == x ] && cnt=0;
    echo $cnt;
    return 0;
}

function waitForScrubSlot() {
    # Args: None
    # Output: Informative text
    # Returns: true
    # Desc: Idle loop waiting for a free deepscrub slot.
    while [ $(scrubbingCount) -ge $MAXSCRUBS ]; do
        sleep 1;
    done
    return 0;
}

function deepScrubPg() {
    # Args: [PG]
    # Output: Informative text
    # Return: 0 when PG is effectively deep scrubbing
    # Desc: Start a PG "PG" deep-scrub
    $CEPH pg deep-scrub $1 >& /dev/null;
    # Must sleep as ceph does not immediately start scrubbing
    # So we wait until wanted PG effectively goes into deep scrubbing state...
    local emergencyCounter=0;
    while ! $CEPH pg $1 query | $GREP state | $GREP -q $DEEPMARK; do
        isNewerThan $1 $2 && break;
        test $emergencyCounter -gt 150 && break;
        sleep 1;
        emergencyCounter=$(( emergencyCounter + 1 ));
    done
    sleep 2;
    return 0;
}


function getOldestScrubs() {
    # Args: [num_res]
    # Output: [num_res] PG ids
    # Return: 0
    # Desc: Get the "num_res" oldest deep-scrubbed PGs
    numres=$1;
    [ x$numres == x ] && numres=20;
    $CEPH pg dump pgs 2>/dev/null | \
        $AWK '/^[0-9]+\.[0-9a-z]+/ { if($12 == "active+clean") {  print $1,$25,$26 ; }; }' | \
        while read line; do set $line; echo $1 $($DATE -d "$2 $3" +%s); done | \
        $SORT -n -k2 | \
        $HEAD -n $numres;
    return 0;
}

function getPgCount() {
    # Args:
    # Output: number of total PGs
    # Desc: Output the total number of "active+clean" PGs
    $CEPH pg stat | $SED 's/^.* \([0-9]\+\) active+clean[^+].*/\1/g';
}


# Get PG count
pgcnt=$(getPgCount);
# Get the number of PGs we'll be working on
pgwork=$((pgcnt / workratio + 1));

# Actual work starts here, quite self-explanatory.
logger -t ceph_scrub "Ceph deep scrub - Start on $[100/workratio]% of $pgcnt PGs = $pgwork PGs";
getOldestScrubs $pgwork | while read line; do
    set $line;
    waitForScrubSlot;
    deepScrubPg $1 $2;
done
logger -t ceph_scrub "Ceph deep scrub - End";
Credit to Johannes Formann, whose script I altered only slightly:
https://www.formann.de/2015/05/cronjob-to-enable-timed-deep-scrubbing-in-a-ceph-cluster/


Once the script has been running for a while, you can view the deep scrub distribution with the following commands:
Code:
[root@kvm5a ~]# ceph pg dump | grep active | awk '{print $25}' | sort | uniq -c
dumped all
     51 2017-12-02
    267 2017-12-03
    183 2017-12-04
     91 2017-12-05

[root@kvm5a ~]# for date in `ceph pg dump | grep active | awk '{print $25}'`; do date +%A -d $date; done | sort | uniq -c;
dumped all
    183 Monday
     51 Saturday
    267 Sunday
     91 Tuesday
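Note that the column positions in 'ceph pg dump' output shift between Ceph releases, so the awk column ($25 above) may need adjusting. A JSON-based variant side-steps that; this is a sketch assuming jq is installed (on newer releases the stats sit under 'pg_stats', on older ones the dump is a bare array):
Code:
ceph pg dump pgs -f json 2>/dev/null \
    | jq -r '(.pg_stats? // .)[].last_deep_scrub_stamp' \
    | cut -d' ' -f1 | sort | uniq -c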
 
You can make changes to running OSDs without having to restart services:
Code:
ceph tell osd.* injectargs '--debug_ms 0/0';
ceph tell osd.* injectargs '--osd_deep_scrub_interval 1209600';
ceph tell osd.* injectargs '--osd_scrub_begin_hour 19';
ceph tell osd.* injectargs '--osd_scrub_end_hour 6';
ceph tell osd.* injectargs '--osd_scrub_sleep 0.1';
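On newer Ceph releases (Mimic and later) the same options can also be stored in the cluster's central config database instead of ceph.conf; a sketch of the equivalent commands:
Code:
ceph config set osd osd_deep_scrub_interval 1209600
ceph config set osd osd_scrub_begin_hour 19
ceph config set osd osd_scrub_end_hour 6
ceph config set osd osd_scrub_sleep 0.1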

Check effective settings:
Code:
for f in /var/run/ceph/ceph-osd.*.asok; do ceph --admin-daemon $f config show; done | grep 'debug_ms\|osd_deep_scrub_interval\|osd_scrub_begin_hour\|osd_scrub_end_hour\|osd_scrub_sleep'
 
