[SOLVED] Ceph - Schedule deep scrubs to prevent service degradation

Whilst our cluster can sustain backfills and deep scrubs during normal operation, deep scrubs do affect storage I/O when they run during production hours (7am-7pm, Monday to Friday). This particularly affects legacy Linux VMs running kernels prior to 2.6.32, which don't send 'flush' instructions, so RBD never transitions to writeback caching for them.

We made the following adjustments to Ceph:
Code:
/etc/ceph/ceph.conf
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 10.254.1.0/24
         debug ms = 0/0
         filestore xattr use omap = true
         fsid = a3f1c21f-f883-48e0-9bd2-4f869c72b17d
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd deep scrub interval = 1209600
         osd scrub begin hour = 19
         osd scrub end hour = 6
         osd scrub sleep = 0.1
         public network = 10.254.1.0/24

Essentially:
  • Disabled messenger debug logging ('debug ms = 0/0')
  • Set the deep scrub interval to 2 weeks (1209600 seconds)
  • Restricted scrub hours to begin at 7pm and end at 6am
  • Set OSD scrub to sleep 0.1 seconds between chunks
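
Because the scrub window above starts at 19 and ends at 6, it wraps past midnight. The following is only a sketch of how such a window can be evaluated, assuming the end hour is exclusive; 'in_scrub_window' is a hypothetical helper for illustration and not part of Ceph. It also shows where the 1209600 second interval comes from.

```shell
#!/bin/bash
# Hypothetical helper (not part of Ceph): succeeds when HOUR falls inside
# the window BEGIN..END, assuming END is exclusive.
in_scrub_window() {
    local hour=$1 begin=$2 end=$3
    if [ "$begin" -lt "$end" ]; then
        # Plain window, e.g. 1 -> 5
        [ "$hour" -ge "$begin" ] && [ "$hour" -lt "$end" ]
    else
        # Window wraps past midnight, e.g. 19 -> 6
        [ "$hour" -ge "$begin" ] || [ "$hour" -lt "$end" ]
    fi
}

# Two weeks expressed in seconds, as used for 'osd deep scrub interval'
echo $((14 * 24 * 60 * 60))
```

For example, 'in_scrub_window 23 19 6' succeeds whilst 'in_scrub_window 12 19 6' fails, so the 1am cron jobs further down fall inside the configured window.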

Finally, we schedule deep scrubs manually via cron:
Code:
/etc/cron.d/ceph-scrub-deep
0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10
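
The argument passed to the script is a divisor rather than a percentage: 4 means roughly a quarter of the PGs and 10 roughly a tenth. A minimal sketch of that arithmetic, mirroring the 'pgcnt / workratio + 1' calculation in the script; 'pgs_for_ratio' is a hypothetical name and 592 is just an illustrative PG count.

```shell
#!/bin/bash
# Hypothetical helper: how many PGs one run would deep scrub for a given
# PG count and work ratio (integer division, plus one).
pgs_for_ratio() {
    local pgcnt=$1 ratio=$2
    echo $(( pgcnt / ratio + 1 ))
}

pgs_for_ratio 592 4     # Sunday and Monday runs
pgs_for_ratio 592 10    # Tuesday through Saturday runs
```

With 592 PGs this gives 149 PGs on each of the two heavier nights and 60 on each of the remaining five.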


The '/root/ceph-deep-scrub-pg-ratio' script:
Code:
#!/bin/bash

# /etc/cron.d/ceph-scrub-deep
#  0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
#  0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10
#    Scrub 25% of placement groups that were last scrubbed the longest time ago, starting at 1am Sunday and Monday
#    Scrub 10% of placement groups that were last scrubbed the longest time ago, starting at 1am Tuesday through Saturday

set -o nounset
set -o errexit

CEPH=/usr/bin/ceph
AWK=/usr/bin/awk
SORT=/usr/bin/sort
HEAD=/usr/bin/head
DATE=/bin/date
SED=/bin/sed
GREP=/bin/grep
PYTHON=/usr/bin/python3


DEEPMARK="scrubbing+deep";              # What string matches a deep scrubbing state in ceph pg's output?
MAXSCRUBS=2;                            # Max concurrent deep scrub operations

workratio=${1:-7};                      # Set work ratio from first arg; fall back to '7' (safe under nounset).


function isNewerThan() {
    # Args: [PG] [TIMESTAMP]
    # Output: None
    # Returns: 0 if changed; 1 otherwise
    # Desc: Check if a placement group "PG" deep scrub stamp has changed
    # (i.e != "TIMESTAMP")
    pg=$1;
    ots=$2;
    ndate=$($CEPH pg $pg query -f json-pretty | \
        $PYTHON -c 'import json;import sys; print(json.load(sys.stdin)["info"]["stats"]["last_deep_scrub_stamp"])');
    nts=$($DATE -d "$ndate" +%s);
    [ $ots -ne $nts ] && return 0;
    return 1;
}

function scrubbingCount() {
    # Args: None
    # Output: int
    # Returns: 0
    # Desc: Outputs the number of concurrent deep scrubbing tasks.
    cnt=$($CEPH -s | $GREP $DEEPMARK | $AWK '{ n += $1 } END { print n + 0 }');
    [ "x$cnt" == x ] && cnt=0;
    echo $cnt;
    return 0;
}

function waitForScrubSlot() {
    # Args: None
    # Output: Informative text
    # Returns: true
    # Desc: Idle loop waiting for a free deepscrub slot.
    while [ $(scrubbingCount) -ge $MAXSCRUBS ]; do
        sleep 1;
    done
    return 0;
}

function deepScrubPg() {
    # Args: [PG] [TIMESTAMP]
    # Output: Informative text
    # Return: 0 when PG is effectively deep scrubbing
    # Desc: Start a PG "PG" deep-scrub
    $CEPH pg deep-scrub $1 >& /dev/null;
    # Must sleep as ceph does not immediately start scrubbing
    # So we wait until wanted PG effectively goes into deep scrubbing state...
    local emergencyCounter=0;
    while ! $CEPH pg $1 query | $GREP state | $GREP -q $DEEPMARK; do
        isNewerThan $1 $2 && break;
        test $emergencyCounter -gt 150 && break;
        sleep 1;
        emergencyCounter=$((emergencyCounter + 1));
    done
    sleep 2;
    return 0;
}


function getOldestScrubs() {
    # Args: [num_res]
    # Output: [num_res] PG ids
    # Return: 0
    # Desc: Get the "num_res" oldest deep-scrubbed PGs
    numres=$1;
    [ x$numres == x ] && numres=20;
    $CEPH pg dump pgs 2>/dev/null | \
        $AWK '/^[0-9]+\.[0-9a-z]+/ { if($12 == "active+clean") {  print $1,$23 ; }; }' | \
        while read line; do set $line; echo $1 $($DATE -d "$2" +%s); done | \
        $SORT -n -k2 | \
        $HEAD -n $numres;
    return 0;
}

function getPgCount() {
    # Args:
    # Output: number of total PGs
    # Desc: Output the total number of "active+clean" PGs
    $CEPH pg stat | $SED 's/^.* \([0-9]\+\) active+clean[^+].*/\1/g';
}


# Get PG count
pgcnt=$(getPgCount);
# Get the number of PGs we'll be working on
pgwork=$((pgcnt / workratio + 1));

# Actual work starts here, quite self-explanatory.
logger -t ceph_scrub "Ceph deep scrub - Start on $((100 / workratio))% of $pgcnt PGs = $pgwork PGs";
getOldestScrubs $pgwork | while read line; do
    set $line;
    waitForScrubSlot;
    deepScrubPg $1 $2;
done
logger -t ceph_scrub "Ceph deep scrub - End";

Credits to Johannes Formann, whose script I virtually didn't alter:
https://www.formann.de/2015/05/cronjob-to-enable-timed-deep-scrubbing-in-a-ceph-cluster/
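
The heart of the script is the selection step in getOldestScrubs: convert each deep scrub stamp to an epoch value, sort numerically and keep the first N. The sketch below isolates that step so it can be tried without a cluster; 'oldest_pgs' is a hypothetical name, the sample PG ids and stamps are made up, and GNU date is assumed.

```shell
#!/bin/bash
# Hypothetical helper: reads "PGID TIMESTAMP" lines on stdin and prints the
# N oldest as "PGID EPOCH" (N defaults to 20). Assumes GNU date.
oldest_pgs() {
    local n=${1:-20} pg stamp
    while read -r pg stamp; do
        echo "$pg $(date -d "$stamp" +%s)"
    done | sort -n -k2 | head -n "$n"
}

# Made-up sample input using the Octopus/Pacific stamp format
printf '%s\n' \
    '1.a 2020-12-01T01:00:05.000000+0200' \
    '1.b 2020-11-27T01:00:05.000000+0200' \
    '1.c 2020-11-29T01:00:05.000000+0200' | oldest_pgs 2
```

Using 'read -r pg stamp' instead of 'set $line' keeps the loop from tripping over 'set -o nounset' when a line has fewer fields than expected.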


You can view deep scrub distribution, once the script has been running for a while, with the following commands:
Code:
[root@kvm5a ~]# ceph pg dump | grep active | awk '{print $23}' | cut -dT -f1 | sort | uniq -c
dumped all
     51 2017-12-02
    267 2017-12-03
    183 2017-12-04
     91 2017-12-05

[root@kvm5a ~]# for date in `ceph pg dump | grep active | awk '{print $23}' | cut -dT -f1`; do date +%A -d $date; done | sort | uniq -c;
dumped all
    183 Monday
     51 Saturday
    267 Sunday
     91 Tuesday


NB: This script has been updated to work with Ceph Pacific (16); previous edits of this forum post provide support for Octopus (15), Nautilus (14) and Luminous (12).
 
You can make changes to running OSDs without having to restart services:
Code:
# Quote 'osd.*' so the shell can't expand it against files in the current directory
ceph tell 'osd.*' injectargs '--debug_ms 0/0';
ceph tell 'osd.*' injectargs '--osd_deep_scrub_interval 1209600';
ceph tell 'osd.*' injectargs '--osd_scrub_begin_hour 19';
ceph tell 'osd.*' injectargs '--osd_scrub_end_hour 6';
ceph tell 'osd.*' injectargs '--osd_scrub_sleep 0.1';
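
On releases that have the centralized configuration database (Mimic and later, as far as I'm aware), the same scrub settings can alternatively be stored with 'ceph config set'. This is only a sketch under that assumption; 'apply_scrub_settings' is a hypothetical wrapper, and the CEPH variable exists purely so the commands can be previewed first.

```shell
#!/bin/bash
CEPH=${CEPH:-ceph}   # set CEPH=echo to preview the commands instead of running them

# Hypothetical wrapper: applies the scrub settings from this thread to all
# OSDs via the configuration database.
apply_scrub_settings() {
    local key value
    while read -r key value; do
        $CEPH config set osd "$key" "$value"
    done <<'EOF'
osd_deep_scrub_interval 1209600
osd_scrub_begin_hour 19
osd_scrub_end_hour 6
osd_scrub_sleep 0.1
EOF
}
```

Running 'CEPH=echo apply_scrub_settings' prints the four commands without touching the cluster.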


Check effective settings:
Code:
for f in /var/run/ceph/ceph-osd.*.asok; do ceph --admin-daemon $f config show; done | grep 'debug_ms\|osd_deep_scrub_interval\|osd_scrub_begin_hour\|osd_scrub_end_hour\|osd_scrub_sleep'
 
Hello David,
With Ceph 15 the script fails with the following error:
Code:
ceph-deep-scrub-pg-ratio: line 104: $2: unbound variable

# line 104:
         while read line; do set $line; echo $1 $($DATE -d "$2 $3" +%s); done | \

PS thank you for this script, we've been using it for a few years.
 
Hi Rob,

Been upgrading all non-clustered PVE nodes and completed our first cluster upgrade which included moving from Nautilus to Octopus today.

The 'ceph pg dump' output has changed yet again, although this time the main change is that the space between the date and time has been replaced with a 'T'. As the diff shows, the deep scrub stamp is now read from column 23 instead of columns 25 and 26. Herewith the relevant changes to the deep scrub scheduling script; date is able to convert the timestamp to an epoch value without further modification:

Code:
[root@backup1 ~]# diff -uNr ceph-deep-scrub-pg-ratio.luminous ceph-deep-scrub-pg-ratio
--- ceph-deep-scrub-pg-ratio.luminous   2019-10-04 11:08:53.417797715 +0200
+++ ceph-deep-scrub-pg-ratio    2020-12-02 13:39:07.678349904 +0200
@@ -91,8 +91,8 @@
     numres=$1;
     [ x$numres == x ] && numres=20;
     $CEPH pg dump pgs 2>/dev/null | \
-        $AWK '/^[0-9]+\.[0-9a-z]+/ { if($12 == "active+clean") {  print $1,$25,$26 ; }; }' | \
-        while read line; do set $line; echo $1 $($DATE -d "$2 $3" +%s); done | \
+        $AWK '/^[0-9]+\.[0-9a-z]+/ { if($12 == "active+clean") {  print $1,$23 ; }; }' | \
+        while read line; do set $line; echo $1 $($DATE -d "$2" +%s); done | \
         $SORT -n -k2 | \
         $HEAD -n $numres;
     return 0;


I'll get around to updating the script in the first post above a little later...
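
For anyone running mixed versions, a conversion that tolerates both stamp layouts could look something like the sketch below. 'pg_stamp_to_epoch' is a hypothetical helper, not part of the script above, and it assumes GNU date. Since 'read' simply leaves missing fields empty, it doesn't trigger the 'unbound variable' error under nounset.

```shell
#!/bin/bash
set -o nounset

# Hypothetical helper: takes one line that is either "PG DATE TIME"
# (space-separated stamp) or "PG DATE-T-TIME" (Octopus and later) and
# prints "PG EPOCH". Assumes GNU date.
pg_stamp_to_epoch() {
    local pg first second
    read -r pg first second <<<"$1"
    # 'second' is empty when the stamp arrived as a single field
    echo "$pg $(date -d "${first}${second:+ $second}" +%s)"
}
```

Both 'pg_stamp_to_epoch "1.0 2020-12-02 13:39:07.678349"' and the 'T' separated form go through the same code path.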
 
Updated commands to view scrubbing distribution:

Code:
[root@kvm1 ~]# # By Day of week:
[root@kvm1 ~]# for date in `ceph pg dump 2> /dev/null | grep active | awk '{print $23}' | cut -dT -f1`; do date +%A -d $date; done | sort | uniq -c;
     30 Friday
     79 Monday
    101 Saturday
    112 Sunday
    104 Tuesday
     21 Wednesday
[root@kvm1 ~]# # By Date:
[root@kvm1 ~]# ceph pg dump 2> /dev/null | grep active | awk '{print $23}' | cut -dT -f1 | sort | uniq -c;
     30 2020-11-27
    100 2020-11-28
    111 2020-11-29
     78 2020-11-30
    104 2020-12-01
     21 2020-12-02
 
Fantastic work, thanks for sharing... favorited this one.

(honestly, this should be a set of options in the PMX Ceph admin page.)
 
