[SOLVED] Ceph - Schedule deep scrubs to prevent service degradation

Discussion in 'Proxmox VE: Installation and configuration' started by David Herselman, Dec 5, 2017.

  1. David Herselman

    David Herselman Active Member
    Proxmox Subscriber

    Joined:
    Jun 8, 2016
    Messages:
    185
    Likes Received:
    38
    Whilst our cluster can sustain backfills and deep scrubs during normal operation, deep scrubs do affect storage I/O when they run during production hours (7am-7pm, Monday to Friday). This particularly affects legacy Linux VMs running kernels prior to 2.6.32, which never send 'flush' requests, so RBD never transitions their caches to writeback mode.
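
    For reference, this caching behaviour is controlled by the librbd option 'rbd cache writethrough until flush' (enabled by default), which keeps the RBD cache in writethrough mode until the guest issues its first flush. Purely as an illustration, and not something we changed here, a client-side override on hypervisors that only host such legacy guests could look like the snippet below; it trades away the safety net for guests that genuinely never flush:
    Code:
    /etc/ceph/ceph.conf
    [client]
             rbd cache = true
             rbd cache writethrough until flush = false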

    We made the following adjustments to Ceph:
    Code:
    /etc/ceph/ceph.conf
    [global]
             auth client required = cephx
             auth cluster required = cephx
             auth service required = cephx
             cluster network = 10.254.1.0/24
             debug ms = 0/0
             filestore xattr use omap = true
             fsid = a3f1c21f-f883-48e0-9bd2-4f869c72b17d
             keyring = /etc/pve/priv/$cluster.$name.keyring
             mon allow pool delete = true
             osd deep scrub interval = 1209600
             osd scrub begin hour = 19
             osd scrub end hour = 6
             osd scrub sleep = 0.1
             public network = 10.254.1.0/24
    Essentially:
    • Disabled debug messages
    • Set the deep scrub interval to 2 weeks (1209600 seconds)
    • Restricted the scrub window to begin at 7pm and end at 6am
    • Set the OSD scrub sleep to 0.1 seconds between chunks
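
    As a quick sanity check, the interval above is exactly two weeks:
    Code:
    # 14 days * 24 hours * 3600 seconds
    echo $((14 * 24 * 3600))
    1209600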

    Finally, we schedule the deep scrubs ourselves via cron:
    Code:
    /etc/cron.d/ceph-scrub-deep
    0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
    0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10

    The '/root/ceph-deep-scrub-pg-ratio' script:
    Code:
    #!/bin/bash
    
    # /etc/cron.d/ceph-scrub-deep
    #  0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
    #  0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10
    #    Scrub 25% of placement groups that were last scrubbed the longest time ago, starting at 1am Sunday and Monday
    #    Scrub 10% of placement groups that were last scrubbed the longest time ago, starting at 1am Tuesday through Saturday
    
    set -o nounset
    set -o errexit
    
    CEPH=/usr/bin/ceph
    AWK=/usr/bin/awk
    SORT=/usr/bin/sort
    HEAD=/usr/bin/head
    DATE=/bin/date
    SED=/bin/sed
    GREP=/bin/grep
    PYTHON=/usr/bin/python
    
    
    DEEPMARK="scrubbing+deep";              # String that marks a deep scrubbing state in 'ceph pg' output
    MAXSCRUBS=2;                            # Max concurrent deep scrubs operations
    
    workratio=$1;
    [ "x$workratio" == x ] && workratio=7;  # Set work ratio from first arg; fall back to '7'.
    
    
    function isNewerThan() {
        # Args: [PG] [TIMESTAMP]
        # Output: None
        # Returns: 0 if changed; 1 otherwise
        # Desc: Check if a placement group "PG" deep scrub stamp has changed
        # (i.e != "TIMESTAMP")
        pg=$1;
        ots=$2;
        ndate=$($CEPH pg $pg query -f json-pretty | \
            $PYTHON -c 'import json, sys; print(json.loads(sys.stdin.read())["info"]["stats"]["last_deep_scrub_stamp"])');
        nts=$($DATE -d "$ndate" +%s);
        [ $ots -ne $nts ] && return 0;
        return 1;
    }
    
    function scrubbingCount() {
        # Args: None
        # Output: int
        # Returns: 0
        # Desc: Outputs the number of concurrent deep scrub operations.
        cnt=$($CEPH -s | $GREP $DEEPMARK | $AWK '{ print $1; }');
        [ "x$cnt" == x ] && cnt=0;
        echo $cnt;
        return 0;
    }
    
    function waitForScrubSlot() {
        # Args: None
        # Output: Informative text
        # Returns: true
        # Desc: Idle loop waiting for a free deepscrub slot.
        while [ $(scrubbingCount) -ge $MAXSCRUBS ]; do
            sleep 1;
        done
        return 0;
    }
    
    function deepScrubPg() {
        # Args: [PG]
        # Output: Informative text
        # Return: 0 when PG is effectively deep scrubbing
        # Desc: Start a PG "PG" deep-scrub
        $CEPH pg deep-scrub $1 >& /dev/null;
        # Must sleep as Ceph does not immediately start scrubbing,
        # so we wait until the wanted PG actually enters the deep scrubbing state...
        local emergencyCounter=0;
        while ! $CEPH pg $1 query | $GREP state | $GREP -q $DEEPMARK; do
            isNewerThan $1 $2 && break;
            test $emergencyCounter -gt 150 && break;
            sleep 1;
            emergencyCounter=$((emergencyCounter + 1));
        done
        sleep 2;
        return 0;
    }
    
    
    function getOldestScrubs() {
        # Args: [num_res]
        # Output: [num_res] PG ids
        # Return: 0
        # Desc: Get the "num_res" oldest deep-scrubbed PGs
        numres=$1;
    [ "x$numres" == x ] && numres=20;
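    # NOTE: the awk column numbers below ($10 = PG state, $23/$24 = last deep scrub
    # date and time) match this cluster's 'ceph pg dump pgs' layout and may shift
    # between Ceph releases.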
        $CEPH pg dump pgs 2>/dev/null | \
            $AWK '/^[0-9]+\.[0-9a-z]+/ { if($10 == "active+clean") {  print $1,$23,$24 ; }; }' | \
            while read line; do set $line; echo $1 $($DATE -d "$2 $3" +%s); done | \
            $SORT -n -k2 | \
            $HEAD -n $numres;
        return 0;
    }
    
    function getPgCount() {
        # Args:
        # Output: number of total PGs
        # Desc: Output the total number of "active+clean" PGs
        $CEPH pg stat | $SED 's/^.* \([0-9]\+\) active+clean[^+].*/\1/g';
    }
    
    
    # Get PG count
    pgcnt=$(getPgCount);
    # Get the number of PGs we'll be working on
    pgwork=$((pgcnt / workratio + 1));
    
    # Actual work starts here, quite self-explanatory.
    logger -t ceph_scrub "Ceph deep scrub - Start on $((100 / workratio))% of $pgcnt PGs = $pgwork PGs";
    getOldestScrubs $pgwork | while read line; do
        set $line;
        waitForScrubSlot;
        deepScrubPg $1 $2;
    done
    logger -t ceph_scrub "Ceph deep scrub - End";
    Credit to Johannes Formanns, whose script I barely altered:
    https://www.formann.de/2015/05/cronjob-to-enable-timed-deep-scrubbing-in-a-ceph-cluster/
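
    To test the schedule, the script can also be run by hand with an explicit work ratio. It logs start and end markers via 'logger -t ceph_scrub', so, assuming a standard rsyslog setup, a run can be confirmed with something like:
    Code:
    # deep scrub the oldest ~10% of PGs immediately
    /root/ceph-deep-scrub-pg-ratio 10
    
    # check the start/end markers
    grep ceph_scrub /var/log/syslog
    journalctl -t ceph_scrub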


    Once the script has been running for a while, you can view the deep scrub distribution with the following commands:
    Code:
    [root@kvm5a ~]# ceph pg dump | grep active | awk '{print $23}' | sort | uniq -c
    dumped all
         51 2017-12-02
        267 2017-12-03
        183 2017-12-04
         91 2017-12-05
    
    [root@kvm5a ~]# for date in `ceph pg dump | grep active | awk '{print $23}'`; do date +%A -d $date; done | sort | uniq -c;
    dumped all
        183 Monday
         51 Saturday
        267 Sunday
         91 Tuesday
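
    The awk column numbers above depend on the Ceph release. An untested alternative reads the JSON dump instead (requires jq; depending on the release the PG stats array sits either at the top level or under 'pg_map'):
    Code:
    ceph pg dump -f json 2>/dev/null | \
        jq -r '(.pg_stats // .pg_map.pg_stats)[].last_deep_scrub_stamp[:10]' | \
        sort | uniq -c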
     
  2. David Herselman

    David Herselman Active Member
    Proxmox Subscriber

    Joined:
    Jun 8, 2016
    Messages:
    185
    Likes Received:
    38
    You can make changes to running OSDs without having to restart their services:
    Code:
    ceph tell osd.* injectargs '--debug_ms 0/0';
    ceph tell osd.* injectargs '--osd_deep_scrub_interval 1209600';
    ceph tell osd.* injectargs '--osd_scrub_begin_hour 19';
    ceph tell osd.* injectargs '--osd_scrub_end_hour 6';
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.1';
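
    injectargs also accepts several options in a single quoted string, so the above can be collapsed into one call:
    Code:
    ceph tell osd.* injectargs '--debug_ms 0/0 --osd_deep_scrub_interval 1209600 --osd_scrub_begin_hour 19 --osd_scrub_end_hour 6 --osd_scrub_sleep 0.1'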

    Check effective settings:
    Code:
    for f in /var/run/ceph/ceph-osd.*.asok; do ceph --admin-daemon $f config show; done | grep 'debug_ms\|osd_deep_scrub_interval\|osd_scrub_begin_hour\|osd_scrub_end_hour\|osd_scrub_sleep'
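
    Note that injected values do not survive an OSD restart, hence the ceph.conf changes in the first post. Newer Ceph releases that ship the central configuration database should also allow storing the same settings cluster-wide, for example:
    Code:
    ceph config set osd osd_deep_scrub_interval 1209600
    ceph config set osd osd_scrub_begin_hour 19
    ceph config set osd osd_scrub_end_hour 6
    ceph config set osd osd_scrub_sleep 0.1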
     
    RobFantini and gosha like this.