Whilst our cluster can sustain backfills and deep scrubs during normal operation, deep scrubs do affect storage I/O when they run during production hours (7am-7pm, Monday through Friday). This particularly affects legacy Linux VMs running kernels prior to 2.6.32, which don't send 'flush' requests, so RBD never transitions to writeback caching for them.
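Before changing anything you can confirm the overlap yourself; the commands below are just a quick check (the column index assumes the Pacific 'pg dump pgs' layout used further down) showing whether any deep scrubs are in flight and when each PG was last deep scrubbed:
Code:
# Any deep scrubs currently running?
ceph -s | grep -i scrub
# Per-PG deep scrub timestamps (column 23 in Pacific's 'pg dump pgs' output)
ceph pg dump pgs 2>/dev/null | grep active | awk '{ print $1, $23 }' | head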
We made the following adjustments to Ceph:
Code:
/etc/ceph/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.254.1.0/24
debug ms = 0/0
filestore xattr use omap = true
fsid = a3f1c21f-f883-48e0-9bd2-4f869c72b17d
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd deep scrub interval = 1209600
osd scrub begin hour = 19
osd scrub end hour = 6
osd scrub sleep = 0.1
public network = 10.254.1.0/24
Essentially:
- Disabled debug messages
- Set the deep scrub interval to 2 weeks (1209600 seconds)
- Restricted scrub hours to begin at 7pm and end at 6am
- Set the OSD scrub sleep to 0.1 seconds between chunks
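These values are read from ceph.conf when the OSDs start; if you would rather apply them to a running cluster without restarts, something along these lines should work (a sketch assuming the centralized config store available since Mimic, mirroring the scrub settings above):
Code:
ceph config set osd osd_deep_scrub_interval 1209600
ceph config set osd osd_scrub_begin_hour 19
ceph config set osd osd_scrub_end_hour 6
ceph config set osd osd_scrub_sleep 0.1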
Finally, we schedule deep scrubs manually via cron: 25% of PGs on Sunday and Monday nights and 10% on each remaining night, so every PG is deep scrubbed roughly once a week, well within the 2 week interval:
Code:
/etc/cron.d/ceph-scrub-deep
0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10
The '/root/ceph-deep-scrub-pg-ratio' script:
Code:
#!/bin/bash
# /etc/cron.d/ceph-scrub-deep
# 0 1 * * 0,1 root /root/ceph-deep-scrub-pg-ratio 4
# 0 1 * * 2-6 root /root/ceph-deep-scrub-pg-ratio 10
# Scrub 25% of placement groups that were last scrubbed the longest time ago, starting at 1am Sunday and Monday
# Scrub 10% of placement groups that were last scrubbed the longest time ago, starting at 1am Tuesday through Saturday
set -o nounset
set -o errexit
CEPH=/usr/bin/ceph
AWK=/usr/bin/awk
SORT=/usr/bin/sort
HEAD=/usr/bin/head
DATE=/bin/date
SED=/bin/sed
GREP=/bin/grep
PYTHON=/usr/bin/python3
DEEPMARK="scrubbing+deep"; # What string marks a deep scrubbing state in ceph pg's output?
MAXSCRUBS=2; # Max concurrent deep scrubs operations
workratio=${1:-7}; # Set work ratio from first arg; fall back to '7' (avoids an unbound $1 under 'nounset').
function isNewerThan() {
# Args: [PG] [TIMESTAMP]
# Output: None
# Returns: 0 if changed; 1 otherwise
# Desc: Check if a placement group "PG" deep scrub stamp has changed
# (i.e != "TIMESTAMP")
pg=$1;
ots=$2;
ndate=$($CEPH pg $pg query -f json-pretty | \
$PYTHON -c 'import json;import sys; print(json.load(sys.stdin)["info"]["stats"]["last_deep_scrub_stamp"])');
nts=$($DATE -d "$ndate" +%s);
[ $ots -ne $nts ] && return 0;
return 1;
}
function scrubbingCount() {
# Args: None
# Output: int
# Returns: 0
# Desc: Outputs the number of concurrent deep scrubbing tasks.
cnt=$($CEPH -s | $GREP $DEEPMARK | $AWK '{ print $1; }');
[ "x$cnt" == x ] && cnt=0;
echo $cnt;
return 0;
}
function waitForScrubSlot() {
# Args: None
# Output: Informative text
# Returns: true
# Desc: Idle loop waiting for a free deepscrub slot.
while [ $(scrubbingCount) -ge $MAXSCRUBS ]; do
sleep 1;
done
return 0;
}
function deepScrubPg() {
# Args: [PG]
# Output: Informative text
# Return: 0 when PG is effectively deep scrubbing
# Desc: Start a PG "PG" deep-scrub
$CEPH pg deep-scrub $1 >& /dev/null;
# Must sleep as ceph does not immediately start scrubbing
# So we wait until wanted PG effectively goes into deep scrubbing state...
local emergencyCounter=0;
while ! $CEPH pg $1 query | $GREP state | $GREP -q $DEEPMARK; do
isNewerThan $1 $2 && break;
test $emergencyCounter -gt 150 && break;
sleep 1;
emergencyCounter=$((emergencyCounter + 1));
done
sleep 2;
return 0;
}
function getOldestScrubs() {
# Args: [num_res]
# Output: [num_res] PG ids
# Return: 0
# Desc: Get the "num_res" oldest deep-scrubbed PGs
numres=$1;
[ x$numres == x ] && numres=20;
$CEPH pg dump pgs 2>/dev/null | \
$AWK '/^[0-9]+\.[0-9a-z]+/ { if($12 == "active+clean") { print $1,$23 ; }; }' | \
while read line; do set $line; echo $1 $($DATE -d "$2" +%s); done | \
$SORT -n -k2 | \
$HEAD -n $numres;
return 0;
}
function getPgCount() {
# Args:
# Output: number of total PGs
# Desc: Output the total number of "active+clean" PGs
$CEPH pg stat | $SED 's/^.* \([0-9]\+\) active+clean[^+].*/\1/g';
}
# Get PG count
pgcnt=$(getPgCount);
# Get the number of PGs we'll be working on
pgwork=$((pgcnt / workratio + 1));
# Actual work starts here, quite self-explanatory.
logger -t ceph_scrub "Ceph deep scrub - Start on $((100 / workratio))% of $pgcnt PGs = $pgwork PGs";
getOldestScrubs $pgwork | while read line; do
set $line;
waitForScrubSlot;
deepScrubPg $1 $2;
done
logger -t ceph_scrub "Ceph deep scrub - End";
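To test the schedule outside of cron, the script can be run by hand; start/end messages go to syslog via logger (use journalctl -t ceph_scrub on journal-only systems):
Code:
chmod +x /root/ceph-deep-scrub-pg-ratio
/root/ceph-deep-scrub-pg-ratio 10
grep ceph_scrub /var/log/syslog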
Credit to Johannes Formanns, whose script I barely altered:
https://www.formann.de/2015/05/cronjob-to-enable-timed-deep-scrubbing-in-a-ceph-cluster/
Once the script has been running for a while, you can view the deep scrub distribution with the following commands:
Code:
[root@kvm5a ~]# ceph pg dump | grep active | awk '{print $23}' | cut -dT -f1 | sort | uniq -c
dumped all
51 2017-12-02
267 2017-12-03
183 2017-12-04
91 2017-12-05
[root@kvm5a ~]# for date in `ceph pg dump | grep active | awk '{print $23}' | cut -dT -f1`; do date +%A -d $date; done | sort | uniq -c;
dumped all
183 Monday
51 Saturday
267 Sunday
91 Tuesday
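As an additional check, recent Ceph releases (Nautilus and later) raise a health warning for PGs that miss the deep scrub interval, so no output here means the schedule is keeping up:
Code:
ceph health detail | grep -i 'not deep-scrubbed'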
NB: This script has been updated to work with Ceph Pacific (16), previous edits of this forum post provide support for Octopus (15), Nautilus (14) and Luminous (12).