ceph 'slow requests are blocked'

RobFantini

Hello,

Approximately once per day we are seeing 'slow requests are blocked' in /var/log/ceph/ceph.log.

I stumbled across the fact that it was occurring daily when checking ceph.log for something else, so now we have a couple of cron jobs that check for the event. I'll post those next.

The hardware we use is very good - not perfect, but good enough that I do not think it is the cause of the slow requests. We use 65 SSDs - 2 different models, both listed as recommended on the pve ceph wiki page - a 10G network, very little disk I/O and very little network bandwidth. I could be wrong on those assumptions, but for now let's say I am correct - that it is not a hardware issue causing the slow requests.

Eliminating the slow requests has stayed towards the top of my TBD list for months, because extreme slow requests can lead to data file corruption. So eliminating the few seconds per day of slow requests is important.
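
Besides grepping the log after the fact, the cluster can also be checked live. A minimal sketch (the exact warning wording varies a bit between ceph releases):

Code:
# live view of cluster health; blocked/slow requests show up as a HEALTH_WARN
ceph -s
ceph health detail | grep -i 'slow request'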
 
Cron jobs to detect slow requests:

We want an email daily, and on the 1st occurrence of slow requests each day.

* crontab:
Code:
# we install this crontab on an admin pc or container.  we do not run it on the node we want
# to check, as sometimes we replace nodes and forget to reinstall this crontab.
#
# daily - we do logrotate at 23:59 on all systems, so you may want to change the min/hour
# to suit your logrotate schedule.
#
# change pve3 to the node you want to check.
58 23 * * * root   ssh pve3 'grep "slow requests are blocked" /var/log/ceph/ceph.log'

#
# 2018-11-16   this will send just the 1ST HEALTH_WARN of the day
#
0  * * * * root   /usr/local/bin/ceph-check1

# rm the file that makes sure /usr/local/bin/ceph-check1 only reports 1x per day
1  0 * * * root   rm -f /root/.cron-tmp/ceph-check1 /tmp/ceph-check1
* ceph-check1:
Code:
#!/bin/bash
# ceph-check1
#
# send an email on the 1st 'slow requests are blocked' occurrence per day.
# the done-file gets deleted near midnight by cron, so the check can occur again the next day.
#

mkdir -p /root/.cron-tmp

if [ ! -e /root/.cron-tmp/ceph-check1 ]; then
    #
    # change pve3 to the node you want to check.
    # there are of course other and probably better ways to do this.
    #
    ssh pve3 'grep "slow requests are blocked" /var/log/ceph/ceph.log' > /tmp/ceph-check1

    if [ -s /tmp/ceph-check1 ]; then
        #
        # this makes it so the script will not rerun today.  this file is deleted at midnight by crontab.
        #
        date > /root/.cron-tmp/ceph-check1

        # cron mails whatever we print here
        cat /tmp/ceph-check1
        echo '
This script reports only up to the 1ST ceph issue of the day.  another script run at midnight sends a report for all issues.  i do this as I want to know whether changes being made to the systems have fixed the issue or not.'
    fi

    rm -f /tmp/ceph-check1
fi
 
The last thing I tried to eliminate the issue:

We were using a network bond for the ceph 10G network.

I tried using just a single nic.

A few days later the slow request email came again.

Note that in the past, restarting all nodes has helped prevent the slow requests for a few days.
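
One cheap sanity check for the 'it is not the hardware / network' assumption is to look at the error and drop counters on the ceph-facing nics (and on the bond while it was in use). A sketch - the interface names here are placeholders, substitute your own:

Code:
# per-interface packet / error / drop counters
ip -s link show

# NIC-level counters, filtered for anything that looks like an error or drop
# (replace enp3s0f0 with your ceph-facing interface)
ethtool -S enp3s0f0 | grep -iE 'err|drop|discard'

# if the bond is still configured, its state and per-slave failure counts
cat /proc/net/bonding/bond0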
 
Next - after searching I saw this:
https://techtran.science/2018/06/06/ceph-solarflare-and-proxmox-slow-requests-are-blocked/

"
Are you seeing lots of `slow requests are blocked` errors during high throughput on your Ceph storage?

We were experiencing serious issues on two supermicro nodes with IOMMU enabled (Keywords: dmar dma pte vpfn) but even on our ASRack C2750 system things weren’t behaving as they should.

We were tearing our hair out trying to figure out what was going on. Especially as we had been using my Solarflare Dual SFP+ 10GB NICs for non-ceph purposes for years."


"The answer in this case was to manually install the sfc driver from Solarflare’s website (kudos to solarflare for providing active driver releases covering 5+ year old hardware btw)."

Check the videos on 'What is kernel Bypass' and others. This module looks promising. If it works we'll look at their network adapters.
And the fact that the active drivers cover older hardware is a very good / normal feature.

So I'll try that next.

[ If it works, the downside is the need to build a kernel module for every pve kernel upgrade. For a couple of years we used to do that for zfs. If the kernel module fixes the issue then so what. ]
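
Before and after installing the vendor driver it is worth confirming which module and version the nic is actually bound to. A sketch - the interface name is a placeholder, and whether the Solarflare package uses dkms depends on their installer:

Code:
# driver name, version and firmware currently bound to the interface
# (replace enp4s0f0 with your 10G interface)
ethtool -i enp4s0f0

# version of the sfc module that modprobe would load
modinfo sfc | grep -E '^(filename|version)'

# if the vendor package is dkms-based, check it rebuilt for the running kernel
dkms status
uname -r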
 
I do not think that kernel module will work for non-Solarflare nics...

We use Intel 10G nics.

We may just purchase some used Solarflare nics - a quick search on ebay shows them available for around $30 each.

Edit - those are SFP, so we may go down that path... the $30 ones are old, and we're considering newer SFP+ models...
 
Hi,
it could be interesting to have graphs of the ceph performance counters (maybe with telegraf/influxdb or prometheus + grafana dashboards).
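
Even without a full monitoring stack you can pull the counters straight from a daemon's admin socket on the node; a quick sketch, assuming osd.0 runs locally:

Code:
# dump all performance counters of a running OSD
ceph daemon osd.0 perf dump | less

# the throttle-* sections are the ones related to the throttling mentioned below
ceph daemon osd.0 perf dump | grep -A 5 'throttle-'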

I recently had slow requests on a full-NVMe cluster because of ceph throttling (enabled by default, since the defaults are optimised for hdd).

I can advise you to read
http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments




Code:
# disable ceph auth
auth client required = none
auth cluster required = none
auth service required = none

#disable debug (on ceph nodes, but also proxmox nodes if they are not the same box)

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

#disable throttling

objecter_inflight_ops = 10240
objecter_inflight_op_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
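
Most of the throttle / op-tracker and debug settings above can also be injected into running OSDs to try them out before (or in addition to) putting them in ceph.conf and restarting. A sketch only - the auth lines do need a restart, and disabling cephx has obvious security implications:

Code:
# apply to all OSDs at runtime; keep the settings in ceph.conf so they survive restarts
ceph tell osd.* injectargs '--osd_enable_op_tracker false --osd_client_message_cap 0'
ceph tell osd.* injectargs '--debug_ms 0/0 --debug_osd 0/0 --debug_filestore 0/0'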

and use cache=none for the vm disks. (cache=writeback slows down reads because of a global mutex, and only improves sequential writes of small blocks)
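
On the proxmox side the cache mode is set per VM disk; a sketch, assuming VM 101 with a scsi0 disk - reuse the exact volume string that qm config shows, changing only the cache option, and the change takes effect after the VM is restarted:

Code:
# show the current disk line, e.g.  scsi0: ceph-vm:vm-101-disk-0,cache=writeback,size=32G
qm config 101 | grep scsi0

# re-set the same volume with cache=none (or drop the cache option entirely, none is the default)
qm set 101 --scsi0 ceph-vm:vm-101-disk-0,cache=none,size=32G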
 
Hello Spirit,

interesting article - I'll need to go through it a few times.

Did you make any changes to /etc/sysctl.conf?

I'll start with your suggested ceph.conf changes.
 
So this looks like the issue we ran into:

"
Memory Tuning
Ceph default packages use tcmalloc. For flash optimized configurations, we found jemalloc providing best possible performance without performance degradation over time. Ceph supports jemalloc for the hammer release and later releases but you need to build with jemalloc option enabled.

Below graph in figure 4 shows how thread cache size impacts throughput. By tuning thread cache size, performance is comparable between TCMalloc and JEMalloc. However as shown in Figure 5 and Figure 6, TCMalloc performance degrades over time unlike JEMalloc. "


What needs to be done to use JEMalloc instead of TCMalloc?



PS: this thread contradicts the above conclusion: https://forum.proxmox.com/threads/ceph-using-jemalloc.43302/

Still, it is worth trying JEMalloc. We have seen a relationship between how long systems have been running and increased latency / slow requests.
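
For reference, the way people switched allocators at the time was an LD_PRELOAD override rather than rebuilding ceph. A rough sketch only - the library path is an assumption for Debian/Ubuntu, and see the reply below about BlueStore before actually doing this:

Code:
# see which allocator the OSD binary currently pulls in
ldd /usr/bin/ceph-osd | grep -Ei 'tcmalloc|jemalloc'

# commonly posted workaround: preload jemalloc for the ceph daemons
# (the .so path is an assumption - verify with: dpkg -L libjemalloc1)
echo 'LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1' >> /etc/default/ceph
systemctl restart ceph-osd.target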

S
 
RobFantini said: What needs to be done to use JEMalloc instead of TCMalloc?

No, that was a long time ago (jewel). Now tcmalloc is working fine (because the async messenger is the default now), and jemalloc doesn't work with bluestore.
 
