Ceph - Slow requests are blocked

Discussion in 'Proxmox VE: Networking and Firewall' started by holr, Jun 14, 2019 at 23:12.

  1. holr

    holr New Member

    Joined:
    Jan 4, 2019
    Messages:
    6
    Likes Received:
    0
    Hello,
I'm receiving the message "slow requests are blocked" several times a day, and I am having trouble identifying the root cause. I have tried various troubleshooting guides, such as:
    https://access.redhat.com/documenta...ml/troubleshooting_guide/troubleshooting-osds
    and
    https://forum.proxmox.com/threads/ceph-slow-requests-are-blocked.48955/

but, sadly, running through the tweaks (e.g. tuning for SSD performance) and the network tests hasn't helped me identify or resolve the problem.

I was on Proxmox 5.3, which did not exhibit the issue (except under heavy workloads). I wiped the servers, installed 5.4, and used the new Ceph wizard to get up and running.

Hardware-wise, I'm using four servers. Each has 2 x Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 1TB of RAM, and Crucial MX300 2TB SSDs (two servers have 4 drives, two have 5), for 18 OSDs altogether. Networking-wise, there is a 10Gb management network and a 40Gb Ceph network for the OSDs to communicate. I'll be adding a fifth node later (though not to this spec, just for quorum!). Running the network throughput tests described in the Red Hat guide shows the expected throughput on both the 10Gb and 40Gb networks.

I'm not running many VMs right now, only 6 or so to get things going (Windows AD server, PXE server, etc.) and a few test workstation VMs. When the cluster health is OK, it works well, but the "slow requests are blocked" message often appears and causes the few VMs to hang.

Leaving the cluster alone for a long while will clear the log, but I cannot figure out the root cause, considering the Proxmox usage tends to be light. I'd rather resolve this now, with half a dozen VMs, before hundreds of VMs cause major bottlenecks!
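    For reference, the guides above basically start by identifying which OSDs the warning actually names, with commands along these lines (osd.3 below is just an example id):
    Code:
    # list the health warnings and the OSDs with slow/blocked requests
    ceph health detail
    # on the node hosting a suspect OSD, dump its slowest recent operations via the admin socket
    ceph daemon osd.3 dump_historic_ops
    # and the requests currently in flight
    ceph daemon osd.3 dump_ops_in_flight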

    Any advice anyone can provide would be greatly appreciated.

    Code:
    [global]
             auth client required = none
             auth cluster required = none
             auth service required = none
             cluster network = 10.10.10.0/24
             debug_asok = 0/0
             debug_auth = 0/0
             debug_buffer = 0/0
             debug_client = 0/0
             debug_context = 0/0
             debug_crush = 0/0
             debug_filer = 0/0
             debug_filestore = 0/0
             debug_finisher = 0/0
             debug_heartbeatmap = 0/0
             debug_journal = 0/0
             debug_journaler = 0/0
             debug_lockdep = 0/0
             debug_mon = 0/0
             debug_monc = 0/0
             debug_ms = 0/0
             debug_objclass = 0/0
             debug_objectcatcher = 0/0
             debug_objecter = 0/0
             debug_optracker = 0/0
             debug_osd = 0/0
             debug_paxos = 0/0
             debug_perfcounter = 0/0
             debug_rados = 0/0
             debug_rbd = 0/0
             debug_rgw = 0/0
             debug_throttle = 0/0
             debug_timer = 0/0
             debug_tp = 0/0
             fsid = c86a15fc-8c29-42f4-a332-3d7b76822502
             keyring = /etc/pve/priv/$cluster.$name.keyring
             mon allow pool delete = true
             osd journal size = 5120
             osd pool default min size = 2
             osd pool default size = 2
             public network = 192.168.1.0/24
    
    [mds]
             keyring = /var/lib/ceph/mds/ceph-$id/keyring
    
    [osd]
             keyring = /var/lib/ceph/osd/ceph-$id/keyring
    
    [mds.pve2]
             host = pve2
             mds standby for name = pve
    
    [mds.pve3]
             host = pve3
             mds standby for name = pve
    
    [mds.pve1]
             host = pve1
             mds standby for name = pve
    
    [mds.pve4]
             host = pve4
             mds standby for name = pve
    
    [mon.pve2]
             host = pve2
             mon addr = 192.168.1.102:6789
    
    [mon.pve4]
             host = pve4
             mon addr = 192.168.1.104:6789
    
    [mon.pve3]
             host = pve3
             mon addr = 192.168.1.103:6789
    
    [mon.pve1]
             host = pve1
             mon addr = 192.168.1.101:6789
    
    Code:
    # begin crush map
    tunable choose_local_tries 0
    tunable choose_local_fallback_tries 0
    tunable choose_total_tries 50
    tunable chooseleaf_descend_once 1
    tunable chooseleaf_vary_r 1
    tunable chooseleaf_stable 1
    tunable straw_calc_version 1
    tunable allowed_bucket_algs 54
    
    # devices
    device 0 osd.0 class ssd
    device 1 osd.1 class ssd
    device 2 osd.2 class ssd
    device 3 osd.3 class ssd
    device 4 osd.4 class ssd
    device 5 osd.5 class ssd
    device 6 osd.6 class ssd
    device 7 osd.7 class ssd
    device 8 osd.8 class ssd
    device 9 osd.9 class ssd
    device 10 osd.10 class ssd
    device 11 osd.11 class ssd
    device 12 osd.12 class ssd
    device 13 osd.13 class ssd
    device 14 osd.14 class ssd
    device 15 osd.15 class ssd
    device 16 osd.16 class ssd
    device 17 osd.17 class ssd
    
    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 region
    type 10 root
    
    # buckets
    host pve1 {
        id -3           # do not change unnecessarily
        id -4 class ssd # do not change unnecessarily
        # weight 9.096
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 1.819
        item osd.1 weight 1.819
        item osd.2 weight 1.819
        item osd.4 weight 1.819
        item osd.3 weight 1.819
    }
    host pve2 {
        id -5           # do not change unnecessarily
        id -6 class ssd # do not change unnecessarily
        # weight 9.096
        alg straw2
        hash 0  # rjenkins1
        item osd.7 weight 1.819
        item osd.8 weight 1.819
        item osd.9 weight 1.819
        item osd.10 weight 1.819
        item osd.11 weight 1.819
    }
    host pve3 {
        id -7           # do not change unnecessarily
        id -8 class ssd # do not change unnecessarily
        # weight 7.277
        alg straw2
        hash 0  # rjenkins1
        item osd.14 weight 1.819
        item osd.15 weight 1.819
        item osd.16 weight 1.819
        item osd.17 weight 1.819
    }
    host pve4 {
        id -13          # do not change unnecessarily
        id -14 class ssd        # do not change unnecessarily
        # weight 7.277
        alg straw2
        hash 0  # rjenkins1
        item osd.5 weight 1.819
        item osd.6 weight 1.819
        item osd.12 weight 1.819
        item osd.13 weight 1.819
    }
    root default {
        id -1           # do not change unnecessarily
        id -2 class ssd # do not change unnecessarily
        # weight 32.747
        alg straw2
        hash 0  # rjenkins1
        item pve1 weight 9.096
        item pve2 weight 9.096
        item pve3 weight 7.277
        item pve4 weight 7.277
    }
    
    # rules
    rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }
    # end crush map
     
  2. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    445
    Likes Received:
    37
    How many pools, how many PGs, and what replica size do you have? What about your usage? What about other metrics (e.g. disk latency, I/O wait)? What controller do you use?

    One thing I completely don't understand here: you put 1TB of RAM in your servers and then bought cheap desktop SSDs? This may be your root cause.
     
  3. holr

    holr New Member

    Joined:
    Jan 4, 2019
    Messages:
    6
    Likes Received:
    0
    Hi sb-jw,
    I have two pools, one of which is cephfs. For the VM disk pool, size/min is 2/2 and the PG count is 1024. For the cephfs pool, size/min is 3/2 and the PG count is 256. In terms of usage, ceph df reports:
    Code:
    GLOBAL:
        SIZE        AVAIL       RAW USED     %RAW USED
        27.3TiB     22.0TiB      5.28TiB         19.34
    POOLS:
        NAME                ID     USED        %USED     MAX AVAIL     OBJECTS
        ceph_vm             7      1.77TiB     13.19       11.6TiB      551560
        cephfs_data         10      555GiB      6.54       7.75TiB      142162
        cephfs_metadata     11     15.8MiB         0       7.75TiB          25 
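    If the PG distribution per OSD is relevant here, it can be checked with something like this (the PGS and %USE columns in particular):
    Code:
    # per-OSD utilisation, variance and PG count, grouped by host
    ceph osd df tree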
    In terms of disk latency, ceph osd perf provides:
    Code:
    osd commit_latency(ms) apply_latency(ms)
      3                  8                 8
      1                 34                34
      0                  1                 1
      2                  6                 6
      4                  7                 7
     10                 44                44
     14                 27                27
     11                 19                19
     12                 10                10
      5                  8                 8
     16                 12                12
     13                  9                 9
      6                  0                 0
     17                  1                 1
      7                  9                 9
      8                 47                47
      9                  0                 0
     15                 10                10 

    I also ran through the benchmark tests described at the following page, to provide further information:
    http://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance

    Benchmark disks
    Ran within a folder on cephfs (/mnt/pve/cephfs/template/iso)

    Code:
    # dd if=/dev/zero of=here bs=1G count=1 oflag=direct
    1+0 records in
    1+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.5495 s, 93.0 MB/s
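    One caveat with the dd test above: it is a single large sequential write with oflag=direct, so it says little about small synchronous writes, which is what Ceph journaling/BlueStore metadata mostly generates. A rough sync-write sketch against a scratch file (filename and size are just examples) would be:
    Code:
    # 4k writes at queue depth 1 with an fdatasync after every write
    fio --name=fsync-test --filename=/root/fsync-test.bin --size=1G \
        --bs=4k --rw=write --ioengine=sync --fdatasync=1 \
        --runtime=60 --time_based --group_reporting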
    Benchmark networks
    Code:
    10gbit network
    # iperf -c 192.168.1.101
    ------------------------------------------------------------
    Client connecting to 192.168.1.101, TCP port 5001
    TCP window size: 2.50 MByte (default)
    ------------------------------------------------------------
    [  3] local 192.168.1.102 port 60300 connected with 192.168.1.101 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-10.0 sec  9.89 GBytes  8.49 Gbits/sec
    
    40gbit network
    # iperf -c 10.10.10.101
    ------------------------------------------------------------
    Client connecting to 10.10.10.101, TCP port 5001
    TCP window size: 85.0 KByte (default)
    ------------------------------------------------------------
    [  3] local 10.10.10.102 port 53264 connected with 10.10.10.101 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-10.0 sec  14.4 GBytes  12.4 Gbits/sec
    Benchmark a Ceph Storage Cluster
    Code:
    # rados bench -p scbench 10 write --no-cleanup
    hints = 1
    Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
    Object prefix: benchmark_data_pve1_196664
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16       152       136   543.961       544   0.0326539   0.0635914
        2      16       207       191   381.946       220   0.0502837    0.154164
        3      16       220       204   271.962        52   0.0353217    0.147315
        4      16       220       204   203.971         0           -    0.147315
        5      16       220       204   163.177         0           -    0.147315
        6      16       289       273   181.974        92   0.0330153    0.318871
        7      16       289       273   155.977         0           -    0.318871
        8      16       289       273    136.48         0           -    0.318871
        9      16       328       312   138.646        52   0.0715037    0.447853
       10      16       370       354   141.579       168   0.0320291    0.399981
    Total time run:         10.595065
    Total writes made:      371
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     140.065
    Stddev Bandwidth:       169.704
    Max bandwidth (MB/sec): 544
    Min bandwidth (MB/sec): 0
    Average IOPS:           35
    Stddev IOPS:            42
    Max IOPS:               136
    Min IOPS:               0
    Average Latency(s):     0.456343
    Stddev Latency(s):      0.992014
    Max latency(s):         4.89872
    Min latency(s):         0.0268852
    
    Read Sequential
    Code:
    # rados bench -p scbench 10 seq
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16       361       345   1378.22      1380  0.00436163   0.0421903
    Total time run:       1.318472
    Total reads made:     447
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1356.11
    Average IOPS:         339
    Stddev IOPS:          0
    Max IOPS:             345
    Min IOPS:             345
    Average Latency(s):   0.0455131
    Max latency(s):       0.295762
    Min latency(s):       0.0036308
    
    Read Random
    Code:
    # rados bench -p scbench 10 rand
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16       336       320   1279.54      1280   0.0210435   0.0460407
        2      16       666       650   1299.57      1320   0.0639649   0.0465566
        3      16      1047      1031   1374.27      1524   0.0158093   0.0447449
        4      16      1387      1371   1370.64      1360    0.168472   0.0449818
        5      16      1733      1717   1373.27      1384  0.00608685   0.0449307
        6      16      2072      2056   1370.35      1356   0.0464014   0.0451591
        7      15      2400      2385   1362.54      1316   0.0233917   0.0454411
        8      15      2743      2728   1363.69      1372   0.0243667   0.0455168
        9      16      3074      3058   1358.82      1320   0.0363461   0.0455592
       10      16      3413      3397   1358.51      1356   0.0159213   0.0456878
    Total time run:       10.087308
    Total reads made:     3413
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1353.38
    Average IOPS:         338
    Stddev IOPS:          16
    Max IOPS:             381
    Min IOPS:             320
    Average Latency(s):   0.045995
    Max latency(s):       0.464308
    Min latency(s):       0.00316918
    
    Benchmark a ceph block device using fio
    Code:
    # fio rbd.fio
    rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
    fio-2.16
    Starting 1 process
    rbd engine: RBD version: 1.12.0
    Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/524KB/0KB /s] [0/131/0 iops] [eta 00m:00s]
    rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=208140: Fri Jun 14 21:16:39 2019
      write: io=1024.0MB, bw=5182.9KB/s, iops=1295, runt=202317msec
        slat (usec): min=1, max=1319, avg= 6.52, stdev= 5.25
        clat (msec): min=2, max=1742, avg=24.69, stdev=95.42
         lat (msec): min=2, max=1742, avg=24.70, stdev=95.42
        clat percentiles (msec):
         |  1.00th=[    6],  5.00th=[    8], 10.00th=[    9], 20.00th=[   10],
         | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   13],
         | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   20],
         | 99.00th=[  578], 99.50th=[  840], 99.90th=[ 1172], 99.95th=[ 1287],
         | 99.99th=[ 1516]
        lat (msec) : 4=0.37%, 10=21.40%, 20=73.28%, 50=1.42%, 100=1.30%
        lat (msec) : 250=0.55%, 500=0.50%, 750=0.49%, 1000=0.45%, 2000=0.23%
      cpu          : usr=36.96%, sys=62.45%, ctx=231721, majf=0, minf=9852
      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
         issued    : total=r=0/w=262144/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=32
    
    Run status group 0 (all jobs):
      WRITE: io=1024.0MB, aggrb=5182KB/s, minb=5182KB/s, maxb=5182KB/s, mint=202317msec, maxt=202317msec
    
    Disk stats (read/write):
        dm-1: ios=0/3621, merge=0/0, ticks=0/20620, in_queue=20620, util=5.58%, aggrios=840/1754, aggrmerge=0/1938, aggrticks=212/19108, aggrin_queue=19324, aggrutil=5.69%
      sda: ios=840/1754, merge=0/1938, ticks=212/19108, in_queue=19324, util=5.69%
    
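    (For reference, a job file along the lines of the example on that wiki page would look roughly like this; the pool and image names below are placeholders, and size=1g is just inferred from the io=1024.0MB above.)
    Code:
    # rbd.fio - sketch of a fio job file for the rbd ioengine
    cat > rbd.fio <<'EOF'
    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=fio_test
    rw=randwrite
    bs=4k
    size=1g
    
    [rbd_iodepth32]
    iodepth=32
    EOF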
    
    Controller used; the most relevant lspci entries are listed below.
    Code:
    # lspci | grep controller
    00:1f.2 SATA controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode] (rev 05)
    03:00.0 Serial Attached SCSI controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
    
    
    Finally, in terms of the SSD choice, it was really a matter of cost. I have the original platter-based SAS drives available, but figured the all-SSD approach would help, even if not the best SSD drives.
     
  4. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    445
    Likes Received:
    37
    Can you show 'ceph -s' here?
    In my opinion the apply and commit latencies are a bit high. I run a Ceph cluster with 16x WD Blue SSDs and only 4x 1GbE NICs, and my latencies are around 2 - 10 ms.
    Did you use Jumbo Frames?
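    You can verify that end to end with something like this (the interface name is just an example):
    Code:
    # check the MTU on the Ceph-facing interface
    ip link show dev ens2 | grep mtu
    # verify 9000-byte frames pass without fragmentation (8972 = 9000 - 28 bytes of IP/ICMP headers)
    ping -M do -s 8972 -c 3 10.10.10.101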

    Have you checked the health of your Drives? What about the TBW?
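    Drive health and total writes can be read with smartctl, e.g. (device name and SMART attribute labels vary by model and firmware):
    Code:
    # overall health plus wear/TBW-related SMART attributes on one of the SSDs
    smartctl -a /dev/sda | grep -i -E 'overall-health|percent|wear|total_lbas|reallocat'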

    Are you using Filestore OSDs? If yes, I recommend changing them to BlueStore. I run all my OSDs with BlueStore and don't have such problems; maybe this will help.
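    Whether an OSD runs Filestore or BlueStore can be seen in its metadata, e.g.:
    Code:
    # prints the objectstore backend (filestore or bluestore) for every OSD
    ceph osd metadata | grep -e '"id"' -e '"osd_objectstore"'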
     
  5. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,172
    Likes Received:
    191
    Odd that iperf chose such different TCP window sizes for the two links. For the 40 GbE, use more threads for the iperf test. If you already did and it still stays at ~12 Gbit, then something is not right with that 40 GbE network.
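    Something like this on the client side, for example (using the 40 GbE address from the posts above):
    Code:
    # 8 parallel streams for 30 seconds against the 40 GbE peer
    iperf -c 10.10.10.101 -P 8 -t 30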

    Only the OSD data traffic runs on the 40 GbE network; every other communication with Ceph (e.g. client, MON, OSD, ...) runs on the 10 GbE network. And if corosync is on that network too, your cluster will be even more unstable (it may suffer resets if HA is enabled).
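    Which network corosync is actually using can be checked on any node, e.g.:
    Code:
    # shows the local ring address(es) and ring status
    corosync-cfgtool -s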

    The 0 MB/s intervals in the rados bench write output correspond to the slow requests; at some point the data can't be written to the OSDs.

    What hardware are you using exactly?
    Did you see our Ceph benchmark paper for comparison?
    https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
     
  6. holr

    holr New Member

    Joined:
    Jan 4, 2019
    Messages:
    6
    Likes Received:
    0
    Thank you everyone for your thoughts. After sb-jw's comments, I'm thinking there may be some issues with my cluster's SSDs. Because of an upcoming project I've reset the cluster, using the RAID controller and the default LVM installation method, without Ceph. After running pveperf on a server with 10,000rpm platter drives, and then again on the Crucial SSDs, I see some large differences in performance:

    SSD Crucial mx500 drives:
    Code:
    root@pve2:~# pveperf
    CPU BOGOMIPS:      294121.28
    REGEX/SECOND:      2847126
    HD SIZE:           93.99 GB (/dev/mapper/pve-root)
    BUFFERED READS:    276.17 MB/sec
    AVERAGE SEEK TIME: 0.21 ms
    FSYNCS/SECOND:     20.26
    DNS EXT:           36.22 ms
    DNS INT:           1.54 ms
    10000rpm SAS platter drives:
    Code:
    root@pve:~# pveperf
    CPU BOGOMIPS:      294122.24
    REGEX/SECOND:      2830959
    HD SIZE:           93.99 GB (/dev/mapper/pve-root)
    BUFFERED READS:    953.11 MB/sec
    AVERAGE SEEK TIME: 4.24 ms
    FSYNCS/SECOND:     2573.03
    DNS EXT:           34.81 ms
    DNS INT:           27.51 ms
    The FSYNCS/SECOND on the SSDs is incredibly low compared to the SAS drives. In a few weeks, when I have no demands on the servers, I'll go back and retry the Ceph clustering, learning from this experience and your help. Thank you!
     
  7. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,312
    Likes Received:
    131
  8. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,172
    Likes Received:
    191
    And don't use the RAID controller, not even for RAID0 or JBOD. Use an HBA; if you don't have one, try to flash the RAID controller into IT mode (or better, buy a new one).

    Even with low-performing disks, Ceph would throttle and the rados bench would show consistent write output.
    https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
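    For the Smart Array Gen9, whether the disks are really passed through can usually be checked with HPE's ssacli, roughly like this (assuming ssacli is installed; the exact output fields differ by firmware):
    Code:
    # look for HBA mode and for any logical drives sitting in front of the physical disks
    ssacli ctrl all show config detail | grep -i -e 'hba mode' -e 'logical drive'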
     