Ceph - Slow requests are blocked

holr

Well-Known Member
Jan 4, 2019
Hello,
I'm receiving the message "slow requests are blocked" several times a day, and I am having trouble identifying the root cause. I have tried various troubleshooting guides, such as:
https://access.redhat.com/documenta...ml/troubleshooting_guide/troubleshooting-osds
and
https://forum.proxmox.com/threads/ceph-slow-requests-are-blocked.48955/

but, sadly, running through the tweaks (e.g. tuning for SSD performance) and the network tests hasn't helped me identify or resolve the problem.

I was on Proxmox 5.3, which did not exhibit the issue (unless heavy workloads were underway). I wiped the servers, installed 5.4, and used the new Ceph wizard to get up and running.

Hardware-wise, I'm using four servers. Each server has 2 x Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 1 TB of RAM, and up to 5 x Crucial MX300 2TB SSDs (two servers have 5 drives, two have 4), for 18 OSDs altogether. Networking-wise, there is a 10 GbE management network and a 40 GbE Ceph network for the OSDs to communicate over. I'll be adding a fifth node later (though not to this spec, just for quorum!). Running the network throughput tests described in the Red Hat guide shows the expected throughput on both the 10 GbE and 40 GbE networks.

I'm not running many VMs right now, only six or so to get things going (Windows AD server, PXE server, etc.) and a few test workstation VMs. When the cluster health is OK it works well, but the "slow requests are blocked" message often appears and causes the few VMs to hang.

Leaving the cluster alone for a long while will clear the warning, but I cannot figure out the root cause, considering the Proxmox usage tends to be light. I'd rather fix it now with half a dozen VMs than wait for hundreds of VMs to hit major bottlenecks!
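
If it helps with diagnosis, I'm happy to capture output from something along these lines while the warning is active (just a sketch; the OSD id is a placeholder):

Code:
ceph health detail                       # lists which OSDs currently report blocked/slow requests
ceph daemon osd.<id> dump_ops_in_flight  # run on the node hosting that OSD
ceph daemon osd.<id> dump_historic_ops   # recent slowest ops with per-stage timings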

Any advice anyone can provide would be greatly appreciated.

Code:
[global]
         auth client required = none
         auth cluster required = none
         auth service required = none
         cluster network = 10.10.10.0/24
         debug_asok = 0/0
         debug_auth = 0/0
         debug_buffer = 0/0
         debug_client = 0/0
         debug_context = 0/0
         debug_crush = 0/0
         debug_filer = 0/0
         debug_filestore = 0/0
         debug_finisher = 0/0
         debug_heartbeatmap = 0/0
         debug_journal = 0/0
         debug_journaler = 0/0
         debug_lockdep = 0/0
         debug_mon = 0/0
         debug_monc = 0/0
         debug_ms = 0/0
         debug_objclass = 0/0
         debug_objectcatcher = 0/0
         debug_objecter = 0/0
         debug_optracker = 0/0
         debug_osd = 0/0
         debug_paxos = 0/0
         debug_perfcounter = 0/0
         debug_rados = 0/0
         debug_rbd = 0/0
         debug_rgw = 0/0
         debug_throttle = 0/0
         debug_timer = 0/0
         debug_tp = 0/0
         fsid = c86a15fc-8c29-42f4-a332-3d7b76822502
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 2
         public network = 192.168.1.0/24

[mds]
         keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mds.pve2]
         host = pve2
         mds standby for name = pve

[mds.pve3]
         host = pve3
         mds standby for name = pve

[mds.pve1]
         host = pve1
         mds standby for name = pve

[mds.pve4]
         host = pve4
         mds standby for name = pve

[mon.pve2]
         host = pve2
         mon addr = 192.168.1.102:6789

[mon.pve4]
         host = pve4
         mon addr = 192.168.1.104:6789

[mon.pve3]
         host = pve3
         mon addr = 192.168.1.103:6789

[mon.pve1]
         host = pve1
         mon addr = 192.168.1.101:6789

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pve1 {
id -3 # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
# weight 9.096
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.819
item osd.1 weight 1.819
item osd.2 weight 1.819
item osd.4 weight 1.819
item osd.3 weight 1.819
}
host pve2 {
id -5 # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
# weight 9.096
alg straw2
hash 0 # rjenkins1
item osd.7 weight 1.819
item osd.8 weight 1.819
item osd.9 weight 1.819
item osd.10 weight 1.819
item osd.11 weight 1.819
}
host pve3 {
id -7 # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
# weight 7.277
alg straw2
hash 0 # rjenkins1
item osd.14 weight 1.819
item osd.15 weight 1.819
item osd.16 weight 1.819
item osd.17 weight 1.819
}
host pve4 {
id -13 # do not change unnecessarily
id -14 class ssd # do not change unnecessarily
# weight 7.277
alg straw2
hash 0 # rjenkins1
item osd.5 weight 1.819
item osd.6 weight 1.819
item osd.12 weight 1.819
item osd.13 weight 1.819
}
root default {
id -1 # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
# weight 32.747
alg straw2
hash 0 # rjenkins1
item pve1 weight 9.096
item pve2 weight 9.096
item pve3 weight 7.277
item pve4 weight 7.277
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
 
How many pools, how many PGs, and what replica count do you have? What about your usage? What about other metrics (e.g. disk latency, I/O wait)? What controller do you use?

One thing I completely don't understand here: you put 1 TB of RAM into your servers and then bought cheap desktop SSDs? This may well be your root cause.
 
Hi sb-jw,
I have two pools, one of which is CephFS. For the VM disk pool, size/min_size is 2/2 with 1024 PGs. For the CephFS pool, size/min_size is 3/2 with 256 PGs. In terms of usage, ceph df provides:
Code:
GLOBAL:
    SIZE        AVAIL       RAW USED     %RAW USED
    27.3TiB     22.0TiB      5.28TiB         19.34
POOLS:
    NAME                ID     USED        %USED     MAX AVAIL     OBJECTS
    ceph_vm             7      1.77TiB     13.19       11.6TiB      551560
    cephfs_data         10      555GiB      6.54       7.75TiB      142162
    cephfs_metadata     11     15.8MiB         0       7.75TiB          25
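
As a back-of-the-envelope check on those PG numbers (ignoring the small cephfs_metadata pool), that works out to roughly:

Code:
# PG replicas per OSD: (1024 PGs x size 2 + 256 PGs x size 3) / 18 OSDs
echo $(( (1024*2 + 256*3) / 18 ))    # ~156

which is on the higher side of the often-quoted target of around 100 PGs per OSD.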

In terms of disk latency, ceph osd perf provides:
Code:
osd commit_latency(ms) apply_latency(ms)
  3                  8                 8
  1                 34                34
  0                  1                 1
  2                  6                 6
  4                  7                 7
 10                 44                44
 14                 27                27
 11                 19                19
 12                 10                10
  5                  8                 8
 16                 12                12
 13                  9                 9
  6                  0                 0
 17                  1                 1
  7                  9                 9
  8                 47                47
  9                  0                 0
 15                 10                10


To provide further information, I ran through the benchmark tests described at:
http://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance

Benchmark disks
Ran within a folder on cephfs (/mnt/pve/cephfs/template/iso)

Code:
# dd if=/dev/zero of=here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.5495 s, 93.0 MB/s

Benchmark networks
Code:
10gbit network
# iperf -c 192.168.1.101
------------------------------------------------------------
Client connecting to 192.168.1.101, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 192.168.1.102 port 60300 connected with 192.168.1.101 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  9.89 GBytes  8.49 Gbits/sec

40gbit network
# iperf -c 10.10.10.101
------------------------------------------------------------
Client connecting to 10.10.10.101, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 10.10.10.102 port 53264 connected with 10.10.10.101 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  14.4 GBytes  12.4 Gbits/sec

Benchmark a Ceph Storage Cluster
Code:
# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve1_196664
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       152       136   543.961       544   0.0326539   0.0635914
    2      16       207       191   381.946       220   0.0502837    0.154164
    3      16       220       204   271.962        52   0.0353217    0.147315
    4      16       220       204   203.971         0           -    0.147315
    5      16       220       204   163.177         0           -    0.147315
    6      16       289       273   181.974        92   0.0330153    0.318871
    7      16       289       273   155.977         0           -    0.318871
    8      16       289       273    136.48         0           -    0.318871
    9      16       328       312   138.646        52   0.0715037    0.447853
   10      16       370       354   141.579       168   0.0320291    0.399981
Total time run:         10.595065
Total writes made:      371
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     140.065
Stddev Bandwidth:       169.704
Max bandwidth (MB/sec): 544
Min bandwidth (MB/sec): 0
Average IOPS:           35
Stddev IOPS:            42
Max IOPS:               136
Min IOPS:               0
Average Latency(s):     0.456343
Stddev Latency(s):      0.992014
Max latency(s):         4.89872
Min latency(s):         0.0268852

Read Sequential

# rados bench -p scbench 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       361       345   1378.22      1380  0.00436163   0.0421903
Total time run:       1.318472
Total reads made:     447
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1356.11
Average IOPS:         339
Stddev IOPS:          0
Max IOPS:             345
Min IOPS:             345
Average Latency(s):   0.0455131
Max latency(s):       0.295762
Min latency(s):       0.0036308

Read Random

# rados bench -p scbench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       336       320   1279.54      1280   0.0210435   0.0460407
    2      16       666       650   1299.57      1320   0.0639649   0.0465566
    3      16      1047      1031   1374.27      1524   0.0158093   0.0447449
    4      16      1387      1371   1370.64      1360    0.168472   0.0449818
    5      16      1733      1717   1373.27      1384  0.00608685   0.0449307
    6      16      2072      2056   1370.35      1356   0.0464014   0.0451591
    7      15      2400      2385   1362.54      1316   0.0233917   0.0454411
    8      15      2743      2728   1363.69      1372   0.0243667   0.0455168
    9      16      3074      3058   1358.82      1320   0.0363461   0.0455592
   10      16      3413      3397   1358.51      1356   0.0159213   0.0456878
Total time run:       10.087308
Total reads made:     3413
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1353.38
Average IOPS:         338
Stddev IOPS:          16
Max IOPS:             381
Min IOPS:             320
Average Latency(s):   0.045995
Max latency(s):       0.464308
Min latency(s):       0.00316918

Benchmark a ceph block device using fio
Code:
# fio rbd.fio
rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.16
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/524KB/0KB /s] [0/131/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=208140: Fri Jun 14 21:16:39 2019
  write: io=1024.0MB, bw=5182.9KB/s, iops=1295, runt=202317msec
    slat (usec): min=1, max=1319, avg= 6.52, stdev= 5.25
    clat (msec): min=2, max=1742, avg=24.69, stdev=95.42
    lat (msec): min=2, max=1742, avg=24.70, stdev=95.42
    clat percentiles (msec):
    |  1.00th=[    6],  5.00th=[    8], 10.00th=[    9], 20.00th=[   10],
    | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   13],
    | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   20],
    | 99.00th=[  578], 99.50th=[  840], 99.90th=[ 1172], 99.95th=[ 1287],
    | 99.99th=[ 1516]
    lat (msec) : 4=0.37%, 10=21.40%, 20=73.28%, 50=1.42%, 100=1.30%
    lat (msec) : 250=0.55%, 500=0.50%, 750=0.49%, 1000=0.45%, 2000=0.23%
  cpu          : usr=36.96%, sys=62.45%, ctx=231721, majf=0, minf=9852
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
    issued    : total=r=0/w=262144/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
    latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=5182KB/s, minb=5182KB/s, maxb=5182KB/s, mint=202317msec, maxt=202317msec

Disk stats (read/write):
    dm-1: ios=0/3621, merge=0/0, ticks=0/20620, in_queue=20620, util=5.58%, aggrios=840/1754, aggrmerge=0/1938, aggrticks=212/19108, aggrin_queue=19324, aggrutil=5.69%
  sda: ios=840/1754, merge=0/1938, ticks=212/19108, in_queue=19324, util=5.69%

Controllers used; the most relevant lspci entries are listed below.
Code:
# lspci | grep controller
00:1f.2 SATA controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode] (rev 05)
03:00.0 Serial Attached SCSI controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)

Finally, in terms of the SSD choice, it was really a matter of cost. I have the original platter-based SAS drives available, but figured the all-SSD approach would help, even if these aren't the best SSDs.
 
Can you show 'ceph -s' here?
In my opinion the apply and commit latencies are a bit high. I run a Ceph cluster with 16x WD Blue SSDs and only 4x 1 GbE NICs, and my latencies sit at around 2 - 10 ms.
Do you use jumbo frames?
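
You can verify jumbo frames end-to-end with a do-not-fragment ping (just a sketch, assuming MTU 9000 on the Ceph network; pick one of your 10.10.10.x peers as the target):

Code:
ip link show | grep mtu                  # check the configured MTU per interface
ping -M do -s 8972 -c 3 10.10.10.101     # 8972 = 9000 minus 28 bytes IP/ICMP header; fails if any hop is not jumbo-enabled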

Have you checked the health of your drives? What about the TBW (total bytes written)?
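
smartmontools will show the wear and total-written counters (a sketch; /dev/sda is a placeholder and the exact attribute names differ per SSD model):

Code:
apt install smartmontools
smartctl -a /dev/sda | grep -i -e wear -e lifetime -e lba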

Are you using Filestore OSDs? If yes, I recommend changing them to BlueStore. I run all my OSDs with BlueStore and don't have such problems; maybe this will help.
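
You can check what the OSDs were created with like this (sketch; 0 is just an example OSD id):

Code:
ceph osd metadata 0 | grep osd_objectstore    # prints "filestore" or "bluestore"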
 
Benchmark networks
It's odd that iperf picked such different TCP window sizes for the two links. For the 40 GbE, use more parallel threads for the iperf test. If you already did and it still stays at those ~12 Gbit/s, then something is not right with that 40 GbE network.
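
Something along these lines, with parallel streams (sketch; adjust the address to your 40 GbE peer):

Code:
# on the receiving node
iperf -s
# on the sending node: 4 parallel streams for 30 seconds
iperf -c 10.10.10.101 -P 4 -t 30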

Networking-wise, there is a 10gb management network, and a 40gb ceph network for the osds to communicate.
Only the OSD replication traffic runs on the 40 GbE (cluster) network; every other Ceph communication (client to OSD, MON, MDS, ...) runs on the 10 GbE public network. And if corosync is on that network too, then your cluster will be even more unstable (it may suffer resets if HA is enabled).
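
To restate how the ceph.conf you posted splits the traffic (not a change to apply, just the current layout):

Code:
cluster network = 10.10.10.0/24     # 40 GbE: OSD replication and heartbeat traffic only
public network  = 192.168.1.0/24    # 10 GbE: client (RBD/CephFS), MON and MDS traffic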

# rados bench -p scbench 10 write --no-cleanup
4 16 220 204 203.971 0 - 0.147315
5 16 220 204 163.177 0 - 0.147315
The 0 MB/s intervals correspond with the slow requests; at some point the data can't be written to the OSDs.

What hardware are you using exactly?
Did you see our Ceph benchmark paper for comparison?
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
 
Thank you everyone for your thoughts. After sb-jw's comments, I'm thinking there may be some issues with my cluster's SSDs. Because of an upcoming project I've reset the cluster to a RAID setup using the default LVM method, without Ceph. After running pveperf on a server with 10,000 rpm platter drives, and then again on the Crucial drives, I see some large differences in performance:

Crucial MX500 SSDs:
Code:
root@pve2:~# pveperf
CPU BOGOMIPS:      294121.28
REGEX/SECOND:      2847126
HD SIZE:           93.99 GB (/dev/mapper/pve-root)
BUFFERED READS:    276.17 MB/sec
AVERAGE SEEK TIME: 0.21 ms
FSYNCS/SECOND:     20.26
DNS EXT:           36.22 ms
DNS INT:           1.54 ms

10000rpm SAS platter drives:
Code:
root@pve:~# pveperf
CPU BOGOMIPS:      294122.24
REGEX/SECOND:      2830959
HD SIZE:           93.99 GB (/dev/mapper/pve-root)
BUFFERED READS:    953.11 MB/sec
AVERAGE SEEK TIME: 4.24 ms
FSYNCS/SECOND:     2573.03
DNS EXT:           34.81 ms
DNS INT:           27.51 ms

The FSYNCS/SECOND figure on the SSDs is incredibly low compared to the SAS drives. In a few weeks, when I have no demands on the servers, I'll go back and retry Ceph clustering, learning from this experience and your help. Thank you!
 
Because of an upcoming project I've reset the cluster in raid format and using the default LVM method without ceph.
And don't use the RAID controller for Ceph, not even for RAID0 or JBOD. Use an HBA; if you don't have one, try to flash the RAID controller to IT mode (or better, buy a proper HBA).

Even with low-performing disks, Ceph would throttle and the rados bench would show a consistent write output.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
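
If you want a quick number on how the Crucials behave under Ceph-like write patterns, the usual check is a 4k sync-write fio run against a single drive (sketch; /dev/sdX is a placeholder, and this form writes to the raw device, so only run it on a disk whose data you can lose):

Code:
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=ssd-sync-test

Enterprise SSDs with power-loss protection typically sustain thousands of these IOPS; consumer drives often drop to a few hundred or less, which lines up with the FSYNCS/SECOND difference you saw in pveperf.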
 
