Hi all,
We see the following output of ceph bench:
At regular intervals, the "cur MB/s" column drops to zero. If we run iperf at the same time, we can tell that the network is functioning perfectly: while ceph bench drops to zero, iperf continues at full speed over the 10G Ethernet (see the iperf sketch below the bench output).
Code:
root@ceph1:~# rados bench -p scbench 600 write --no-cleanup
 Maintaining 16 concurrent writes of 4194304 bytes for up to 600 seconds or 0 objects
 Object prefix: benchmark_data_pm1_36584
  sec  Cur ops  started  finished  avg MB/s  cur MB/s   last lat    avg lat
    0        0        0         0         0         0          -          0
    1       16      124       108   431.899       432   0.138315   0.139077
    2       16      237       221   441.928       452   0.169759   0.140138
    3       16      351       335   446.598       456   0.105837   0.139844
    4       16      466       450   449.938       460   0.140141   0.139716
    5       16      569       553   442.337       412   0.025245   0.139328
    6       16      634       618   411.943       260  0.0302609   0.147129
    7       16      692       676   386.233       232    1.01843    0.15158
    8       16      721       705   352.455       116  0.0224958   0.159924
    9       16      721       705   313.293         0          -   0.159924
   10       16      764       748   299.163        86  0.0629263    0.20961
   11       16      869       853   310.144       420  0.0805086   0.204707
   12       16      986       970   323.295       468   0.175718   0.196822
   13       16     1100      1084     333.5       456   0.171172    0.19105
   14       16     1153      1137   324.819       212  0.0468416   0.188643
   15       16     1225      1209   322.363       288  0.0421159   0.195791
   16       16     1236      1220   304.964        44    1.28629   0.195499
   17       16     1236      1220   287.025         0          -   0.195499
   18       16     1236      1220   271.079         0          -   0.195499
   19       16     1324      1308   275.336   117.333   0.148679   0.231708
   20       16     1436      1420   283.967       448   0.120878   0.224367
   21       16     1552      1536   292.538       464   0.173587   0.218141
   22       16     1662      1646   299.238       440   0.141544   0.212946
   23       16     1720      1704   296.314       232  0.0273257   0.211416
   24       16     1729      1713   285.467        36  0.0215821   0.211308
   25       16     1729      1713   274.048         0          -   0.211308
   26       16     1729      1713   263.508         0          -   0.211308
   27       16     1787      1771    262.34   77.3333  0.0338129   0.241103
   28       16     1836      1820    259.97       196   0.183042   0.245665
   29       16     1949      1933    266.59       452   0.129397   0.239445
   30       16     2058      2042   272.235       436   0.165108   0.234447
   31       16     2159      2143   276.484       404  0.0466259   0.229704
   32       16     2189      2173   271.594       120  0.0206958   0.231772
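For reference, this is roughly how we ran the iperf check in parallel with the bench; the hostname ceph2 is just a placeholder for a second node, and we assume the classic iperf 2.x client/server here:
Code:
# on a second node, start the iperf server
root@ceph2:~# iperf -s

# on the benching node, run the client for the same 600 seconds,
# reporting throughput every second while rados bench is running
root@ceph1:~# iperf -c ceph2 -t 600 -i 1
The per-second throughput from iperf stays at line rate even during the seconds where rados bench reports 0 MB/s.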
So something is slowing ceph down at regular intervals.
Does anyone have a clue what to look at?
This is on a three-node Proxmox cluster, 65 GB RAM per server, journals on SSD (default Proxmox config: 5 GB per journal), connected through 10G Ethernet. Each node has four 4 TB disks installed, for a total of 12 OSDs.
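In case it matters, this is a sketch of how the journal size of a running OSD can be confirmed via the admin socket (osd.0 is just an example id; run it on the node hosting that OSD):
Code:
# ask the running OSD daemon what journal size (in MB) it is using;
# 5120 MB would correspond to the 5 GB Proxmox default
root@ceph1:~# ceph daemon osd.0 config get osd_journal_size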
During the 0 MB/s intervals there is NO increase in CPU usage: it stays around 15-20% for the four ceph-osd processes.
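For what it's worth, this is roughly how we watched the per-process CPU figures during a stall (a sketch assuming the sysstat package is installed for pidstat):
Code:
# sample CPU usage of all ceph-osd processes once per second
root@ceph1:~# pidstat -u -p $(pgrep -d, ceph-osd) 1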
Any suggestions on where to look?
MJ