[SOLVED] Ceph - slow Recovery/Rebalance on fast SAS SSDs

ilia987

We have suffered some server failures.

When the servers came back, Ceph had to recover/rebalance around 10-30 TB of data.

  • The SSDs are relatively high-end SAS SSDs (4 TB Seagate Nytro and HP/Dell 8 TB, PM1643-based)
  • The network is 2x 40Gb Ethernet (one link dedicated to sync/replication, one for client traffic)
  • 3 servers with 8 SSDs each, behind Supermicro LSI SAS3008 HBAs (IT mode, SAS 12Gb/s)
  • The pool is set to replication 3 (min 2)

The problem is that the recovery/rebalance averages around 500 MB/s, with peaks of 1100 MB/s.

Is that OK?
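For reference, the current recovery rate and the pool settings above can be read straight from the cluster; this is just a sketch, and <pool> is a placeholder for the affected pool:

Code:
# recovery/rebalance throughput shows up under "io:" as a "recovery:" line
ceph -s

# confirm the replication settings (replace <pool> with the affected pool name)
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size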
 
Do you have any benchmarks taken on an empty cluster, to use as a reference?
 
Under load we see around 8-10 GB/s read throughput (but most of our files are large and our IOPS stay under 2k all the time).
(We don't have enough CPUs to consume the entire read bandwidth, not yet :) )


Created a test pool with 128 PGs and 3/2 replication (the pool-creation commands are sketched after the output below), and here are the results:

Code:
root@pve-srv2:~# echo 3 | tee /proc/sys/vm/drop_caches && sync && rados -p bench bench 60 write --no-cleanup &&rados -p bench bench 60 seq && rados -p bench bench 60 rand && rados -p bench cleanup
3
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_pve-srv2_2943958
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)

    1      16       489       473   1891.87      1892   0.0376208    0.033391
...
   19      16      8826      8810   1854.45      1808   0.0288513   0.0344738
2020-12-07 12:50:01.099817 min lat: 0.0145275 max lat: 0.29231 avg lat: 0.0345481
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      16      9272      9256   1850.91      1784   0.0301816   0.0345481
...
   39      16     18219     18203   1866.68      1896   0.0308282   0.0342669
2020-12-07 12:50:21.103043 min lat: 0.013738 max lat: 0.29231 avg lat: 0.0342716
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   40      16     18682     18666   1866.31      1852   0.0383676   0.0342716
..
   59      16     27694     27678   1876.19      1868   0.0382379   0.0340996
2020-12-07 12:50:41.105770 min lat: 0.013738 max lat: 0.312927 avg lat: 0.03408
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     28176     28160   1877.05      1928   0.0292324     0.03408
Total time run:         60.0226
Total writes made:      28176
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1877.69
Stddev Bandwidth:       40.9094
Max bandwidth (MB/sec): 1948
Min bandwidth (MB/sec): 1776
Average IOPS:           469
Stddev IOPS:            10.2274
Max IOPS:               487
Min IOPS:               444
Average Latency(s):     0.034083
Stddev Latency(s):      0.0139051
Max latency(s):         0.312927
Min latency(s):         0.013738
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)

    1      16       492       476   1903.61      1904   0.0144739    0.032361
   ...
   19      16      9528      9512   2002.05      1884   0.0114696   0.0311411
2020-12-07 12:51:02.163130 min lat: 0.00960033 max lat: 0.235241 avg lat: 0.0313088
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      15      9977      9962   1991.84      1800   0.0152404   0.0313088
...
   39      15     19328     19313   1980.15      1844   0.0181161   0.0315141
2020-12-07 12:51:22.171570 min lat: 0.00939878 max lat: 0.235241 avg lat: 0.0315748
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   40      15     19789     19774   1976.71      1844   0.0352595   0.0315748
...
   57      15     27790     27775   1948.48      1796   0.0101361   0.0320328
Total time run:       57.8999
Total reads made:     28176
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1946.53
Average IOPS:         486
Stddev IOPS:          21.9527
Max IOPS:             528
Min IOPS:             433
Average Latency(s):   0.0320763
Max latency(s):       0.235241
Min latency(s):       0.00939878
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)

    1      16       467       451    1803.6      1804   0.0118634    0.032883
...
   19      15      8139      8124   1710.07      1720   0.0143589   0.0366271
2020-12-07 12:52:00.233533 min lat: 0.00333213 max lat: 0.644385 avg lat: 0.036662
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      15      8557      8542   1708.16      1672  0.00995928    0.036662
  ...
   39      16     16944     16928   1735.96      1848   0.0155378   0.0360801
2020-12-07 12:52:20.236352 min lat: 0.00333213 max lat: 0.644385 avg lat: 0.0359872
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   40      16     17416     17400   1739.75      1888   0.0171109   0.0359872
   ...
   59      16     25686     25670   1740.09      1628     0.10603   0.0359901
2020-12-07 12:52:40.239109 min lat: 0.00333092 max lat: 0.644385 avg lat: 0.0360132
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     26099     26083   1738.62      1652     0.12876   0.0360132
Total time run:       60.0498
Total reads made:     26099
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1738.49
Average IOPS:         434
Stddev IOPS:          22.0409
Max IOPS:             486
Min IOPS:             388
Average Latency(s):   0.0360341
Max latency(s):       0.644385
Min latency(s):       0.00333092
Removed 28176 objects
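For completeness, a test pool like the one used above could be created along these lines; this is a sketch based on the pool name "bench" and the 128 PGs from the command above, not necessarily the exact commands that were run:

Code:
# create a replicated test pool with 128 PGs and set 3/2 replication
ceph osd pool create bench 128 128 replicated
ceph osd pool set bench size 3
ceph osd pool set bench min_size 2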
 
It is intentional that Ceph does not fill up all available bandwidth during recovery/rebalancing. If you want to speed it up, you can set the following values to help the OSDs perform recovery faster (see the sketch after this list):
  • osd max backfills: The maximum number of backfill operations allowed to/from an OSD. The higher the number, the quicker the recovery, which might impact overall cluster performance until recovery finishes.
  • osd recovery max active: The maximum number of active recovery requests per OSD. The higher the number, the quicker the recovery, which might impact overall cluster performance until recovery finishes.
  • osd recovery op priority: The priority given to recovery operations relative to client operations. A higher value speeds up recovery but might cause performance degradation for client I/O until recovery completes.
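A minimal sketch of how these can be raised at runtime, assuming a Ceph release with the central config database (older releases would use injectargs only); the values are examples and should be reverted once recovery finishes:

Code:
# persistently raise the recovery limits for all OSDs (example values)
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 8

# or inject them into the running OSDs without persisting
ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'

# revert to the defaults after recovery is done
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active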
 
