Ceph isn't meeting performance expectations

kayson

New Member
Feb 13, 2024
I just set up my first Ceph pool. I have 4 identical nodes, each with an identical 16TB HDD. In isolation, the HDDs get about 270MB/s read and write. The nodes are connected with a dedicated 10Gb Ceph network. I set up the pool with 2 replicas and with PG autoscaling enabled (though changing the number of PGs doesn't seem to significantly impact performance), and I disabled write caching on the drives themselves.
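For reference, I set the pool up roughly like this (typing it from memory, so the exact flags may differ slightly; /dev/sda is just a placeholder for each node's OSD disk):

Code:
# create the pool; 2 replicas, autoscaler manages the PG count
ceph osd pool create HDD
ceph osd pool set HDD size 2
ceph osd pool set HDD pg_autoscale_mode on
# disable the on-disk write cache on each OSD drive (placeholder device name)
hdparm -W 0 /dev/sda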

Ideally, I would expect that for writes, I'd get double the performance of a single drive, since the same data is written to two drives and each pair of drives gets half the data. For reads, you could get quadruple the performance of a single drive, since a fourth of the data could be read from each drive in parallel.
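In rough numbers, that naive back-of-envelope math (ignoring replication and network overhead) would be:

Code:
# single HDD:                      ~270 MB/s
# writes, 4 OSDs, 2 replicas:      4 x 270 / 2 = ~540 MB/s
# reads, all 4 OSDs in parallel:   4 x 270     = ~1080 MB/s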

In practice, I'm seeing the following:
Pool writes of ~370MB/s
Pool reads of ~430MB/s

Interestingly, I can see the network traffic correlating with these speeds: on the node where I'm running the command, I can see 370MB/s of traffic going out, with roughly a third of that going into each of the other 3 nodes. It's similar, but reversed, for reads. Overall, though, the speeds are quite a bit lower than I expected in both cases, while CPU and memory usage are low.
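(I'm watching the traffic with something like the following on each node; the interface name is just an example for the Ceph-dedicated NIC:)

Code:
# per-interface throughput, 1-second samples
sar -n DEV 1
# or interactively on the Ceph NIC (example interface name)
iftop -i ens19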

If I create a block device following the Ceph wiki, I see:
Block writes of ~170MB/s
Block reads of ~15MiB/s

I'm guessing there might be something wrong with the commands I'm using here. I'm not sure why the speeds are so poor.
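Looking at the rbd bench output below, the read test ran with the default 4K io_size while the rados bench used 16M objects, so maybe that's part of it? An apples-to-apples rerun would probably look something like this (untested, so treat the options as a guess):

Code:
# same 16 threads as before, but a much larger io_size (e.g. 4M)
rbd bench --io-type read --io-size 4M --io-threads 16 -p HDD image01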

I've copied the full commands and outputs I've used below. If anyone has any feedback on how to further improve performance, I'd greatly appreciate it!

Code:
root@cortana01:~# rados bench -p HDD 10 write -b 16M --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 16777216 bytes to objects of size 16777216 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cortana01_39329
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15        26        11   175.921       176     0.61204    0.666382
    2      15        49        34   271.923       368    0.595901    0.686431
    3      15        72        57   303.932       368    0.702953    0.686895
    4      15        94        79   315.934       352    0.792737    0.692411
    5      15       120       105   335.936       416    0.638328    0.687194
    6      15       140       125   333.274       320     0.65639    0.696982
    7      15       165       150   342.799       400    0.619978    0.693383
    8      15       186       171   341.947       336    0.790205    0.686658
    9      15       212       197    350.17       416     0.61829    0.689477
   10      15       234       219    350.35       352    0.748165    0.687685
Total time run:         10.1302
Total writes made:      235
Write size:             16777216
Object size:            16777216
Bandwidth (MB/sec):     371.168
Stddev Bandwidth:       69.3128
Max bandwidth (MB/sec): 416
Min bandwidth (MB/sec): 176
Average IOPS:           23
Stddev IOPS:            4.33205
Max IOPS:               26
Min IOPS:               11
Average Latency(s):     0.668439
Stddev Latency(s):      0.105804
Max latency(s):         0.849491
Min latency(s):         0.128145

Code:
root@cortana01:~# rados bench -p HDD 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15        31        16    255.91       256    0.598025    0.610392
    2      15        58        43   343.921       432    0.521511     0.59645
    3      15        87        72   383.924       464    0.559089     0.56917
    4      15       119       104   415.926       512    0.507587    0.542226
    5      15       144       129   412.732       400    0.714784    0.553088
    6      15       169       154   410.605       400    0.719124    0.569878
    7      15       195       180   411.371       416    0.669092    0.578182
    8      15       222       207    413.94       432    0.685254    0.580606
Total time run:       8.74796
Total reads made:     238
Read size:            16777216
Object size:          16777216
Bandwidth (MB/sec):   435.301
Average IOPS:         27
Stddev IOPS:          4.61171
Max IOPS:             32
Min IOPS:             16
Average Latency(s):   0.564066
Max latency(s):       1.23279
Min latency(s):       0.143175
Code:
root@cortana01:/# rbd bench --io-type read -p HDD image01
bench  type read io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1      4656   4681.34    18 MiB/s
    2     11984   5990.98    23 MiB/s
    3     12304   4102.54    16 MiB/s
    4     12624   3157.61    12 MiB/s
    5     17024   3409.34    13 MiB/s
    6     24048   3861.39    15 MiB/s
    7     29312   3455.21    13 MiB/s
    8     29952   3516.92    14 MiB/s
    9     36032   4683.45    18 MiB/s
   10     36592   3893.33    15 MiB/s
   11     42272   3652.82    14 MiB/s
   12     42896   2726.05    11 MiB/s
   13     44528   2925.13    11 MiB/s
   14     45008   1794.47   7.0 MiB/s
   15     52560   3210.28    13 MiB/s
   16     55952   2735.98    11 MiB/s
   17     56432   2702.86    11 MiB/s
   18     62784   3642.44    14 MiB/s
   19     63168   3622.56    14 MiB/s
   20     69936   3475.18    14 MiB/s
   21     85296   5873.47    23 MiB/s
   22     87152   6157.51    24 MiB/s
   23     93376   6136.78    24 MiB/s
   24     94016   6178.21    24 MiB/s
   25    100416   6093.53    24 MiB/s
   26    103072   3560.16    14 MiB/s
   27    109616   4488.29    18 MiB/s
   28    114640   4254.48    17 MiB/s
   29    122608   5716.08    22 MiB/s
   30    123248   4556.35    18 MiB/s
   31    130432   5459.96    21 MiB/s
   32    135952   5271.39    21 MiB/s
   33    138064   4681.03    18 MiB/s
   34    142096   3907.74    15 MiB/s
   35    154496   6265.86    24 MiB/s
   36    155696   5029.64    20 MiB/s
   37    158864   4576.88    18 MiB/s
   38    159344   4253.42    17 MiB/s
   39    159824   3525.84    14 MiB/s
   40    167184   2528.99   9.9 MiB/s
   41    167648   2406.27   9.4 MiB/s
   42    168480   1925.11   7.5 MiB/s
   43    170688   2265.16   8.8 MiB/s
   44    171168   2274.25   8.9 MiB/s
   45    172800   1122.97   4.4 MiB/s
   46    174064   1278.84   5.0 MiB/s
   47    174464    1190.6   4.7 MiB/s
   48    174848   826.869   3.2 MiB/s
   49    180800   1931.41   7.5 MiB/s
   50    182240   1891.77   7.4 MiB/s
   51    184000   1993.57   7.8 MiB/s
   52    186624   2443.72   9.5 MiB/s
   53    187104   2474.45   9.7 MiB/s
   54    194672   2771.06    11 MiB/s
   55    201872   3927.95    15 MiB/s
   56    202512   3700.16    14 MiB/s
   57    209824   4642.76    18 MiB/s
   58    220592   6697.56    26 MiB/s
   59    224816   6028.77    24 MiB/s
   60    225296   4681.96    18 MiB/s
   61    226624   4817.56    19 MiB/s
   62    229744   3956.28    15 MiB/s
   63    232192   2316.74   9.0 MiB/s
   64    246880   4418.96    17 MiB/s
   65    250240    4980.8    19 MiB/s
   66    260832   6851.15    27 MiB/s
elapsed: 66   ops: 262144   ops/sec: 3949.6   bytes/sec: 15 MiB/s

Code:
root@cortana01:~# rbd bench --io-type write -p HDD --io-size 16K  image01
bench  type write io_size 16384 io_threads 16 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     12128   12143.9   190 MiB/s
    2     23360   11664.6   182 MiB/s
    3     34720     11521   180 MiB/s
    4     45008   11188.8   175 MiB/s
    5     55648   11106.1   174 MiB/s
elapsed: 6   ops: 65536   ops/sec: 10810.9   bytes/sec: 169 MiB/s
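(Side note: the write bench ran with --no-cleanup so the seq bench had objects to read; if I understand correctly, the leftover benchmark objects can be removed afterwards with something like:)

Code:
rados -p HDD cleanup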
 
Read the recommended specs for using Ceph.

In short: Ceph is *very* sensitive to latency, especially for the <ok> feedback after a write command.
The key is to use datacenter SSDs with *real* power loss protection (PLP) and a PCIe connector like U.2, because those drives return the <ok> instantly after a write; power loss isn't a problem thanks to their capacitors.

Consumer/pseudo-PLP drives have a much longer delay.

For this reason, HDDs are a bad choice if you want a fast Ceph pool.
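If you want to see how much that write-ack latency costs you, you can watch the per-OSD latency counters, e.g.:

Code:
# commit/apply latency per OSD, in milliseconds
ceph osd perf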
 
Ideally, I would expect that for writes, I'd get double the performance of a single drive, since the same data is written to two drives and each pair of drives gets half the data. For reads, you could get quadruple the performance of a single drive, since a fourth of the data could be read from each drive in parallel.

Ceph doesn't work like RAID, so you can't calculate it that way.

Your network latency is very high, so (not counting the "slow" HDDs) you likely have a slow CPU, power savings enabled, etc.
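You can check the network side quickly with something like this (the IP is just an example for one of the other nodes):

Code:
# round-trip latency on the Ceph network (example peer IP)
ping -c 10 10.10.10.2
# raw bandwidth between two nodes
iperf3 -s              # on the first node
iperf3 -c 10.10.10.2   # on the second node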
 
Ceph doesn't work like RAID, so you can't calculate it that way.
I know it's not actually RAID, but I'm curious why it wouldn't optimize throughput this way, especially for reads.


Your network latency is very high, so (not counting the "slow" HDDs) you likely have a slow CPU, power savings enabled, etc.
Yeah, I do have the CPU set to drop to its lowest idle state when it can... I'm surprised that it matters so much for sequential operations.
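I'll try forcing the performance governor and see if it changes anything; if I remember right, that's something like this (assuming the cpupower tool is installed):

Code:
# check the current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# force the performance governor on all cores
cpupower frequency-set -g performance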


Read the recommended specs for using Ceph.

In short: Ceph is *very* sensitive to latency, especially for the <ok> feedback after a write command.
The key is to use datacenter SSDs with *real* power loss protection (PLP) and a PCIe connector like U.2, because those drives return the <ok> instantly after a write; power loss isn't a problem thanks to their capacitors.

Consumer/pseudo-PLP drives have a much longer delay.

For this reason, HDDs are a bad choice if you want a fast Ceph pool.
I'm actually fine with this performance for this pool, since it's much faster than a single drive anyway. I don't understand, though, why the pool speeds are so good but the block device speeds are so bad.
 
You could also use a DB/WAL disk to speed things up, something like 1 SSD per 2 HDDs. If the SSD fails, it's the same as both HDDs failing. We use this with 2.5" 10k spinners to boost our backup pool.
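Creating an OSD with its DB/WAL on a separate SSD looks roughly like this with ceph-volume (device paths are just examples; the Proxmox GUI can set up the same thing):

Code:
# HDD as the data device, SSD partition for the DB/WAL (example devices)
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1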
 
