ceph rbd slow down read/write

Summary:
pve-manager/5.3-5/97ae681d (running kernel: 4.15.18-9-pve)
ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable)
4 nodes (per node: 4 nvme ssd & 2 sas ssd, bluestore) + 1 node with 4 sata ssd
interconnect - 2x 10Gbps
Created a pool (512 PGs, replicated 3/2) on the SAS SSDs
On that pool, created an RBD image for an LXC container
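For context, a 3/2 replicated pool restricted to one device class plus a backing RBD image are typically created roughly like this; a sketch only, the rule/pool/image names and the size are placeholders, not the actual ones from the attached configs:
Code:
# CRUSH rule bound to the ssd device class, then a 3/2 replicated pool on it
ceph osd crush rule create-replicated sas-ssd-rule default host ssd
ceph osd pool create sas-pool 512 512 replicated sas-ssd-rule
ceph osd pool set sas-pool size 3
ceph osd pool set sas-pool min_size 2
ceph osd pool application enable sas-pool rbd
# the container image itself is normally created by the PVE storage plugin, e.g.:
rbd create sas-pool/vm-100-disk-1 --size 500G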

In the container I copy a big file (more than 300 GiB), and after 2-3 minutes the write speed drops to 4-10 MiB/s.

A few hours after the copy finishes, the write speed returns to a high value, but the next copy gives the same result.
While the slow write speed persists, all operations on the mapped RBD device are very slow (I ran fio and dd tests), yet utilization of this RBD device is very low: 1%-3%!

I tried creating new pools, creating new containers, and changing privileges on the containers - no effect.
I tried tuning some kernel parameters - no effect.

Need help!

UPD: attached conf.tar.gz
UPD: added 1 node with 4 sata ssd
 
Please post your Ceph configs, Ceph status, VM configs and storage configs.
 
Can you please also post a rados bench?

You can also check our Ceph benchmark paper and the corresponding test commands.
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

Code:
# rados -p bench2 bench 60 write -b 4M -t 16 --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_lpr8_632662
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
...
2019-06-14 13:59:55.312188 min lat: 0.0256227 max lat: 0.280766 avg lat: 0.0743873
...
2019-06-14 14:00:15.314177 min lat: 0.0256227 max lat: 0.280766 avg lat: 0.0742069
...
2019-06-14 14:00:35.316154 min lat: 0.0256227 max lat: 0.313224 avg lat: 0.0733676
...
Total time run:         60.071086
Total writes made:      13092
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     871.767
Stddev Bandwidth:       31.3655
Max bandwidth (MB/sec): 980
Min bandwidth (MB/sec): 816
Average IOPS:           217
Stddev IOPS:            7
Max IOPS:               245
Min IOPS:               204
Average Latency(s):     0.0734098
Stddev Latency(s):      0.0266476
Max latency(s):         0.313224
Min latency(s):         0.0256227

Code:
# rados -p bench2 bench 60 seq -t 16
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
...
2019-06-14 16:20:44.291662 min lat: 0.0114718 max lat: 0.205243 avg lat: 0.031458
...
Total time run:       25.296936
Total reads made:     13092
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2070.13
Average IOPS:         517
Stddev IOPS:          97
Max IOPS:             629
Min IOPS:             344
Average Latency(s):   0.0302544
Max latency(s):       0.205243
Min latency(s):       0.0114718

Code:
# rados -p bench2 bench 60 rand -t 16
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
...
2019-06-14 16:21:17.304923 min lat: 0.00192472 max lat: 0.191512 avg lat: 0.0291881
...
2019-06-14 16:21:37.307196 min lat: 0.00182423 max lat: 0.191512 avg lat: 0.0293725
...
2019-06-14 16:21:57.309487 min lat: 0.00182423 max lat: 0.191512 avg lat: 0.0300554
...
Total time run:       60.050730
Total reads made:     31289
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2084.17
Average IOPS:         521
Stddev IOPS:          84
Max IOPS:             661
Min IOPS:             357
Average Latency(s):   0.0300742
Max latency(s):       0.191512
Min latency(s):       0.00182423

And 4K block write:
Code:
# rados -p bench2 bench 60 write -b 4K -t 128 --no-cleanup
hints = 1
Maintaining 128 concurrent writes of 4096 bytes to objects of size 4096 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_lpr8_1999777
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
...
2019-06-14 16:34:13.577341 min lat: 0.00129878 max lat: 0.450519 avg lat: 0.00547605
...
2019-06-14 16:34:33.579377 min lat: 0.00129878 max lat: 0.450519 avg lat: 0.00618825
...
2019-06-14 16:34:53.581568 min lat: 0.0012379 max lat: 0.450519 avg lat: 0.00678395
...
Total time run:         60.107514
Total writes made:      1131512
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     73.5344
Stddev Bandwidth:       23.7944
Max bandwidth (MB/sec): 107.195
Min bandwidth (MB/sec): 21.9023
Average IOPS:           18824
Stddev IOPS:            6091
Max IOPS:               27442
Min IOPS:               5607
Average Latency(s):     0.00679856
Stddev Latency(s):      0.0188425
Max latency(s):         0.450519
Min latency(s):         0.0012379

And 4K read:
Code:
# rados -p bench2 bench 60 rand -t 128
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
...
2019-06-14 16:37:40.254569 min lat: 0.000193211 max lat: 0.0609103 avg lat: 0.00247485
...
2019-06-14 16:38:00.256711 min lat: 0.000177343 max lat: 0.197939 avg lat: 0.00240127
...
Total time run:       60.001678
Total reads made:     3209996
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   208.978
Average IOPS:         53498
Stddev IOPS:          4567
Max IOPS:             60641
Min IOPS:             40014
Average Latency(s):   0.00238715
Max latency(s):       0.197939
Min latency(s):       0.000177343
 
What specific hardware do the servers have (e.g. SSD model, controller, ...)?
 
The trouble is not at the OSD or pool level.
I think the problem is at the RBD / krbd mapping level in the LXC container.
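To look at that layer from the host, something along these lines can be used; the pool/image names and the rbd6 device below are assumptions:
Code:
# list kernel-mapped RBD devices on this host
rbd showmapped
# image details (object size, features) of the container volume
rbd info sas-pool/vm-100-disk-1
# krbd block-device settings, e.g. read-ahead and max request size
cat /sys/block/rbd6/queue/read_ahead_kb
cat /sys/block/rbd6/queue/max_sectors_kb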
 
What specific hardware do the servers have (e.g. SSD model, controller, ...)?
Code:
host-8:
sda  1:0:0:0    disk ATA      XA3840ME10063    00ZU sata   sda          0   4096      0    4096     512    0 deadline     128 128    0B sda    3.5T root  disk  brw-rw----
sdb  2:0:0:0    disk ATA      XA3840ME10063    00ZU sata   sdb          0   4096      0    4096     512    0 deadline     128 128    0B sdb    3.5T root  disk  brw-rw----
sdc  3:0:0:0    disk ATA      XA3840ME10063    00ZU sata   sdc          0   4096      0    4096     512    0 deadline     128 128    0B sdc    3.5T root  disk  brw-rw----
sdd  4:0:0:0    disk ATA      XA3840ME10063    00ZU sata   sdd          0   4096      0    4096     512    0 deadline     128 128    0B sdd    3.5T root  disk  brw-rw----
host-11a:
sda  8:0:0:0    disk SAMSUNG  MZILS7T6HMLS/007 GXH0 sas  sda          0   8192   8192    4096     512    0 deadline     128 2048   32M sda    7T root  disk  brw-rw----
sdb  8:0:1:0    disk SAMSUNG  MZILS7T6HMLS/007 GXH0 sas  sdb          0   8192   8192    4096     512    0 deadline     128 2048   32M sdb    7T root  disk  brw-rw----
host-11b:
sda  9:0:0:0    disk SAMSUNG  MZILS7T6HMLS/007 GXH0 sas  sda          0   8192   8192    4096     512    0 deadline     128 2048   32M sda    7T root  disk  brw-rw----
sdb  9:0:1:0    disk SAMSUNG  MZILS7T6HMLS/007 GXH0 sas  sdb          0   8192   8192    4096     512    0 deadline     128 2048   32M sdb    7T root  disk  brw-rw----
host-11c:
sda  8:0:0:0    disk SAMSUNG  MZILS7T6HMLS/007 GXH0 sas  sda          0   8192   8192    4096     512    0 deadline     128 2048   32M sda    7T root  disk  brw-rw----
sdb  8:0:1:0    disk SAMSUNG  MZILS7T6HMLS/007 GXH0 sas  sdb          0   8192   8192    4096     512    0 deadline     128 2048   32M sdb    7T root  disk  brw-rw----
host-11d:
sda  9:0:0:0    disk SAMSUNG  MZILS7T6HMLS/007 GXH0 sas  sda          0   8192   8192    4096     512    0 deadline     128 2048   32M sda    7T root  disk  brw-rw----
sdb  9:0:1:0    disk SAMSUNG  MZILS7T6HMLS/007 GXH0 sas  sdb          0   8192   8192    4096     512    0 deadline     128 2048   32M sdb    7T root  disk  brw-rw----
 
And fio on the RBD device (outside the LXC container):
Code:
# fio ceph-rbd-read.fio
read-seq-4K: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
read-seq-4M: (g=1): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=16
fio-2.16
Starting 2 processes
Jobs: 1 (f=1): [_(1),R(1)] [72.8% done] [760.8MB/0KB/0KB /s] [190/0/0 iops] [eta 01m:30s]  
read-seq-4K: (groupid=0, jobs=1): err= 0: pid=3794340: Fri Jun 14 20:15:04 2019
  read : io=1849.9MB, bw=21043KB/s, iops=5260, runt= 90016msec
    slat (usec): min=1, max=3946, avg= 8.64, stdev= 8.36
    clat (usec): min=344, max=46994, avg=12154.44, stdev=4895.68
     lat (usec): min=354, max=47005, avg=12163.79, stdev=4896.46
    clat percentiles (usec):
     |  1.00th=[ 1592],  5.00th=[ 2928], 10.00th=[ 4320], 20.00th=[ 7520],
     | 30.00th=[ 8256], 40.00th=[12864], 50.00th=[13888], 60.00th=[14656],
     | 70.00th=[15424], 80.00th=[16064], 90.00th=[17024], 95.00th=[18048],
     | 99.00th=[20608], 99.50th=[22144], 99.90th=[33536], 99.95th=[38656],
     | 99.99th=[42752]
    lat (usec) : 500=0.01%, 750=0.10%, 1000=0.23%
    lat (msec) : 2=1.24%, 4=7.48%, 10=24.92%, 20=64.71%, 50=1.31%
  cpu          : usr=2.06%, sys=8.01%, ctx=423438, majf=0, minf=108
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=129.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=473486/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64
read-seq-4M: (groupid=1, jobs=1): err= 0: pid=3821358: Fri Jun 14 20:15:04 2019
  read : io=109908MB, bw=1218.7MB/s, iops=304, runt= 90192msec
    slat (usec): min=64, max=5275, avg=194.90, stdev=134.72
    clat (usec): min=772, max=470820, avg=52357.86, stdev=52697.72
     lat (usec): min=847, max=470977, avg=52552.56, stdev=52698.94
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[   12], 10.00th=[   13], 20.00th=[   21],
     | 30.00th=[   25], 40.00th=[   29], 50.00th=[   34], 60.00th=[   41],
     | 70.00th=[   51], 80.00th=[   73], 90.00th=[  123], 95.00th=[  172],
     | 99.00th=[  255], 99.50th=[  289], 99.90th=[  363], 99.95th=[  396],
     | 99.99th=[  449]
    lat (usec) : 1000=0.02%
    lat (msec) : 2=0.75%, 4=0.94%, 10=0.48%, 20=17.53%, 50=49.84%
    lat (msec) : 100=17.10%, 250=12.24%, 500=1.15%
  cpu          : usr=0.24%, sys=6.03%, ctx=26158, majf=0, minf=10254
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=139.7%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=27462/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: io=1849.9MB, aggrb=21042KB/s, minb=21042KB/s, maxb=21042KB/s, mint=90016msec, maxt=90016msec

Run status group 1 (all jobs):
   READ: io=109908MB, aggrb=1218.7MB/s, minb=1218.7MB/s, maxb=1218.7MB/s, mint=90192msec, maxt=90192msec

Disk stats (read/write):
  rbd6: ios=653598/57, merge=38393/0, ticks=9154172/164, in_queue=9159816, util=97.30%
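(The ceph-rbd-read.fio job file itself was not posted; a job file roughly like the following would produce the two groups shown above. The filename, direct=1 and the runtime value are assumptions.)
Code:
# ceph-rbd-read.fio (reconstructed sketch; direct=1 and runtime are guesses)
[global]
ioengine=libaio
direct=1
time_based
runtime=90
filename=/dev/rbd6

[read-seq-4K]
rw=read
bs=4K
iodepth=64

[read-seq-4M]
stonewall
new_group
rw=read
bs=4M
iodepth=16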
 
Even just reading the file to /dev/null degrades after a few minutes (in the LXC container):
Code:
# dd if=/mnt/data/maps/planet-190513.osm.pbf of=/dev/null status=progress
48314671616 bytes (48 GB, 45 GiB) copied, 655,003 s, 73,8 MB/s
94384404+1 records in
94384404+1 records out
48324815168 bytes (48 GB, 45 GiB) copied, 655,909 s, 73,7 MB/s
 
I ran two reads in parallel: one on the host from the RBD device, and another of a file in the container backed by the same RBD.
In the container, the read speed dropped after 42 GiB:
Code:
# dd if=/mnt/data/maps/planet-190513.osm.pbf of=/dev/null status=progress
48297345536 bytes (48 GB, 45 GiB) copied, 595,004 s, 81,2 MB/s
94384404+1 records in
94384404+1 records out
48324815168 bytes (48 GB, 45 GiB) copied, 595,234 s, 81,2 MB/s
On the host, the read speed was still at 149 MB/s:
Code:
dd if=/dev/rbd6 of=/dev/null status=progress
103478187008 bytes (103 GB, 96 GiB) copied, 696 s, 149 MB/s^C
202239786+0 records in
202239785+0 records out
103546769920 bytes (104 GB, 96 GiB) copied, 696,458 s, 149 MB/s

 
Same result after
echo 3 > /proc/sys/vm/drop_caches
and a read with bs=4M (dd if=/mnt/data/maps/planet-190513.osm.pbf of=/dev/null status=progress bs=4M):
the speed in the LXC container drops after 42 GiB read.

Utilization drops too:
Code:
...
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
rbd6              0,00     0,00  267,00    0,00 272644,00     0,00  2042,28     1,94    7,22    7,22    0,00   3,75 100,00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12,97    0,00    7,78    0,37    0,00   78,88
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
rbd6              0,00     0,00  264,00    1,00 270336,00     4,00  2040,30     1,92    7,26    7,29    0,00   3,73  98,80
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,90    0,00    7,24    0,33    0,00   80,53
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
rbd6              0,00     0,00  248,00    0,00 253952,00     0,00  2048,00     1,94    7,76    7,76    0,00   4,03 100,00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,88    0,00    7,42    0,18    0,00   80,52
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
rbd6              0,00     0,00  120,00    1,00 122880,00     4,00  2031,14     0,89    7,47    7,50    4,00   4,03  48,80
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,30    0,00    7,95    0,05    0,00   80,70
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
rbd6              0,00     0,00   10,00    0,00 10752,00     0,00  2150,40     0,05    4,80    4,80    0,00   4,80   4,80
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,55    0,00    6,38    0,03    0,00   82,03
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
rbd6              0,00     0,00   11,00    0,00 11008,00     0,00  2001,45     0,06    5,45    5,45    0,00   5,45   6,00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,80    0,00    7,21    0,39    0,00   80,60
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
rbd6              0,00     0,00    9,00    1,00  8704,00     4,00  1741,60     0,07    7,20    7,56    4,00   7,20   7,20
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,36    0,00    7,17    0,27    0,00   81,20
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
rbd6              0,00     0,00    8,00    1,00  8192,00     4,00  1821,33     0,06    6,67    7,50    0,00   6,67   6,00
...


 
Can you be more specific about what hardware you use?
Maybe your disks have a cache of about 42 GB, and once it is full they drop in performance because they now write to the slower MLC instead of SLC.

P.S.: I can't open your file on my Android smartphone, so it would be nice if you post such things in Code / Spoiler tags.
 
Can you be more specific about what hardware you use?
Maybe your disks have a cache of about 42 GB, and once it is full they drop in performance because they now write to the slower MLC instead of SLC.
The SSD models are in this post: https://forum.proxmox.com/threads/ceph-rbd-slow-down-write.55055/#post-253876
If performance were dropping on the SSDs, their utilization would have to grow. The opposite happens: SSD and RBD utilization drops to a minimum.
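(This can be verified on the OSD nodes while the copy is running, for example with iostat; the device names here are assumptions:)
Code:
# on a Ceph node: utilization of the SAS SSDs backing the OSDs
iostat -xm 5 sda sdb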

P.S.: I can't open your file on my Android smartphone, so it would be nice if you post such things in Code / Spoiler tags.
Too much info to post in a Code tag:
Code:
$ tar tfz conf.tar.gz
conf/
conf/ceph-crushmap.txt
conf/container-rbd-status.txt
conf/ceph-crush-tree.txt
conf/ceph-hosts.txt
conf/storage-rbd.txt
conf/lxc-config.txt
conf/lxc-pve-config.txt
conf/sysctl-state.txt
conf/ceph-status.txt
 
This trouble is not in the hardware layer, I'm sure of it. I found the following:
If I increase the RAM size of the container, the performance drop moves to the new RAM size.
And after dropping caches, performance grows back up to a normal level:
Code:
find /mnt/data/maps/cache/coords -exec dd if={} of=/dev/null bs=1M \;
...
2108839 bytes (2,1 MB, 2,0 MiB) copied, 0,0131362 s, 161 MB/s
2108528 bytes (2,1 MB, 2,0 MiB) copied, 0,0147751 s, 143 MB/s
2108160 bytes (2,1 MB, 2,0 MiB) copied, 0,0117432 s, 180 MB/s
2110582 bytes (2,1 MB, 2,0 MiB) copied, 0,0134652 s, 157 MB/s
2110466 bytes (2,1 MB, 2,0 MiB) copied, 0,0133086 s, 159 MB/s
2109703 bytes (2,1 MB, 2,0 MiB) copied, 0,0125539 s, 168 MB/s

# memory CACHED reached size of container RAM

2109925 bytes (2,1 MB, 2,0 MiB) copied, 0,0875066 s, 24,1 MB/s
2108361 bytes (2,1 MB, 2,0 MiB) copied, 0,170315 s, 12,4 MB/s
2109374 bytes (2,1 MB, 2,0 MiB) copied, 0,175722 s, 12,0 MB/s
2109803 bytes (2,1 MB, 2,0 MiB) copied, 0,35708 s, 5,9 MB/s
2110739 bytes (2,1 MB, 2,0 MiB) copied, 0,18207 s, 11,6 MB/s
2110182 bytes (2,1 MB, 2,0 MiB) copied, 0,20411 s, 10,3 MB/s
2110315 bytes (2,1 MB, 2,0 MiB) copied, 0,22433 s, 9,4 MB/s
2110772 bytes (2,1 MB, 2,0 MiB) copied, 0,333537 s, 6,3 MB/s
...
2108609 bytes (2,1 MB, 2,0 MiB) copied, 0,39618 s, 5,3 MB/s
2108104 bytes (2,1 MB, 2,0 MiB) copied, 0,424484 s, 5,0 MB/s
2110326 bytes (2,1 MB, 2,0 MiB) copied, 0,345107 s, 6,1 MB/s
2110194 bytes (2,1 MB, 2,0 MiB) copied, 0,169959 s, 12,4 MB/s

# in other tty:  echo 3 > /proc/sys/vm/drop_caches

2109320 bytes (2,1 MB, 2,0 MiB) copied, 0,0134732 s, 157 MB/s
2107555 bytes (2,1 MB, 2,0 MiB) copied, 0,0118676 s, 178 MB/s
2109893 bytes (2,1 MB, 2,0 MiB) copied, 0,00752797 s, 280 MB/s
2111092 bytes (2,1 MB, 2,0 MiB) copied, 0,00675199 s, 313 MB/s
2109341 bytes (2,1 MB, 2,0 MiB) copied, 0,0140296 s, 150 MB/s
2106612 bytes (2,1 MB, 2,0 MiB) copied, 0,0126313 s, 167 MB/s


It seems that once cached memory reaches the container's RAM limit, the kernel keeps trying to cache further: the limit on the container's total occupied memory kicks in, a small amount of cache is reclaimed, it is immediately refilled by the read, and so on until the read operation completes...
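One way to watch this from the host is the container's memory cgroup; a sketch, assuming container ID 100 and the cgroup-v1 layout used by PVE 5.x:
Code:
# container memory limit and current page-cache usage (cgroup v1; ID 100 is an example)
cat /sys/fs/cgroup/memory/lxc/100/memory.limit_in_bytes
grep -E '^(cache|total_cache) ' /sys/fs/cgroup/memory/lxc/100/memory.stat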

Perhaps this is due to the container settings:
Code:
lxc.apparmor.profile: unconfined
lxc.cgroup.devices.allow: a
lxc.cap.drop:
lxc.mount.auto: proc:rw sys:rw
 
This trouble is not in the hardware layer, I'm sure of it.
Okay, then I will wait for the solution.

Nobody can help you if you are not willing to give them the information they ask for. Maybe you're right; maybe there's another guy out there who already solved such a problem and it was a problem with the hardware. But if you do not want to give some info, you will not find out :)
 
Nobody can help you if you are not willing to give them the information they ask for.
Thank you. If I come to think that the problem is in the hardware, I will post that information. At the moment I do not think so.
 
This is not a Ceph problem.
The problem is hidden in cached memory.
If I drop caches, the performance returns until the cache fills up again:
Code:
# dstat -clrd --disk-util -D rbd3 -i 10
----total-cpu-usage---- ---load-avg--- --io/rbd3-- --dsk/rbd3- rbd3 ----interrupts---
usr sys idl wai hiq siq| 1m   5m  15m | read  writ| read  writ|util| 477   478   480
  6   3  90   0   0   1|37.2 36.4 35.0|10.2  1.29 | 797k  232k|0.89|  17     0     0
 18   7  72   0   0   2|38.3 36.6 35.1|13.7  2.30 |3507k 5552k|5.28| 345     0     0
 18   8  73   0   0   2|37.5 36.5 35.1|13.8  1.90 |3533k 3584k|4.12| 407     0     0
 18   7  72   0   0   2|38.2 36.7 35.2|14.5  1.90 |3712k 3844k|5.44| 247     0     0
 18   8  72   1   0   2|38.5 36.8 35.2|15.2  1.90 |3866k 3998k|5.44| 160     0     0
 18   7  72   0   0   2|39.1 37.0 35.3|13.7  1.90 |3482k 3600k|5.08| 251     0     0
 18   8  72   0   0   2|39.2 37.1 35.4|15.2  1.90 |3891k 3951k|5.64| 259     0     0
 18   7  73   0   0   2|40.3 37.4 35.5|13.9  1.80 |3558k 3693k|5.96| 192     0     0
 18   7  73   0   0   2|40.8 37.6 35.5|14.3  1.80 |3661k 3735k|4.72| 172     0     0
 18   7  73   0   0   2|41.1 37.8 35.6|14.4  1.40 |3635k 1774k|5.28| 275     0     0
 18   8  72   1   0   2|42.1 38.1 35.8|15.2  1.90 |3891k 3736k|5.04| 352     0     0
 18   7  72   0   0   2|41.3 38.1 35.8|14.2  1.90 |3635k 4102k|4.64| 353     0     0
 17   7  73   0   0   2|41.0 38.1 35.8|15.1  2.00 |3814k 3786k|5.40| 386     0     0
 16   6  75   0   0   2|41.7 38.3 35.9|14.0  1.90 |3584k 3933k|5.00| 287     0     0
 # execute in other tty: echo 3 > /proc/sys/vm/drop_caches
 16   6  75   1   0   2|40.8 38.3 35.9| 377  23.5 |  94M   82M|71.3| 230     0     0
 16   7  73   1   0   2|40.9 38.3 36.0| 533  37.1 | 133M  132M|99.2| 435     0     0
 19   9  68   2   0   3|42.5 38.8 36.1| 516  36.8 | 129M  129M|98.8| 515     0     0
 17   6  73   1   0   2|41.9 38.8 36.2| 542  38.3 | 135M  137M|99.1| 406     0     0
 18   7  73   1   0   2|41.6 38.8 36.2| 545  37.8 | 136M  133M|98.8| 420     0     0
 17   6  74   1   0   2|42.0 39.0 36.3| 521  36.7 | 130M  130M|98.8| 235     0     0
 16   6  75   1   0   2|41.3 38.9 36.3| 537  38.4 | 134M  137M|99.5| 260     0     0
 16   6  74   1   0   2|40.9 38.9 36.3| 549  37.3 | 137M  134M|99.4| 298     0     0
 17   6  74   1   0   2|40.7 39.0 36.4| 531  38.1 | 132M  137M|99.2| 440     0     0
 17   6  74   1   0   2|39.5 38.8 36.3| 526  38.1 | 131M  135M|99.3| 551     0     0
 17   6  74   1   0   2|39.1 38.7 36.3| 529  37.1 | 132M  133M|99.6| 265     0     0
 17   6  74   1   0   2|39.3 38.8 36.4| 540  35.8 | 135M  127M|98.9| 328     0     0
 16   6  75   1   0   2|37.3 38.4 36.3| 560  38.6 | 140M  138M|99.8| 498     0     0
 17   6  74   1   0   2|37.1 38.3 36.3| 554  39.0 | 138M  139M|99.4| 505     0     0
 17   6  73   1   0   2|39.5 38.8 36.4| 502  35.1 | 124M  125M|91.5| 243     0     0

I tried to tune vm.vfs_cache_pressure, setting values from 100 to 10000 - no effect, or only a very small one.

Code:
/proc/sys/vm# for v in $(ls -1 dirty_*); do echo -n "$v: "; cat $v; done
dirty_background_bytes: 0
dirty_background_ratio: 10
dirty_bytes: 300000000
dirty_expire_centisecs: 3000
dirty_ratio: 0
dirty_writeback_centisecs: 500
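These knobs can be changed at runtime with sysctl, for example (the values here are only illustrative, not a recommendation):
Code:
# runtime changes
sysctl -w vm.vfs_cache_pressure=1000
sysctl -w vm.dirty_bytes=300000000
# verify
sysctl vm.vfs_cache_pressure vm.dirty_bytes vm.dirty_background_ratio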
 
How much memory does the node have? And does the behavior change when you put the RBD image onto a different pool?

And please update to the latest packages; e.g., currently there is Ceph 12.2.12.
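Copying the image onto a different pool for such a test can be done with rbd, for example (pool and image names are placeholders):
Code:
# copy the container image to another pool, then point the container/storage at the copy
rbd cp sas-pool/vm-100-disk-1 other-pool/vm-100-disk-1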
 
