I am getting extremely low write speeds on my minimal 3-node Ceph cluster. Each node has two 1TB Samsung 860 QVO SSDs (6 OSDs in total) and four 10G links in a LAG, split into 5 VLANs. iperf3 shows roughly 10 Gbit/s between every pair of nodes on all VLANs.
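(For reference, the network check was just plain iperf3 between node addresses, roughly as below; the address is an example from the 10.60.10.0/28 range and the exact flags are illustrative.)
Code:
# server on prox1
iperf3 -s
# client on prox2, one run per direction
iperf3 -c 10.60.10.1 -t 30
iperf3 -c 10.60.10.1 -t 30 -R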
Ceph config:
Code:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.60.10.1/28
fsid = cb9aebb9-2aef-4797-872f-75a138c81ac0
mon_allow_pool_delete = true
mon_host = 10.60.10.2 10.60.10.3 10.60.10.1
osd_pool_default_min_size = 2
osd_pool_default_size = 2
public_network = 10.60.10.1/28
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[mon.prox1]
public_addr = 10.60.10.1
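The pool the fio tests below run against is called Ceph1 and should be picking up the defaults above (size 2, min_size 2); if it is useful, that can be confirmed directly with:
Code:
ceph osd pool get Ceph1 size
ceph osd pool get Ceph1 min_size
ceph osd pool get Ceph1 pg_num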
crush map:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
host prox3 {
    id -3           # do not change unnecessarily
    id -4 class ssd # do not change unnecessarily
    # weight 1.818
    alg straw2
    hash 0          # rjenkins1
    item osd.0 weight 0.909
    item osd.1 weight 0.909
}
host prox2 {
    id -5           # do not change unnecessarily
    id -6 class ssd # do not change unnecessarily
    # weight 1.818
    alg straw2
    hash 0          # rjenkins1
    item osd.2 weight 0.909
    item osd.3 weight 0.909
}
host prox1 {
    id -7           # do not change unnecessarily
    id -8 class ssd # do not change unnecessarily
    # weight 1.818
    alg straw2
    hash 0          # rjenkins1
    item osd.4 weight 0.909
    item osd.5 weight 0.909
}
root default {
    id -1           # do not change unnecessarily
    id -2 class ssd # do not change unnecessarily
    # weight 5.455
    alg straw2
    hash 0          # rjenkins1
    item prox3 weight 1.818
    item prox2 weight 1.818
    item prox1 weight 1.818
}
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map
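With the replicated_rule above and a pool size of 2, every write is placed on OSDs in two different hosts and is only acknowledged once both replicas have committed, so a single slow OSD or host drags down all write latency. The standard ceph CLI can sanity-check placement and per-OSD latency (nothing cluster-specific assumed here):
Code:
ceph osd df tree                    # weights and utilisation per host/OSD
ceph osd pool get Ceph1 crush_rule
ceph osd perf                       # per-OSD commit/apply latency in ms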
Here are the results of my fio test (I got similar results inside of a Windows VM):
Code:
root@prox1:~# fio -ioengine=rbd -name=test -direct=1 -rw=read -bs=4M -iodepth=16 -pool=Ceph1 -rbdname=vm-111-disk-0
test: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=rbd, iodepth=16
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [R(1)][95.6%][r=2258MiB/s][r=564 IOPS][eta 00m:02s]
test: (groupid=0, jobs=1): err= 0: pid=3192071: Tue Feb 16 15:59:22 2021
read: IOPS=192, BW=771MiB/s (809MB/s)(32.0GiB/42477msec)
slat (nsec): min=1867, max=169154, avg=15109.84, stdev=8685.58
clat (msec): min=6, max=2865, avg=82.94, stdev=148.54
lat (msec): min=6, max=2865, avg=82.95, stdev=148.54
clat percentiles (msec):
| 1.00th=[ 13], 5.00th=[ 15], 10.00th=[ 17], 20.00th=[ 20],
| 30.00th=[ 23], 40.00th=[ 26], 50.00th=[ 27], 60.00th=[ 47],
| 70.00th=[ 82], 80.00th=[ 112], 90.00th=[ 165], 95.00th=[ 279],
| 99.00th=[ 735], 99.50th=[ 1083], 99.90th=[ 1703], 99.95th=[ 1921],
| 99.99th=[ 2869]
bw ( KiB/s): min=49152, max=2932736, per=98.53%, avg=778365.46, stdev=673798.19, samples=83
iops : min= 12, max= 716, avg=189.99, stdev=164.53, samples=83
lat (msec) : 10=0.56%, 20=21.19%, 50=39.25%, 100=15.05%, 250=18.21%
lat (msec) : 500=3.58%, 750=1.21%, 1000=0.38%
cpu : usr=0.92%, sys=0.39%, ctx=8209, majf=13, minf=134
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.8%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=8192,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=771MiB/s (809MB/s), 771MiB/s-771MiB/s (809MB/s-809MB/s), io=32.0GiB (34.4GB), run=42477-42477msec
root@prox1:~# fio -ioengine=rbd -name=test -direct=1 -rw=write -bs=4M -iodepth=16 -pool=Ceph1 -rbdname=vm-111-disk-0
test: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=rbd, iodepth=16
fio-3.12
Starting 1 process
Jobs: 1 (f=0): [f(1)][100.0%][w=60.0MiB/s][w=15 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3195178: Tue Feb 16 16:34:25 2021
write: IOPS=3, BW=15.9MiB/s (16.7MB/s)(32.0GiB/2056095msec); 0 zone resets
slat (usec): min=707, max=15717, avg=2536.78, stdev=1700.41
clat (msec): min=246, max=14489, avg=4013.08, stdev=2070.79
lat (msec): min=248, max=14491, avg=4015.62, stdev=2071.01
clat percentiles (msec):
| 1.00th=[ 506], 5.00th=[ 894], 10.00th=[ 1401], 20.00th=[ 2165],
| 30.00th=[ 2802], 40.00th=[ 3306], 50.00th=[ 3842], 60.00th=[ 4396],
| 70.00th=[ 4933], 80.00th=[ 5738], 90.00th=[ 6812], 95.00th=[ 7617],
| 99.00th=[ 9597], 99.50th=[10537], 99.90th=[12013], 99.95th=[13489],
| 99.99th=[14429]
bw ( KiB/s): min= 8175, max=139264, per=100.00%, avg=30679.23, stdev=25862.97, samples=2183
iops : min= 1, max= 34, avg= 7.41, stdev= 6.31, samples=2183
lat (msec) : 250=0.01%, 500=0.96%, 750=2.04%, 1000=2.89%
cpu : usr=0.76%, sys=0.26%, ctx=4217, majf=1, minf=2430160
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.8%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,8192,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=15.9MiB/s (16.7MB/s), 15.9MiB/s-15.9MiB/s (16.7MB/s-16.7MB/s), io=32.0GiB (34.4GB), run=2056095-2056095msec
It does not seem to be the network: reads average ~770 MiB/s with bursts over 2 GiB/s, which is close to the 10 Gbit line rate, yet writes collapse to ~16 MiB/s. Any ideas why the write speeds are so crippled?
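If it helps narrow things down, I can also run fio against one of the SSDs directly, outside Ceph, to see how the drives handle small synchronous writes; a raw-device run would look roughly like this (the device name is a placeholder, and this destroys data on that disk):
Code:
# WARNING: writes straight to the device and destroys its contents
fio -name=ssd-sync-test -filename=/dev/sdX -ioengine=libaio -direct=1 -sync=1 \
    -rw=write -bs=4k -iodepth=1 -numjobs=1 -runtime=60 -time_based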
Thanks,
Stan