Ceph performance advice


New Member
May 16, 2022

We are about to replace our current IT infrastructure with a small Proxmox cluster of 5 nodes:
  • Dell Precision R7920
  • Xeon Gold 5118 or Xeon Silver 4112
  • 128 GB ECC RAM
  • Data/VM disks (Ceph): 3× 16 TB Exos HDDs per node (15 in total).
  • System/ISO disk on an NVMe SSD via a PCIe adapter. Currently these are Kingston SFYRD/2000G, but we'll replace them with enterprise-grade NVMe SSDs before going to production.

One of the issues with our current infrastructure is that it is difficult to scale up (we have one 'big' Dell R520 node running everything under ESXi 6.0, attached to a PowerVault MD1200 DAS that stores everything). There are a lot of single points of failure, we can't add more disks, and all the hardware is ~8 years old anyway. We don't have any problem with it, but it's high time to move on, and management finally agreed.

Among everything Proxmox embeds, Ceph is very interesting because in the coming years we will need to increase storage space more than anything else. Unfortunately, after several weeks of testing and benchmarking in every direction, I can't get satisfactory IO and bandwidth performance. I wasn't expecting it to be ultra fast (I don't need that) considering I use HDDs, but I would have liked something similar to a basic laptop HDD in terms of IO and bandwidth for each VM. We are far from it: ~15 minutes to boot a Windows 10 VM with nothing installed on it.

Here are the optimisations I've already applied (interesting read: https://yourcmc.ru/wiki/Ceph_performance):
  • Proxmox (7.3-6) and Ceph (17.2.5) with BlueStore are on their latest versions.
  • Tried EC 3+2, EC 2+2 and 3× replicated pools, with compression ranging from none to aggressive (lz4, zlib, zstd or snappy).
  • Disabled cephx
  • Disabled ceph debugging
  • MTU set to 9000
  • All VMs use the VirtIO block driver, with cache set to default, none, writeback or unsafe.
  • All HDDs are connected to a PERC H330 card in HBA mode.
  • Using a 30 GB LVM volume on the NVMe SSD for the WAL+DB of every disk (I know 30 GB is old advice and I could allot more now, but I use very little data for my tests, so that should be OK).
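For reference, the WAL/DB placement above can be done per OSD with ceph-volume; a sketch, assuming a hypothetical LVM volume group vg_nvme on the NVMe SSD and /dev/sdb as one of the HDDs:

```shell
# Hypothetical names: vg_nvme is an LVM VG on the NVMe SSD, /dev/sdb an HDD.
# Carve a 30 GB logical volume for this OSD's RocksDB + WAL...
lvcreate -L 30G -n db-sdb vg_nvme
# ...then create the OSD with data on the HDD and DB/WAL on the NVMe LV.
ceph-volume lvm create --data /dev/sdb --block.db vg_nvme/db-sdb
```

This is a provisioning sketch, not something to paste blindly; it assumes the VG already exists and the HDD is empty.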
I noticed my SSDs handle fsync() poorly since they don't have supercapacitors, so on every node I changed (systemctl edit --full ceph-osd@N.service) the Ceph OSD start-up command from:
ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
to:
ExecStart=/usr/bin/eatmydata -- /usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
so every OSD ignores fsync, hoping to emulate an enterprise SSD without buying one yet (it should affect the HDDs too, so performance should be even better).

This last change clearly helped with the fio benchmarks, but in practice it doesn't feel like it (~15 minutes to start a blank Windows 10...).
root@test:~# fio -ioengine=libaio -direct=1 -name=test --size=1GB -bs=4M -iodepth=16 -rw=write -runtime=60 -filename=./fio.file
test: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=16
Starting 1 process
test: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=116MiB/s][w=29 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=19603: Tue Feb 28 13:14:43 2023
  write: IOPS=36, BW=144MiB/s (151MB/s)(1024MiB/7100msec); 0 zone resets
    slat (usec): min=140, max=270825, avg=1351.00, stdev=16908.32
    clat (msec): min=155, max=1385, avg=440.46, stdev=195.80
     lat (msec): min=155, max=1385, avg=441.81, stdev=195.45
    clat percentiles (msec):
     |  1.00th=[  178],  5.00th=[  224], 10.00th=[  243], 20.00th=[  284],
     | 30.00th=[  317], 40.00th=[  351], 50.00th=[  397], 60.00th=[  439],
     | 70.00th=[  498], 80.00th=[  550], 90.00th=[  709], 95.00th=[  793],
     | 99.00th=[ 1267], 99.50th=[ 1334], 99.90th=[ 1385], 99.95th=[ 1385],
     | 99.99th=[ 1385]
   bw (  KiB/s): min=65536, max=196608, per=95.49%, avg=141019.43, stdev=37633.66, samples=14
   iops        : min=   16, max=   48, avg=34.43, stdev= 9.19, samples=14
  lat (msec)   : 250=11.33%, 500=60.55%, 750=21.88%, 1000=4.30%, 2000=1.95%
  cpu          : usr=0.94%, sys=0.20%, ctx=271, majf=0, minf=13
  IO depths    : 1=0.4%, 2=0.8%, 4=1.6%, 8=3.1%, 16=94.1%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.6%, 8=0.0%, 16=0.4%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,256,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=144MiB/s (151MB/s), 144MiB/s-144MiB/s (151MB/s-151MB/s), io=1024MiB (1074MB), run=7100-7100msec

Disk stats (read/write):
  rbd0: ios=0/258, merge=0/324, ticks=0/108522, in_queue=108523, util=96.99%

root@test:~# fio -ioengine=libaio -direct=1 -sync=1 -name=test --size=1GB -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=./fio.file
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=108KiB/s][w=27 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=19611: Tue Feb 28 13:16:14 2023
  write: IOPS=35, BW=142KiB/s (146kB/s)(8548KiB/60010msec); 0 zone resets
    slat (usec): min=10, max=196, avg=35.01, stdev=22.38
    clat (msec): min=3, max=378, avg=28.04, stdev=55.99
     lat (msec): min=3, max=378, avg=28.07, stdev=55.99
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    9], 60.00th=[   10],
     | 70.00th=[   12], 80.00th=[   17], 90.00th=[  120], 95.00th=[  174],
     | 99.00th=[  257], 99.50th=[  313], 99.90th=[  368], 99.95th=[  372],
     | 99.99th=[  380]
   bw (  KiB/s): min=   32, max=  336, per=99.69%, avg=142.66, stdev=57.55, samples=119
   iops        : min=    8, max=   84, avg=35.66, stdev=14.39, samples=119
  lat (msec)   : 4=0.05%, 10=64.86%, 20=20.82%, 50=3.42%, 100=0.56%
  lat (msec)   : 250=9.12%, 500=1.17%
  cpu          : usr=0.07%, sys=0.15%, ctx=2237, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2137,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=142KiB/s (146kB/s), 142KiB/s-142KiB/s (146kB/s-146kB/s), io=8548KiB (8753kB), run=60010-60010msec

Disk stats (read/write):
  rbd0: ios=0/6415, merge=0/2134, ticks=0/60142, in_queue=60142, util=98.62%
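Those QD1 sync numbers look like a per-write latency floor rather than a bandwidth limit; a back-of-envelope check using the avg IOPS from the run above:

```shell
# ~35.66 IOPS at iodepth=1 means each 4k sync write takes about 1000/35.66 ms,
# matching the ~28 ms avg clat fio reports: network round-trips to the primary
# and replica OSDs plus HDD commit latency dominate, not raw throughput.
awk 'BEGIN { printf "%.1f ms per write\n", 1000 / 35.66 }'
```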

root@test:~# fio -ioengine=libaio -direct=1 -name=test --size=1GB -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=./fio.file
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1737KiB/s][w=434 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=19620: Tue Feb 28 13:17:38 2023
  write: IOPS=410, BW=1642KiB/s (1682kB/s)(96.9MiB/60397msec); 0 zone resets
    slat (usec): min=2, max=606157, avg=178.42, stdev=8653.83
    clat (usec): min=1148, max=1038.5k, avg=311549.85, stdev=200609.71
     lat (usec): min=1151, max=1094.1k, avg=311728.43, stdev=200740.08
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    6], 10.00th=[   16], 20.00th=[  117],
     | 30.00th=[  201], 40.00th=[  259], 50.00th=[  313], 60.00th=[  363],
     | 70.00th=[  422], 80.00th=[  481], 90.00th=[  558], 95.00th=[  676],
     | 99.00th=[  802], 99.50th=[  835], 99.90th=[  978], 99.95th=[ 1003],
     | 99.99th=[ 1036]
   bw (  KiB/s): min=    8, max= 2952, per=100.00%, avg=1644.73, stdev=427.84, samples=120
   iops        : min=    2, max=  738, avg=411.18, stdev=106.96, samples=120
  lat (msec)   : 2=0.37%, 4=2.57%, 10=4.64%, 20=3.34%, 50=3.50%
  lat (msec)   : 100=4.10%, 250=19.79%, 500=44.29%, 750=15.29%, 1000=2.06%
  lat (msec)   : 2000=0.05%
  cpu          : usr=0.19%, sys=0.38%, ctx=7376, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,24798,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=1642KiB/s (1682kB/s), 1642KiB/s-1642KiB/s (1682kB/s-1682kB/s), io=96.9MiB (102MB), run=60397-60397msec

Disk stats (read/write):
  rbd0: ios=0/24830, merge=0/11, ticks=0/7438238, in_queue=7438238, util=99.36%
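Going from QD1 to QD128 only gains ~11× IOPS, which again points at latency rather than bandwidth; applying Little's law (avg latency = queue depth / IOPS) to the deep-queue run:

```shell
# 410 IOPS at iodepth=128 implies roughly 128/410 s of average completion
# latency per IO, which matches fio's reported avg clat of ~311 ms.
awk 'BEGIN { printf "%.0f ms avg latency\n", 128 / 410 * 1000 }'
```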

The only remaining improvement I see is the one thing I haven't mentioned yet: I'm using a 4×1 GbE balance-alb LAG on each server, which I know is below the recommended speed.
==> Do you think buying 10 or 25 GbE equipment for every server would improve my situation by a wide margin?
I still have budget for this upgrade, but I can't afford to spend everything on something that might not help much.
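Before buying, it may be worth measuring what the current LAG actually delivers between two nodes; a sketch with iperf3 (the IP is a placeholder):

```shell
# On node A, start an iperf3 server:
#   iperf3 -s
# On node B, test against it (placeholder address). A single TCP stream over a
# balance-alb LAG is typically limited to one link (~1 Gbit/s), and a single
# Ceph replication stream behaves the same way, so the single-stream result
# (-P 1) is the one that matters for OSD write latency:
iperf3 -c 192.0.2.10 -P 1 -t 10
```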

If you have any other idea, or if I missed something, I'd gladly read it too :)

Many thanks for your time.
Did you ever get this solved?

