Ceph 4K Performance

Feb 19, 2019
Hello everyone,

For a few weeks I have been testing a hyper-converged Proxmox/Ceph setup consisting of 4 nodes with 10 SSDs each as OSDs.
Ceph performance at 4M is quite good and roughly matches the results from the Proxmox Ceph benchmark.
At 4K, however, things look different.

So far I have applied the following tuning, which roughly doubled the 4K IOPS:
  • Power saving in the BIOS set to Disabled/Best Performance
  • NUMA pinning of the Ceph OSDs to the socket to which the HBA and NIC are also attached
  • cephx/debug disabled
  • I/O scheduler set to noop
  • MTU set to 9000
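
Two of the items above (I/O scheduler and MTU) can be read back from sysfs to confirm they are actually in effect on every node. The following is only a minimal sketch: it assumes a Linux host, OSD disks that show up as `sd*`, and the usual sysfs layout; the function names are made up for illustration. On multi-queue kernels the equivalent of noop is usually called `none`.

```python
from pathlib import Path


def active_scheduler(text: str) -> str:
    """Return the scheduler marked with [brackets] in a sysfs scheduler line."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some kernels print a single name without brackets.
    return text.strip()


def snapshot(sys_root: str = "/sys") -> dict:
    """Collect the active I/O scheduler per sd* disk and the MTU per interface."""
    root = Path(sys_root)
    result = {"scheduler": {}, "mtu": {}}
    for f in sorted(root.glob("block/sd*/queue/scheduler")):
        result["scheduler"][f.parts[-3]] = active_scheduler(f.read_text())
    for f in sorted(root.glob("class/net/*/mtu")):
        result["mtu"][f.parts[-2]] = int(f.read_text())
    return result
```

Calling `print(snapshot())` on each node should then show `noop` (or `none`) for the OSD disks and 9000 for the Ceph interfaces if the settings were applied.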
And here are my benchmark results:
Ceph RBD with Proxmox default settings
Code:
write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=1
fio-2.16
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/2168KB/0KB /s] [0/542/0 iops] [eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=14359: Wed Feb 27 10:11:56 2019
  write: io=129492KB, bw=2158.2KB/s, iops=539, runt= 60001msec
    slat (usec): min=0, max=4, avg= 0.12, stdev= 0.33
    clat (usec): min=1065, max=10355, avg=1852.47, stdev=347.35
     lat (usec): min=1065, max=10356, avg=1852.60, stdev=347.36
    clat percentiles (usec):
     |  1.00th=[ 1272],  5.00th=[ 1400], 10.00th=[ 1480], 20.00th=[ 1592],
     | 30.00th=[ 1672], 40.00th=[ 1752], 50.00th=[ 1816], 60.00th=[ 1896],
     | 70.00th=[ 1976], 80.00th=[ 2064], 90.00th=[ 2224], 95.00th=[ 2352],
     | 99.00th=[ 2928], 99.50th=[ 3472], 99.90th=[ 4128], 99.95th=[ 4320],
     | 99.99th=[ 8768]
    lat (msec) : 2=72.34%, 4=27.51%, 10=0.15%, 20=0.01%
  cpu          : usr=34.73%, sys=64.95%, ctx=31624, majf=0, minf=10765
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=32373/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=129492KB, aggrb=2158KB/s, minb=2158KB/s, maxb=2158KB/s, mint=60001msec, maxt=60001msec
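
The fio command line itself is not part of the post. Judging from the header line (job name `write`, `rw=write`, `bs=4K`, `ioengine=rbd`, `iodepth=1`, about 60 s runtime), a comparable run could probably be started as sketched below. The pool name, image name, client name and the use of `--time_based` are assumptions, not values from this thread.

```python
import shlex


def fio_rbd_write_cmd(pool, image, client="admin", runtime_s=60):
    """Build a fio argv for a 4K sequential write at queue depth 1 via librbd."""
    return [
        "fio",
        "--name=write",
        "--ioengine=rbd",
        f"--clientname={client}",  # assumption: default admin keyring
        f"--pool={pool}",          # hypothetical pool name
        f"--rbdname={image}",      # hypothetical, pre-created test image
        "--rw=write",
        "--bs=4k",
        "--iodepth=1",
        "--numjobs=1",
        "--time_based",
        f"--runtime={runtime_s}",
    ]


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(fio_rbd_write_cmd("testpool", "fio-test")))
```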

Code:
Total time run:         60.004419
Total writes made:      22256
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     1.44885
Stddev Bandwidth:       0.0305641
Max bandwidth (MB/sec): 1.53125
Min bandwidth (MB/sec): 1.33594
Average IOPS:           370
Stddev IOPS:            7
Max IOPS:               392
Min IOPS:               342
Average Latency(s):     0.00269406
Stddev Latency(s):      0.000369857
Max latency(s):         0.00857078
Min latency(s):         0.00153859
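
A quick way to see that both benchmarks ran with a single outstanding request is Little's law: average requests in flight = throughput × average latency. The sketch below only plugs in the numbers from the two outputs above (default settings); the function name is chosen for illustration.

```python
def in_flight(iops: float, avg_latency_s: float) -> float:
    """Little's law: mean number of requests in flight."""
    return iops * avg_latency_s


# fio: 32373 writes in 60.001 s, average latency 1852.60 usec
fio_depth = in_flight(32373 / 60.001, 1852.60e-6)

# rados bench: 22256 writes in 60.004419 s, average latency 0.00269406 s
rados_depth = in_flight(22256 / 60.004419, 0.00269406)

print(f"fio:   {fio_depth:.3f} requests in flight")
print(f"rados: {rados_depth:.3f} requests in flight")
```

Both values come out very close to 1, so at this queue depth the IOPS figure is essentially the inverse of the per-write latency.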

Ceph RBD with tuning (see above)
Code:
write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=1
fio-2.16
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/3940KB/0KB /s] [0/985/0 iops] [eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=802491: Wed Feb 27 08:57:06 2019
  write: io=266580KB, bw=4442.1KB/s, iops=1110, runt= 60001msec
    slat (usec): min=0, max=5, avg= 0.08, stdev= 0.27
    clat (usec): min=621, max=130157, avg=899.69, stdev=751.33
     lat (usec): min=621, max=130157, avg=899.77, stdev=751.33
    clat percentiles (usec):
     |  1.00th=[  684],  5.00th=[  708], 10.00th=[  724], 20.00th=[  748],
     | 30.00th=[  772], 40.00th=[  796], 50.00th=[  836], 60.00th=[  884],
     | 70.00th=[  932], 80.00th=[  988], 90.00th=[ 1144], 95.00th=[ 1224],
     | 99.00th=[ 1720], 99.50th=[ 2128], 99.90th=[ 2992], 99.95th=[ 4832],
     | 99.99th=[44288]
    lat (usec) : 750=21.46%, 1000=60.06%
    lat (msec) : 2=17.86%, 4=0.53%, 10=0.07%, 20=0.01%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%
  cpu          : usr=30.66%, sys=69.06%, ctx=37694, majf=0, minf=180
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=66645/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=266580KB, aggrb=4442KB/s, minb=4442KB/s, maxb=4442KB/s, mint=60001msec, maxt=60001msec

Code:
Total time run:         60.002227
Total writes made:      44485
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     2.89605
Stddev Bandwidth:       0.0290069
Max bandwidth (MB/sec): 2.94141
Min bandwidth (MB/sec): 2.73438
Average IOPS:           741
Stddev IOPS:            7
Max IOPS:               753
Min IOPS:               700
Average Latency(s):     0.00134834
Stddev Latency(s):      0.000230399
Max latency(s):         0.00339552
Min latency(s):         0.00087189

VM with Debian 9.7 and RBD storage
Code:
write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/3372KB/0KB /s] [0/843/0 iops] [eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=872: Wed Feb 27 09:18:42 2019
  write: io=206284KB, bw=3438.9KB/s, iops=859, runt= 60001msec
    clat (usec): min=778, max=129701, avg=1157.11, stdev=1055.83
     lat (usec): min=779, max=129703, avg=1158.61, stdev=1055.83
    clat percentiles (usec):
     |  1.00th=[  844],  5.00th=[  868], 10.00th=[  892], 20.00th=[  940],
     | 30.00th=[  996], 40.00th=[ 1064], 50.00th=[ 1128], 60.00th=[ 1192],
     | 70.00th=[ 1256], 80.00th=[ 1304], 90.00th=[ 1368], 95.00th=[ 1432],
     | 99.00th=[ 1992], 99.50th=[ 2416], 99.90th=[ 4320], 99.95th=[ 5280],
     | 99.99th=[57088]
    lat (usec) : 1000=30.73%
    lat (msec) : 2=68.28%, 4=0.87%, 10=0.10%, 20=0.01%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%
  cpu          : usr=0.56%, sys=1.43%, ctx=103153, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=51571/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=206284KB, aggrb=3438KB/s, minb=3438KB/s, maxb=3438KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
  sdb: ios=46/102949, merge=0/0, ticks=4/58076, in_queue=58072, util=96.95%

VM with Debian 9.7 and local-zfs storage
Code:
write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/10716KB/0KB /s] [0/2679/0 iops] [eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=886: Wed Feb 27 09:36:39 2019
  write: io=550356KB, bw=9172.5KB/s, iops=2293, runt= 60001msec
    clat (usec): min=315, max=21379, avg=429.92, stdev=228.03
     lat (usec): min=317, max=21380, avg=431.34, stdev=228.03
    clat percentiles (usec):
     |  1.00th=[  334],  5.00th=[  342], 10.00th=[  350], 20.00th=[  358],
     | 30.00th=[  366], 40.00th=[  374], 50.00th=[  390], 60.00th=[  442],
     | 70.00th=[  454], 80.00th=[  470], 90.00th=[  502], 95.00th=[  564],
     | 99.00th=[  820], 99.50th=[ 1720], 99.90th=[ 2352], 99.95th=[ 2576],
     | 99.99th=[ 3504]
    lat (usec) : 500=89.55%, 750=9.34%, 1000=0.27%
    lat (msec) : 2=0.57%, 4=0.26%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=1.33%, sys=3.71%, ctx=275198, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=137589/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=550356KB, aggrb=9172KB/s, minb=9172KB/s, maxb=9172KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
  sda: ios=43/274639, merge=0/0, ticks=0/55660, in_queue=55628, util=92.76%

I have found that our warehouse management software benefits enormously from the higher IOPS of the local-zfs storage. On the Ceph RBD storage, performance is hardly better than on our 5-year-old CentOS KVM host with an HDD RAID 10. :-(
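
To put the gap into numbers, the fio IOPS figures from the runs above can be compared directly. This is just arithmetic on the posted results; the dictionary keys are labels chosen for this sketch.

```python
# 4K write IOPS at queue depth 1, taken from the fio outputs above
results = {
    "rbd_default": 539,    # fio rbd engine, Proxmox default settings
    "rbd_tuned": 1110,     # fio rbd engine, after tuning
    "vm_rbd": 859,         # Debian 9.7 VM on RBD storage, psync
    "vm_local_zfs": 2293,  # Debian 9.7 VM on local-zfs storage, psync
}


def ratio(a: str, b: str) -> float:
    """How many times faster run a is compared to run b."""
    return results[a] / results[b]


print(f"tuned vs. default RBD: {ratio('rbd_tuned', 'rbd_default'):.2f}x")
print(f"local-zfs vs. RBD VM:  {ratio('vm_local_zfs', 'vm_rbd'):.2f}x")
```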

Are there further ways to improve performance, or is there perhaps a general bottleneck in my setup?

Here are the details of my setup:
4 nodes, each with
  • Mainboard Supermicro X11DPi-N
  • 2x Intel Xeon Silver 4116 (12 cores, 2.1 GHz)
  • 256 GB RAM DDR4 PC2666 Reg.
  • 2x SSD Intel DC 4610 480 GB (ZFS RAID1 for Proxmox)
  • 10x SSD Samsung SM836a 2TB (Ceph OSD)
  • Broadcom/LSI 9305-16I HBA
  • Intel 10G X710/X557 Quad Port (2x 10 GbE Ceph private, 2x 10 GbE Ceph public/Proxmox cluster network, each on its own Netgear XS716T switch)
Code:
CPU BOGOMIPS:      201642.96
REGEX/SECOND:      2122779
HD SIZE:           372.49 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     2775.04
DNS EXT:           39.75 ms
DNS INT:           1.00 ms

Code:
iperf -c 10.0.1.1 -P 2
------------------------------------------------------------
Client connecting to 10.0.1.1, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  4] local 10.0.1.2 port 41918 connected with 10.0.1.1 port 5001
[  3] local 10.0.1.2 port 41920 connected with 10.0.1.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  10.2 GBytes  8.80 Gbits/sec
[  3]  0.0-10.0 sec  10.6 GBytes  9.14 Gbits/sec
[SUM]  0.0-10.0 sec  20.9 GBytes  17.9 Gbits/sec

Code:
rtt min/avg/max/mdev = 0.024/0.045/0.113/0.014 ms, ipg/ewma 0.053/0.042 ms
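
As a rough plausibility check, the ping round-trip time can be set against the per-write latency of the tuned fio run (899.77 usec). The sketch assumes a replicated pool in which a write needs about two sequential network round trips (client to primary OSD, primary OSD to replicas in parallel); that count is an assumption, and ICMP RTT tends to understate what the Ceph messenger sees over TCP.

```python
def network_share(rtt_ms: float, round_trips: int, write_latency_us: float) -> float:
    """Fraction of one write's latency spent on pure network round trips."""
    return (rtt_ms * 1000.0 * round_trips) / write_latency_us


# avg RTT 0.045 ms from the ping output, 2 round trips (assumption),
# 899.77 usec average latency from the tuned fio run
share = network_share(0.045, 2, 899.77)
print(f"network round trips: {share:.1%} of the write latency")
```

Under these assumptions the raw wire time is only a small part of each write; the remainder is spent in the software stack and on the devices.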

Many thanks in advance.
Best regards
Patrick
 
With some Ceph and sysctl tuning I was able to increase the IOPS a bit further.

Code:
write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=1
fio-2.16
Starting 1 process
rbd engine: RBD version: 1.12.0
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/5084KB/0KB /s] [0/1271/0 iops] [eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=2347396: Thu Feb 28 16:46:53 2019
  write: io=309260KB, bw=5154.3KB/s, iops=1288, runt= 60001msec
    slat (usec): min=0, max=7, avg= 0.07, stdev= 0.26
    clat (usec): min=635, max=42796, avg=775.45, stdev=196.09
     lat (usec): min=635, max=42796, avg=775.52, stdev=196.09
    clat percentiles (usec):
     |  1.00th=[  668],  5.00th=[  692], 10.00th=[  700], 20.00th=[  716],
     | 30.00th=[  724], 40.00th=[  732], 50.00th=[  748], 60.00th=[  780],
     | 70.00th=[  812], 80.00th=[  828], 90.00th=[  860], 95.00th=[  884],
     | 99.00th=[  972], 99.50th=[ 1448], 99.90th=[ 2384], 99.95th=[ 3056],
     | 99.99th=[ 3664]
    lat (usec) : 750=51.41%, 1000=47.76%
    lat (msec) : 2=0.60%, 4=0.23%, 10=0.01%, 50=0.01%
  cpu          : usr=31.09%, sys=68.52%, ctx=48067, majf=0, minf=145
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=77315/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=309260KB, aggrb=5154KB/s, minb=5154KB/s, maxb=5154KB/s, mint=60001msec, maxt=60001msec

Unfortunately I lack comparison values and cannot put the result into context.

Could someone with a comparable setup perhaps post their benchmark results?
 
Yes.
I have understood that 4K single-threaded I/O performance in Ceph depends very strongly on latency.
So one has to try to reduce latency. I have already achieved that through the tuning measures.
Further performance gains can therefore only be achieved with faster CPUs (higher clock speed) and a network with lower latency (40/100GbE).
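
The relationship can be written down directly: with one request in flight, IOPS is roughly 1 / average latency. The sketch below checks this against the three fio rbd runs in this thread and computes the latency that would be needed to match the local-zfs result; the function names are chosen for illustration.

```python
def iops_at_qd1(avg_latency_us: float) -> float:
    """Expected IOPS with exactly one request in flight."""
    return 1_000_000 / avg_latency_us


def latency_budget_us(target_iops: float) -> float:
    """Average latency per write that a target IOPS at queue depth 1 allows."""
    return 1_000_000 / target_iops


# average total latency (usec) and IOPS as reported by fio above
runs = {"default": (1852.60, 539), "tuned": (899.77, 1110), "tuned + sysctl": (775.52, 1288)}
for name, (lat_us, reported) in runs.items():
    print(f"{name}: predicted {iops_at_qd1(lat_us):.0f} IOPS, fio reported {reported}")

# Matching the 2293 IOPS of the local-zfs VM would need this latency per write:
print(f"budget for 2293 IOPS: {latency_budget_us(2293):.0f} usec")
```

The predictions land within a few IOPS of the reported values, which supports reading these benchmarks as pure latency measurements.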

Given that, I now assume that the performance of the hardware I am using is reasonably fine.
 
