How to optimize Ceph performance in Proxmox?

TseKing

New Member
Sep 3, 2024
I use four Dell R740 servers with 8 SSD slots each to deploy Proxmox in the lab. Two of the slots hold a RAID1 pair for the system install, and the other six hold Samsung 870 EVO drives used as Ceph storage.
After deployment, Ceph's performance is very poor. I tested with fio and the IOPS are very low.
Code:
root@localhost:~# fio --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300 --name=mytest --filename=/dev/sdc
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=152KiB/s][w=68 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=1762: Sat Jan 11 10:14:05 2025
  write: IOPS=70, BW=159KiB/s (163kB/s)(22.0MiB/141957msec); 0 zone resets
    clat (usec): min=3, max=19760, avg=1006.17, stdev=1187.41
     lat (usec): min=4, max=19761, avg=1007.18, stdev=1187.41
    clat percentiles (usec):
     |  1.00th=[   12],  5.00th=[   14], 10.00th=[   15], 20.00th=[   17],
     | 30.00th=[   20], 40.00th=[   24], 50.00th=[  857], 60.00th=[  963],
     | 70.00th=[ 1205], 80.00th=[ 2073], 90.00th=[ 2704], 95.00th=[ 3064],
     | 99.00th=[ 4686], 99.50th=[ 5735], 99.90th=[ 7308], 99.95th=[ 8586],
     | 99.99th=[10290]
   bw (  KiB/s): min=    8, max=  202, per=99.56%, avg=158.47, stdev=23.34, samples=283
   iops        : min=    4, max=   90, avg=70.73, stdev=10.38, samples=283
  lat (usec)   : 4=0.01%, 10=0.58%, 20=32.04%, 50=11.01%, 100=0.07%
  lat (usec)   : 250=0.13%, 750=1.62%, 1000=17.14%
  lat (msec)   : 2=16.54%, 4=19.27%, 10=1.58%, 20=0.02%
  fsync/fdatasync/sync_file_range:
    sync (msec): min=5, max=620, avg=13.14, stdev= 8.20
    sync percentiles (msec):
     |  1.00th=[    7],  5.00th=[    9], 10.00th=[   10], 20.00th=[   11],
     | 30.00th=[   11], 40.00th=[   12], 50.00th=[   13], 60.00th=[   14],
     | 70.00th=[   15], 80.00th=[   16], 90.00th=[   18], 95.00th=[   20],
     | 99.00th=[   25], 99.50th=[   29], 99.90th=[   67], 99.95th=[  127],
     | 99.99th=[  230]
  cpu          : usr=0.24%, sys=1.12%, ctx=35726, majf=0, minf=16
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10030,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=159KiB/s (163kB/s), 159KiB/s-159KiB/s (163kB/s-163kB/s), io=22.0MiB (23.1MB), run=141957-141957msec

Disk stats (read/write):
  sdc: ios=5680/20049, merge=0/5620, ticks=9650/130874, in_queue=142146, util=100.00%
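For comparison, the 4 KiB direct sync-write run that is often used to judge whether an SSD is suitable for Ceph looks roughly like this (only a sketch; the runtime is an example, and it writes directly to /dev/sdc, so the disk must not hold anything you care about):
Code:
# 4 KiB synchronous direct writes for 60 s at queue depth 1
fio --name=ssd-sync-test --filename=/dev/sdc --rw=write --bs=4k \
    --ioengine=libaio --direct=1 --fsync=1 --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based
Drives with power loss protection typically sustain thousands of sync-write IOPS in this kind of test; my results above are clearly far below that.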

My optimization configuration is as follows:
- enabled krbd
- discard: true
- IO thread: true
- Backup: false
- Skip replication: true
- Async IO: native
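
For reference, those settings end up roughly as the following entries; the storage name, pool name and VM ID are just placeholders for illustration:
Code:
# /etc/pve/storage.cfg -- RBD storage with krbd enabled
rbd: ceph-vm
        content images
        krbd 1
        pool ceph-vm

# disk line in /etc/pve/qemu-server/100.conf with the options listed above
# (iothread only takes effect with the VirtIO SCSI single controller or VirtIO block)
scsi0: ceph-vm:vm-100-disk-0,aio=native,backup=0,discard=on,iothread=1,replicate=0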

Are there any omissions or errors in my optimizations?

I see that many people do not recommend using consumer-grade SSDs, so are there any recommended enterprise-grade SSDs?
 
The 870 EVO is at the very bottom of consumer grade; even a desktop won't run well on it. That being said, what do you expect, and what is your networking setup, etc.?

You're also doing just one fio test. Ceph has its own performance testing tools, and fio itself has lots of tweaks and knobs. What is your workload, what is your expectation, what is your architecture, and are you actually testing for that workload?
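
For the Ceph side of it, rados bench at least exercises the full replication and network path instead of a single raw disk; the pool name, runtime and thread count below are just examples:
Code:
# 4 MiB writes for 60 s, 16 concurrent ops; keep the objects for the read test
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
# sequential reads of the objects written above
rados bench -p testpool 60 seq -t 16
# remove the benchmark objects afterwards
rados -p testpool cleanup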

Note that in a default Ceph setup, a single block write results in one write to 3 different disks on 3 different nodes, so for simple single-depth tests you will be limited by the slowest disk in the slowest node, plus a 3x amplification of network traffic. But most IO is not one-off block writes; most IO is done with significant queue depths and a mix of reads and writes, with caches, etc.
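
You can check what your pools are actually configured to do, roughly like this (the pool name is an example):
Code:
# replication settings for all pools -- a default replicated pool shows size 3, min_size 2
ceph osd pool ls detail
# or query a single pool directly
ceph osd pool get ceph-vm size
ceph osd pool get ceph-vm min_size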

You don't even have to go that high-end, but DC (datacenter) SSDs are a must for anything resembling a functional cluster (regardless of the filesystem; ZFS has the same problem). The 870 EVO has latencies of 2-5 ms under load; at that point it's basically a spinning drive.
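
You can also see this directly on the cluster: ceph osd perf reports per-OSD commit/apply latency in milliseconds, so it shows straight away whether the OSD devices are the bottleneck:
Code:
# per-OSD commit and apply latency in ms; consumer SSDs under sustained sync writes
# tend to sit far above what DC SSDs with PLP show
ceph osd perf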
 
You don't even have to go that high-end, but DC (datacenter) SSDs are a must for anything resembling a functional cluster (regardless of the filesystem; ZFS has the same problem). The 870 EVO has latencies of 2-5 ms under load; at that point it's basically a spinning drive.

This. Used enterprise SSDs with power loss protection (PLP) should work and are quite affordable. But Ceph also needs a fast network.
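
To rule out the network as the bottleneck, an iperf3 run between the nodes over the network Ceph actually uses is a quick sanity check (the IP address and stream count are just examples):
Code:
# on one node
iperf3 -s
# on another node, 4 parallel streams to the first node's Ceph network address
iperf3 -c 192.168.100.11 -P 4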

@UdoB did a great writeup on Ceph in such environments: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/
 