How to optimize Ceph performance in Proxmox?

TseKing

New Member
Sep 3, 2024
I use four Dell R740 servers with 8 SSD slots each to deploy Proxmox in the lab. Two of the slots hold a RAID1 pair for the system install, and the other six hold Samsung 870 EVO drives used as Ceph storage.
After deployment, Ceph's performance is very poor. I tested with fio and the IOPS are very low.
Code:
root@localhost:~# fio --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300 --name=mytest --filename=/dev/sdc
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=152KiB/s][w=68 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=1762: Sat Jan 11 10:14:05 2025
  write: IOPS=70, BW=159KiB/s (163kB/s)(22.0MiB/141957msec); 0 zone resets
    clat (usec): min=3, max=19760, avg=1006.17, stdev=1187.41
     lat (usec): min=4, max=19761, avg=1007.18, stdev=1187.41
    clat percentiles (usec):
     |  1.00th=[   12],  5.00th=[   14], 10.00th=[   15], 20.00th=[   17],
     | 30.00th=[   20], 40.00th=[   24], 50.00th=[  857], 60.00th=[  963],
     | 70.00th=[ 1205], 80.00th=[ 2073], 90.00th=[ 2704], 95.00th=[ 3064],
     | 99.00th=[ 4686], 99.50th=[ 5735], 99.90th=[ 7308], 99.95th=[ 8586],
     | 99.99th=[10290]
   bw (  KiB/s): min=    8, max=  202, per=99.56%, avg=158.47, stdev=23.34, samples=283
   iops        : min=    4, max=   90, avg=70.73, stdev=10.38, samples=283
  lat (usec)   : 4=0.01%, 10=0.58%, 20=32.04%, 50=11.01%, 100=0.07%
  lat (usec)   : 250=0.13%, 750=1.62%, 1000=17.14%
  lat (msec)   : 2=16.54%, 4=19.27%, 10=1.58%, 20=0.02%
  fsync/fdatasync/sync_file_range:
    sync (msec): min=5, max=620, avg=13.14, stdev= 8.20
    sync percentiles (msec):
     |  1.00th=[    7],  5.00th=[    9], 10.00th=[   10], 20.00th=[   11],
     | 30.00th=[   11], 40.00th=[   12], 50.00th=[   13], 60.00th=[   14],
     | 70.00th=[   15], 80.00th=[   16], 90.00th=[   18], 95.00th=[   20],
     | 99.00th=[   25], 99.50th=[   29], 99.90th=[   67], 99.95th=[  127],
     | 99.99th=[  230]
  cpu          : usr=0.24%, sys=1.12%, ctx=35726, majf=0, minf=16
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10030,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=159KiB/s (163kB/s), 159KiB/s-159KiB/s (163kB/s-163kB/s), io=22.0MiB (23.1MB), run=141957-141957msec

Disk stats (read/write):
  sdc: ios=5680/20049, merge=0/5620, ticks=9650/130874, in_queue=142146, util=100.00%
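For comparison, the 4 KiB direct sync-write run that is often used to judge whether an SSD is suitable for Ceph looks roughly like this (only a sketch; the runtime is an example, and it writes directly to /dev/sdc, so the disk must not hold anything you care about):
Code:
# 4 KiB synchronous direct writes for 60 s at queue depth 1
fio --name=ssd-sync-test --filename=/dev/sdc --rw=write --bs=4k \
    --ioengine=libaio --direct=1 --fsync=1 --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based
Drives with power loss protection typically sustain thousands of sync-write IOPS in this kind of test; my results above are clearly far below that.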

My optimization configuration is as follows:
- enabled krbd
- discard: true
- IO thread: true
- Backup: false
- Skip replication: true
- Async IO: native
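
For reference, those settings end up roughly as the following entries; the storage name, pool name and VM ID are just placeholders for illustration:
Code:
# /etc/pve/storage.cfg -- RBD storage with krbd enabled
rbd: ceph-vm
        content images
        krbd 1
        pool ceph-vm

# disk line in /etc/pve/qemu-server/100.conf with the options listed above
# (iothread only takes effect with the VirtIO SCSI single controller or VirtIO block)
scsi0: ceph-vm:vm-100-disk-0,aio=native,backup=0,discard=on,iothread=1,replicate=0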

Are there any omissions or errors in my optimizations?

I see that many people do not recommend using consumer-grade SSDs, so are there any recommended enterprise-grade SSDs?
 
The 870 EVO is at the very bottom of consumer grade; even a desktop won't run well on it. That being said, what do you expect, and what is your networking setup, etc.?

You're also doing just one fio test. Ceph has its own performance testing tools, and fio itself has lots of tweaks and knobs. What is your workload, what is your expectation, what is your architecture, and are you actually testing for that workload?
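
For the Ceph side of it, rados bench at least exercises the full replication and network path instead of a single raw disk; the pool name, runtime and thread count below are just examples:
Code:
# 4 MiB writes for 60 s, 16 concurrent ops; keep the objects for the read test
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
# sequential reads of the objects written above
rados bench -p testpool 60 seq -t 16
# remove the benchmark objects afterwards
rados -p testpool cleanup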

Note that in a default Ceph setup, a single block write results in one write to 3 different disks on 3 different nodes, so for simple single-depth tests you will be limited by the slowest disk in the slowest node, plus a 3x amplification of network traffic. But most IO is not one-off block writes; most IO is done with significant queue depths and a mix of reads and writes, with caches, etc.
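
You can check what your pools are actually configured to do, roughly like this (the pool name is an example):
Code:
# replication settings for all pools -- a default replicated pool shows size 3, min_size 2
ceph osd pool ls detail
# or query a single pool directly
ceph osd pool get ceph-vm size
ceph osd pool get ceph-vm min_size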

You don't even have to go that high-end, but DC (datacenter) SSDs are a must for anything resembling a functional cluster (regardless of the filesystem; ZFS has the same problem). The 870 EVO has latencies of 2-5 ms under load; at that point it's basically a spinning drive.
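
You can also see this directly on the cluster: ceph osd perf reports per-OSD commit/apply latency in milliseconds, so it shows straight away whether the OSD devices are the bottleneck:
Code:
# per-OSD commit and apply latency in ms; consumer SSDs under sustained sync writes
# tend to sit far above what DC SSDs with PLP show
ceph osd perf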
 
You don't even have to go that high-end, but DC (datacenter) SSDs are a must for anything resembling a functional cluster (regardless of the filesystem; ZFS has the same problem). The 870 EVO has latencies of 2-5 ms under load; at that point it's basically a spinning drive.

This. Used enterprise SSDs with power loss protection (PLP) should work and are quite affordable. But Ceph also needs a fast network.
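
To rule out the network as the bottleneck, an iperf3 run between the nodes over the network Ceph actually uses is a quick sanity check (the IP address and stream count are just examples):
Code:
# on one node
iperf3 -s
# on another node, 4 parallel streams to the first node's Ceph network address
iperf3 -c 192.168.100.11 -P 4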

@UdoB did a great writeup on Ceph in such environments: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/
 