Proxmox Ceph configuration

patrickc

New Member
Nov 15, 2024
I have the following cluster I have set up:

Three Dell R740s, each with 512GB of RAM and two 2.6 GHz (3.9 GHz turbo) processors. Each has two Micron 9300 7.84TB NVMe drives.
One Threadripper Pro with 24 cores (4.5 GHz boost) and 128GB of RAM. It also has two Micron 9300 7.84TB NVMe drives.

They are currently connected over a shared 25GbE network, though with no other traffic running on it at the moment. I will move Ceph onto its own network once I get more DAC cables.
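
For reference, this is roughly how I expect the network split to look in /etc/pve/ceph.conf once the dedicated link is in place (the subnets below are placeholders, not my actual addressing):

Code:
[global]
    # front-side traffic from clients/VMs (placeholder subnet)
    public_network = 10.10.10.0/24
    # back-side replication traffic on the dedicated 25GbE link (placeholder subnet)
    cluster_network = 10.10.20.0/24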

The goal of the Ceph cluster is good performance for running VMs, serving basic development/testing websites, and running MySQL within the VMs. The pool is currently set to 2 replicas: there are recent backups of everything, and since this is mostly dev/testing right now, durability is not the most critical factor. I have created OSDs for every disk and added them to a pretty basic setup, and I have created a metadata server and a monitor on each node. For testing, only two of the R740s and the Threadripper are live. I am seeing pretty decent throughput writing a large file, ~325MB/s, which I think looks fairly good. This is running with the default MTU; I realize I will be setting it to 9000 on the dedicated Ceph network.
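
If it helps, the commands below are roughly what I used (and plan to use) for the pool replication and the MTU change; the pool name and interface name are placeholders for mine:

Code:
# check and set the replication level on the VM pool (pool name is a placeholder)
ceph osd pool get vm-pool size
ceph osd pool set vm-pool size 2

# planned MTU change for the dedicated Ceph interface (interface name is a placeholder)
ip link set ens2f0np0 mtu 9000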

However, random IO seems fairly low, and things like installing a VM, updating packages, or doing a large MySQL insert feel pretty laggy and seem to "stutter". I ran a few benchmarks I found here:

Code:
fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.36
Starting 4 processes
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
Jobs: 1 (f=1): [_(2),r(1),_(1)][100.0%][r=3516KiB/s][r=879 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=2445: Tue Nov 19 11:44:43 2024
  read: IOPS=4625, BW=18.1MiB/s (18.9MB/s)(2048MiB/113347msec)
    slat (usec): min=304, max=8550, avg=802.97, stdev=274.21
    clat (usec): min=6, max=25200, avg=12228.31, stdev=1643.44
     lat (usec): min=661, max=26788, avg=13031.28, stdev=1725.81
    clat percentiles (usec):
     |  1.00th=[ 8979],  5.00th=[ 9896], 10.00th=[10290], 20.00th=[10945],
     | 30.00th=[11338], 40.00th=[11731], 50.00th=[12125], 60.00th=[12518],
     | 70.00th=[12911], 80.00th=[13304], 90.00th=[14091], 95.00th=[14746],
     | 99.00th=[17695], 99.50th=[19792], 99.90th=[22676], 99.95th=[23200],
     | 99.99th=[24249]
   bw (  KiB/s): min=15334, max=23264, per=100.00%, avg=19695.06, stdev=324.71, samples=852
   iops        : min= 3833, max= 5816, avg=4922.99, stdev=81.16, samples=852
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=6.11%, 20=93.40%, 50=0.48%
  cpu          : usr=1.98%, sys=9.30%, ctx=524352, majf=0, minf=175
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16


Run status group 0 (all jobs):
   READ: bw=18.1MiB/s (18.9MB/s), 18.1MiB/s-18.1MiB/s (18.9MB/s-18.9MB/s), io=2048MiB (2147MB), run=113347-113347msec


Disk stats (read/write):
    dm-0: ios=524239/286, sectors=4193912/3080, merge=0/0, ticks=385207/801, in_queue=386008, util=99.58%, aggrios=524288/158, aggsectors=4194304/3080, aggrmerge=0/128, aggrticks=391943/424, aggrin_queue=392393, aggrutil=74.70%
  sda: ios=524288/158, sectors=4194304/3080, merge=0/128, ticks=391943/424, in_queue=392393, util=74.70%


Write:

Code:
# fio --directory=/home/ansible --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=1G --numjobs=1
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=472KiB/s][w=118 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=2490: Tue Nov 19 12:04:55 2024
  write: IOPS=118, BW=473KiB/s (485kB/s)(27.7MiB/60005msec); 0 zone resets
    clat (usec): min=5368, max=23985, avg=8440.05, stdev=1175.61
     lat (usec): min=5369, max=23986, avg=8441.07, stdev=1175.64
    clat percentiles (usec):
     |  1.00th=[ 6325],  5.00th=[ 6915], 10.00th=[ 7177], 20.00th=[ 7504],
     | 30.00th=[ 7767], 40.00th=[ 8094], 50.00th=[ 8291], 60.00th=[ 8586],
     | 70.00th=[ 8979], 80.00th=[ 9372], 90.00th=[ 9896], 95.00th=[10159],
     | 99.00th=[11994], 99.50th=[13042], 99.90th=[17695], 99.95th=[20841],
     | 99.99th=[23987]
   bw (  KiB/s): min=  376, max=  576, per=100.00%, avg=474.47, stdev=50.16, samples=119
   iops        : min=   94, max=  144, avg=118.55, stdev=12.56, samples=119
  lat (msec)   : 10=92.97%, 20=6.95%, 50=0.07%
  cpu          : usr=0.31%, sys=1.84%, ctx=14209, majf=0, minf=15
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,7103,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
  WRITE: bw=473KiB/s (485kB/s), 473KiB/s-473KiB/s (485kB/s-485kB/s), io=27.7MiB (29.1MB), run=60005-60005msec


Disk stats (read/write):
    dm-0: ios=0/35774, sectors=0/229984, merge=0/0, ticks=0/74379, in_queue=74379, util=93.69%, aggrios=0/28475, aggsectors=0/230432, aggrmerge=0/7366, aggrticks=0/56960, aggrin_queue=60836, aggrutil=93.35%
  sda: ios=0/28475, sectors=0/230432, merge=0/7366, ticks=0/56960, in_queue=60836, util=93.35%
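
All of the numbers above are from fio inside a VM; I still need to benchmark the pool directly from one of the nodes, which I understand would look something like this (pool name is again a placeholder):

Code:
# 60-second 4K write test directly against the pool, keeping the objects for a read pass
rados bench -p vm-pool 60 write -b 4096 -t 16 --no-cleanup
# random-read pass against the objects written above, then clean up
rados bench -p vm-pool 60 rand -t 16
rados -p vm-pool cleanup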

I realize I do not yet have the 4th node (and I can add a 5th/6th if it will help, but they would be older Dell R730s), but is this all the IOPS performance I can expect? I'm new to Ceph, so I am certain there are configuration settings to check and questions I should have answered before posting... my apologies.
 
Large file write, which I am totally fine with:

Code:
# dd if=/dev/urandom of=test.bin bs=1M count=8000
8000+0 records in
8000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 25.8055 s, 325 MB/s
 
