Proxmox Ceph configuration

patrickc

New Member
Nov 15, 2024
I have the following cluster I have set up:

Three Dell R740s with 512GB RAM, each with two 2.6 GHz (3.9 GHz turbo) processors. Each has two Micron 9300 7.84TB NVMe drives.
One Threadripper Pro with 24 cores, 4.5 GHz boost, and 128GB RAM. It also has two Micron 9300 7.84TB NVMe drives.

They are currently connected over a shared 25GbE network (with no other traffic running on it at present), and I will move Ceph onto its own network once I get more DAC cables.
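
When I do move Ceph onto its own network, my rough plan per node looks like this (the interface name and subnet here are just placeholders for whatever I end up using), with cluster_network pointed at the new subnet in /etc/pve/ceph.conf:

Code:
# /etc/network/interfaces (ifupdown2) -- hypothetical interface name and subnet
auto enp65s0f1
iface enp65s0f1 inet static
        address 10.10.10.11/24
        mtu 9000

# and in /etc/pve/ceph.conf:
#   cluster_network = 10.10.10.0/24

# verify jumbo frames actually pass end to end (8972 = 9000 minus 28 bytes of IP/ICMP headers)
ping -M do -s 8972 10.10.10.12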

The goal of the Ceph cluster is good performance for running VMs, serving basic development/testing websites, and running MySQL inside the VMs. The data pool is currently set to 2 replicas because there are recent backups of everything and, this being mostly dev/testing right now, durability is not the most critical factor. I have created an OSD for every disk and added them to a fairly basic setup, with a metadata server and a monitor on each node. For testing, only two of the R740s and the Threadripper are live. I am seeing decent throughput writing a large file, ~325MB/s, which looks fairly good to me. This is running with the default MTU; I realize I will be setting it to 9000 for the dedicated Ceph network.
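
For reference, this is roughly how I checked and set the pool replication from one of the hosts (the pool name "vm-pool" is just a placeholder for whatever the RBD pool is actually called):

Code:
# current replication settings for the pool (placeholder pool name)
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size

# 2 replicas; note min_size 1 lets IO continue on a single copy, which is risky
ceph osd pool set vm-pool size 2
ceph osd pool set vm-pool min_size 1

# overall OSD layout and cluster health
ceph osd df tree
ceph -s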

However, random IO seems fairly low, and things like installing a VM, updating packages, or doing a large MySQL insert feel laggy and seem to "stutter". I ran a few benchmarks I found here:

Code:
fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.36
Starting 4 processes
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
Jobs: 1 (f=1): [_(2),r(1),_(1)][100.0%][r=3516KiB/s][r=879 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=2445: Tue Nov 19 11:44:43 2024
  read: IOPS=4625, BW=18.1MiB/s (18.9MB/s)(2048MiB/113347msec)
    slat (usec): min=304, max=8550, avg=802.97, stdev=274.21
    clat (usec): min=6, max=25200, avg=12228.31, stdev=1643.44
     lat (usec): min=661, max=26788, avg=13031.28, stdev=1725.81
    clat percentiles (usec):
     |  1.00th=[ 8979],  5.00th=[ 9896], 10.00th=[10290], 20.00th=[10945],
     | 30.00th=[11338], 40.00th=[11731], 50.00th=[12125], 60.00th=[12518],
     | 70.00th=[12911], 80.00th=[13304], 90.00th=[14091], 95.00th=[14746],
     | 99.00th=[17695], 99.50th=[19792], 99.90th=[22676], 99.95th=[23200],
     | 99.99th=[24249]
   bw (  KiB/s): min=15334, max=23264, per=100.00%, avg=19695.06, stdev=324.71, samples=852
   iops        : min= 3833, max= 5816, avg=4922.99, stdev=81.16, samples=852
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=6.11%, 20=93.40%, 50=0.48%
  cpu          : usr=1.98%, sys=9.30%, ctx=524352, majf=0, minf=175
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16


Run status group 0 (all jobs):
   READ: bw=18.1MiB/s (18.9MB/s), 18.1MiB/s-18.1MiB/s (18.9MB/s-18.9MB/s), io=2048MiB (2147MB), run=113347-113347msec


Disk stats (read/write):
    dm-0: ios=524239/286, sectors=4193912/3080, merge=0/0, ticks=385207/801, in_queue=386008, util=99.58%, aggrios=524288/158, aggsectors=4194304/3080, aggrmerge=0/128, aggrticks=391943/424, aggrin_queue=392393, aggrutil=74.70%
  sda: ios=524288/158, sectors=4194304/3080, merge=0/128, ticks=391943/424, in_queue=392393, util=74.70%
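
(I notice this read test ran with --direct=0, so the guest page cache may be serving part of it; if it helps I can re-run the same job with direct I/O, e.g.:)

Code:
fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=1 --size=512M --numjobs=4 --runtime=240 --group_reporting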


Write:

Code:
# fio --directory=/home/ansible --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=1G --numjobs=1
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=472KiB/s][w=118 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=2490: Tue Nov 19 12:04:55 2024
  write: IOPS=118, BW=473KiB/s (485kB/s)(27.7MiB/60005msec); 0 zone resets
    clat (usec): min=5368, max=23985, avg=8440.05, stdev=1175.61
     lat (usec): min=5369, max=23986, avg=8441.07, stdev=1175.64
    clat percentiles (usec):
     |  1.00th=[ 6325],  5.00th=[ 6915], 10.00th=[ 7177], 20.00th=[ 7504],
     | 30.00th=[ 7767], 40.00th=[ 8094], 50.00th=[ 8291], 60.00th=[ 8586],
     | 70.00th=[ 8979], 80.00th=[ 9372], 90.00th=[ 9896], 95.00th=[10159],
     | 99.00th=[11994], 99.50th=[13042], 99.90th=[17695], 99.95th=[20841],
     | 99.99th=[23987]
   bw (  KiB/s): min=  376, max=  576, per=100.00%, avg=474.47, stdev=50.16, samples=119
   iops        : min=   94, max=  144, avg=118.55, stdev=12.56, samples=119
  lat (msec)   : 10=92.97%, 20=6.95%, 50=0.07%
  cpu          : usr=0.31%, sys=1.84%, ctx=14209, majf=0, minf=15
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,7103,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
  WRITE: bw=473KiB/s (485kB/s), 473KiB/s-473KiB/s (485kB/s-485kB/s), io=27.7MiB (29.1MB), run=60005-60005msec


Disk stats (read/write):
    dm-0: ios=0/35774, sectors=0/229984, merge=0/0, ticks=0/74379, in_queue=74379, util=93.69%, aggrios=0/28475, aggsectors=0/230432, aggrmerge=0/7366, aggrticks=0/56960, aggrin_queue=60836, aggrutil=93.35%
  sda: ios=0/28475, sectors=0/230432, merge=0/7366, ticks=0/56960, in_queue=60836, util=93.35%
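
(For what it's worth, this write test looks latency-bound rather than throughput-bound: at iodepth=1 with sync=1, IOPS is just 1 / average latency, and 1 / 8.44 ms ≈ 118, which matches the run exactly. Each 4K sync write is waiting for Ceph to commit to both replicas over the network before the next one is issued.)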

I realize I do not yet have the 4th node (and I can add a 5th/6th if it will help, but they would be older Dell R730s), but is this all the IOPS performance I can expect? I'm new to Ceph, so I am certain there are configuration settings and questions I should have answered before posting... my apologies.
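
For reference, I'm also planning to benchmark the Ceph layer directly from one of the hosts with rados bench, to separate the cluster itself from the VM/filesystem stack (the pool name is a placeholder again; --no-cleanup keeps the test objects so the read pass has something to read):

Code:
# 4K writes against the pool itself, run from a Proxmox host
rados bench -p vm-pool 60 write -b 4096 -t 16 --no-cleanup

# random reads of the objects written above
rados bench -p vm-pool 60 rand -t 16

# remove the benchmark objects afterwards
rados -p vm-pool cleanup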
 
Large file write, which I am totally fine with:

Code:
# dd if=/dev/urandom of=test.bin bs=1M count=8000
8000+0 records in
8000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 25.8055 s, 325 MB/s
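
(I'm aware that dd reading from /dev/urandom can be CPU-limited, so the 325 MB/s may understate sequential throughput somewhat; a more direct sequential test would be something like:)

Code:
fio --name=seqwrite --ioengine=libaio --rw=write --bs=1M --iodepth=16 --direct=1 --size=8G --numjobs=1 --group_reporting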