Here is the cluster I have set up:
Three Dell R740s with 512 GB RAM, each with two 2.6 GHz (3.9 GHz turbo) processors. Each has two Micron 9300 7.84 TB NVMe drives.
One Threadripper Pro with 24 cores (4.5 GHz boost) and 128 GB RAM. It also has two Micron 9300 7.84 TB NVMe drives.
They are currently connected to a shared 25 GbE network, though with no other traffic presently running, and I will put Ceph on its own network once I get more DAC cables.
The goal of the Ceph cluster is good performance for running VMs, serving basic development/testing websites, and running MySQL inside the VMs. Data is currently set to 2 replicas: everything has recent backups, and since this is mostly dev/testing right now, durability is not the most critical factor. I have created OSDs for every disk and added them in a pretty basic setup, and I have created a metadata server and monitor on each node. For testing, only two of the R740s and the one Threadripper are live. I am seeing pretty decent throughput when writing a large file, ~325 MB/s, which I think looks fairly good. This is running with the default MTU; I realize I will be setting it to 9000 for the dedicated Ceph network.
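For the dedicated network, this is roughly what I have in mind (the interface name and subnet below are placeholders, not my actual values):
Code:
# Raise the MTU on the Ceph-facing interface (switch ports must match):
ip link set dev enp94s0f1 mtu 9000
# Point replication traffic at the dedicated subnet:
ceph config set global cluster_network 10.10.10.0/24
# Double-check the pool really is 2x replicated:
ceph osd pool get rbd size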
However, random IO seems fairly low, and things like installing a VM, updating packages, or doing a large MySQL insert feel laggy and seem to "stutter". I ran a few benchmarks I found here:
Code:
fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.36
Starting 4 processes
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
Jobs: 1 (f=1): [_(2),r(1),_(1)][100.0%][r=3516KiB/s][r=879 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=2445: Tue Nov 19 11:44:43 2024
read: IOPS=4625, BW=18.1MiB/s (18.9MB/s)(2048MiB/113347msec)
slat (usec): min=304, max=8550, avg=802.97, stdev=274.21
clat (usec): min=6, max=25200, avg=12228.31, stdev=1643.44
lat (usec): min=661, max=26788, avg=13031.28, stdev=1725.81
clat percentiles (usec):
| 1.00th=[ 8979], 5.00th=[ 9896], 10.00th=[10290], 20.00th=[10945],
| 30.00th=[11338], 40.00th=[11731], 50.00th=[12125], 60.00th=[12518],
| 70.00th=[12911], 80.00th=[13304], 90.00th=[14091], 95.00th=[14746],
| 99.00th=[17695], 99.50th=[19792], 99.90th=[22676], 99.95th=[23200],
| 99.99th=[24249]
bw ( KiB/s): min=15334, max=23264, per=100.00%, avg=19695.06, stdev=324.71, samples=852
iops : min= 3833, max= 5816, avg=4922.99, stdev=81.16, samples=852
lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=6.11%, 20=93.40%, 50=0.48%
cpu : usr=1.98%, sys=9.30%, ctx=524352, majf=0, minf=175
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=18.1MiB/s (18.9MB/s), 18.1MiB/s-18.1MiB/s (18.9MB/s-18.9MB/s), io=2048MiB (2147MB), run=113347-113347msec
Disk stats (read/write):
dm-0: ios=524239/286, sectors=4193912/3080, merge=0/0, ticks=385207/801, in_queue=386008, util=99.58%, aggrios=524288/158, aggsectors=4194304/3080, aggrmerge=0/128, aggrticks=391943/424, aggrin_queue=392393, aggrutil=74.70%
sda: ios=524288/158, sectors=4194304/3080, merge=0/128, ticks=391943/424, in_queue=392393, util=74.70%
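Since that run was inside a VM with --direct=0 (so the guest page cache was in play), I also want to benchmark the Ceph layer directly from one of the hosts; something like this, where the pool and image names are just examples:
Code:
# Write 4K objects and keep them so the rand pass has data to read:
rados bench -p rbd 30 write -b 4096 -t 16 --no-cleanup
rados bench -p rbd 30 rand -t 16
rados -p rbd cleanup
# Or drive an RBD image directly with fio's librbd engine
# (assumes an image created first with: rbd create rbd/fio-test --size 10G):
fio --name=rbdread --ioengine=rbd --pool=rbd --rbdname=fio-test \
    --rw=randread --bs=4k --iodepth=16 --numjobs=4 \
    --runtime=60 --time_based --group_reporting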
Write:
Code:
# fio --directory=/home/ansible --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=1G --numjobs=1
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=472KiB/s][w=118 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=2490: Tue Nov 19 12:04:55 2024
write: IOPS=118, BW=473KiB/s (485kB/s)(27.7MiB/60005msec); 0 zone resets
clat (usec): min=5368, max=23985, avg=8440.05, stdev=1175.61
lat (usec): min=5369, max=23986, avg=8441.07, stdev=1175.64
clat percentiles (usec):
| 1.00th=[ 6325], 5.00th=[ 6915], 10.00th=[ 7177], 20.00th=[ 7504],
| 30.00th=[ 7767], 40.00th=[ 8094], 50.00th=[ 8291], 60.00th=[ 8586],
| 70.00th=[ 8979], 80.00th=[ 9372], 90.00th=[ 9896], 95.00th=[10159],
| 99.00th=[11994], 99.50th=[13042], 99.90th=[17695], 99.95th=[20841],
| 99.99th=[23987]
bw ( KiB/s): min= 376, max= 576, per=100.00%, avg=474.47, stdev=50.16, samples=119
iops : min= 94, max= 144, avg=118.55, stdev=12.56, samples=119
lat (msec) : 10=92.97%, 20=6.95%, 50=0.07%
cpu : usr=0.31%, sys=1.84%, ctx=14209, majf=0, minf=15
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,7103,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=473KiB/s (485kB/s), 473KiB/s-473KiB/s (485kB/s-485kB/s), io=27.7MiB (29.1MB), run=60005-60005msec
Disk stats (read/write):
dm-0: ios=0/35774, sectors=0/229984, merge=0/0, ticks=0/74379, in_queue=74379, util=93.69%, aggrios=0/28475, aggsectors=0/230432, aggrmerge=0/7366, aggrticks=0/56960, aggrin_queue=60836, aggrutil=93.35%
sda: ios=0/28475, sectors=0/230432, merge=0/7366, ticks=0/56960, in_queue=60836, util=93.35%
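If I am reading this right, with --sync=1 and --iodepth=1 every write waits for a full replicated round trip before the next one starts, so the ~8.4 ms average latency caps it at roughly 1 / 0.0084 s ≈ 118 IOPS, which is exactly what I measured. To check whether it is latency-bound rather than throughput-bound, I could rerun the same workload with some parallelism, e.g.:
Code:
fio --directory=/home/ansible --name=parwrite --ioengine=libaio \
    --rw=randwrite --bs=4k --direct=1 --iodepth=16 --numjobs=4 \
    --size=1G --runtime=60 --time_based --group_reporting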
I realize I do not yet have the 4th node (and I can add a 5th and 6th if it will help, though they would be older Dell R730s), but is this all the IOPS performance I can expect? I'm new to Ceph, so I am certain there are configuration settings and questions I should have answered before posting... my apologies.