3-Server Ceph Cluster, 100 Gbit Backend / 10 Gbit Frontend

mcz

New Member
May 2, 2025
Hello,

My hyperconverged Proxmox cluster with Ceph (19.2.1) has 3 servers:
All have Threadripper Pro CPUs (Zen 3 / Zen 4, 16-32 cores) and 256 GB RAM, with initially 1 NVMe OSD (Kioxia CM7r) per server.
The frontend network has multiple redundant 10 Gbit NICs for VMs and clients.
The backend network is used only for Ceph: 100 Gbit DAC, directly attached without a switch, in broadcast mode.

I have tested the Ceph RBD pool with fio and get good read speeds, but writes are slow.
My backup of VMs to PBS (10 Gbit link, NVMe storage) takes a long time (the VMs are big: 1 TB, 3 TB). While it runs, Ceph shows a maximum read speed (in the Proxmox GUI) of about 900 MB/s, which I think is the limit of the 10 Gbit frontend connection of the VMs, and it makes access to the VMs slow. My PBS server shows a maximum transfer rate of 120 MB/s (about 1 Gbit), which is 10% of the physical speed of the PBS link. Everything is set to MTU 9000 and runs the newest firmware.

So I plan to buy some 100 Gbit cards and a MikroTik CRS520-4XS-16XQ-RW switch that can be connected via a 25 Gbit cable to the 10 Gbit Ubiquiti switch we currently use for the whole network.

My question is:
If the VMs were running in RAM and connected to a 100 Gbit frontend network through a 100 Gbit switch, they would communicate with each other at 100 Gbit speed, right?
And if I connect the PBS to the same switch with a 10 Gbit link I will get 10 Gbit transfer speed, but Ceph reads and writes would be better with a 100 Gbit frontend network, right?

Or is it better to invest first in 3 more OSDs (NVMe Kioxia CM7r) to get a total of 6 OSDs in the cluster and much better read and write speeds (the storage capacity would also grow)?

Should I first upgrade the frontend to 100 Gbit, or buy some more OSDs, to get faster Ceph reads and writes and faster backups to PBS?

Thanks
 
My backup of VMs to PBS (10 Gbit link, NVMe storage) takes a long time (the VMs are big: 1 TB, 3 TB). While it runs, Ceph shows a maximum read speed (in the Proxmox GUI) of about 900 MB/s, which I think is the limit of the 10 Gbit frontend connection of the VMs, and it makes access to the VMs slow. My PBS server shows a maximum transfer rate of 120 MB/s (about 1 Gbit), which is 10% of the physical speed of the PBS link. Everything is set to MTU 9000 and runs the newest firmware.
You're reading from the backend with 900 MB/s and only 120 MB/s are reaching the PBS? That sounds weird.
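One way to narrow that down could be a raw network test between a PVE node and the PBS, independent of Ceph and the backup job; the hostname pbs below is a placeholder:

# On the PBS server, start an iperf3 listener (assumed hostname: pbs):
iperf3 -s

# On a PVE node, measure throughput towards the PBS for 30 seconds:
iperf3 -c pbs -t 30

A healthy 10 Gbit path should report roughly 9.4 Gbit/s; a result near 1 Gbit/s would point at the network (bond hashing, a 1 Gbit hop, an MTU mismatch) rather than at Ceph or PBS.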

The frontend network has multiple redundant 10 Gbit NICs for VMs and clients.
LACP?

If the VMs were running in RAM and connected to a 100 Gbit frontend network through a 100 Gbit switch, they would communicate with each other at 100 Gbit speed, right?
If they are on different PVE nodes, that may be the case. Locally, they can communicate as fast as the whole stack is capable.

And if I connect the PBS to the same switch with a 10 Gbit link I will get 10 Gbit transfer speed, but Ceph reads and writes would be better with a 100 Gbit frontend network, right?
In my experience, in a three-node setup with local NVMe, I would recommend tweaking the read setting of Ceph so that you ALWAYS read from the local NVMe, which is faster than reading through Ceph from all the nodes that hold the data.
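A rough sketch of how this can look, assuming Ceph Octopus or newer and a replicated pool where every node holds a copy of the data (as in a 3-node cluster with size 3); please verify the option names against the current Ceph documentation before applying:

# Allow clients to read from non-primary OSD replicas (requires Octopus-era clients):
ceph osd set-require-min-compat-client octopus

# Prefer the "closest" replica for RBD reads instead of always reading from the primary OSD:
ceph config set client rbd_read_from_replica_policy localize

# For "closest" to mean the local node, each client needs a CRUSH location,
# set per node in /etc/ceph/ceph.conf (example for node0, adjust per host):
# [client]
# crush_location = host=node0

Writes are unaffected by this: they still go to the primary OSD and are replicated over the backend network as before.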
 
Thank you very much for the very nice idea of setting Ceph to use the local NVMe data before going to other nodes!
How can I configure this? Is there a tutorial that I can use?

The frontend NICs are in a bond (balance-alb).
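For reference, the negotiated mode and per-slave link speed of the bond can be checked like this (bond0 is an assumption, substitute your bond interface):

# Show bonding mode, MII status and the speed of each slave NIC:
cat /proc/net/bonding/bond0

Keep in mind that balance-alb, like LACP, does not aggregate a single TCP stream: one backup connection to PBS will still top out at one 10 Gbit NIC.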

These are the fio results on the Ceph RBD pool (1 OSD per node; the Kioxia CM7 hardware can do 2.7 million read IOPS and 300k write IOPS; all OSDs are encrypted). Are those numbers fine?


fio --rw=randread --name=IOPS-read --bs=4k --direct=1 --filesize=1G \
    --filename=/dev/RBD/CephPool/test --numjobs=1 --ioengine=libaio --iodepth=1 \
    --refill_buffers --group_reporting --runtime=60 --time_based
IOPS-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
^CJobs: 1 (f=1): [r(1)][13.3%][r=2259MiB/s][r=578k IOPS][eta 00m:52s]
fio: terminating on signal 2

IOPS-read: (groupid=0, jobs=1): err= 0: pid=2302862: Thu May 15 10:21:13 2025
read: IOPS=584k, BW=2280MiB/s (2391MB/s)(18.9GiB/8495msec)
slat (nsec): min=881, max=41913, avg=1139.85, stdev=191.08
clat (nsec): min=370, max=23806, avg=413.37, stdev=99.95
lat (nsec): min=1272, max=42464, avg=1553.22, stdev=220.54
clat percentiles (nsec):
| 1.00th=[ 390], 5.00th=[ 390], 10.00th=[ 402], 20.00th=[ 402],
| 30.00th=[ 402], 40.00th=[ 410], 50.00th=[ 410], 60.00th=[ 410],
| 70.00th=[ 422], 80.00th=[ 422], 90.00th=[ 430], 95.00th=[ 430],
| 99.00th=[ 442], 99.50th=[ 462], 99.90th=[ 812], 99.95th=[ 852],
| 99.99th=[ 4832]
bw ( MiB/s): min= 2254, max= 2310, per=100.00%, avg=2282.32, stdev=16.06, samples=16
iops : min=577060, max=591412, avg=584274.00, stdev=4110.21, samples=16
lat (nsec) : 500=99.65%, 750=0.20%, 1000=0.10%
lat (usec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
cpu : usr=21.34%, sys=78.63%, ctx=80, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4957858,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=2280MiB/s (2391MB/s), 2280MiB/s-2280MiB/s (2391MB/s-2391MB/s), io=18.9GiB (20.3GB), run=8495-8495msec
root@node0:~# fio --rw=randwrite --name=IOPS-write --bs=4k --direct=1 --filesize=1G \
    --filename=/dev/RBD/CephPool/test --numjobs=1 --ioengine=libaio --iodepth=1 \
    --refill_buffers --group_reporting --runtime=60 --time_based
IOPS-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
^CJobs: 1 (f=1): [w(1)][15.0%][w=1634MiB/s][w=418k IOPS][eta 00m:51s]
fio: terminating on signal 2

IOPS-write: (groupid=0, jobs=1): err= 0: pid=2303232: Thu May 15 10:21:32 2025
write: IOPS=417k, BW=1628MiB/s (1707MB/s)(14.4GiB/9061msec); 0 zone resets
slat (nsec): min=862, max=34762, avg=1256.55, stdev=206.35
clat (nsec): min=360, max=38228, avg=402.51, stdev=105.76
lat (nsec): min=1252, max=39480, avg=1659.06, stdev=239.87
clat percentiles (nsec):
| 1.00th=[ 370], 5.00th=[ 370], 10.00th=[ 382], 20.00th=[ 382],
| 30.00th=[ 382], 40.00th=[ 390], 50.00th=[ 390], 60.00th=[ 402],
| 70.00th=[ 422], 80.00th=[ 430], 90.00th=[ 430], 95.00th=[ 442],
| 99.00th=[ 450], 99.50th=[ 462], 99.90th=[ 772], 99.95th=[ 860],
| 99.99th=[ 4640]
bw ( MiB/s): min= 1569, max= 1658, per=100.00%, avg=1628.22, stdev=19.90, samples=18
iops : min=401822, max=424680, avg=416825.00, stdev=5094.48, samples=18
lat (nsec) : 500=99.74%, 750=0.14%, 1000=0.08%
lat (usec) : 2=0.01%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01%
cpu : usr=39.54%, sys=60.45%, ctx=65, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,3776096,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=1628MiB/s (1707MB/s), 1628MiB/s-1628MiB/s (1707MB/s-1707MB/s), io=14.4GiB (15.5GB), run=9061-9061msec
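As a cross-check at the pool level, independent of the RBD device path and any caching, a rados bench run can be compared against the fio numbers; the pool name CephPool is taken from the fio command above and may need adjusting:

# 30 seconds of 4 MiB object writes, keeping the objects for the read test:
rados bench -p CephPool 30 write --no-cleanup

# Sequential reads of the objects written above:
rados bench -p CephPool 30 seq

# Remove the benchmark objects afterwards:
rados -p CephPool cleanup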
 
How can I configure this? Is there a tutorial that I can use?
Try to establish a baseline in your VM with a benchmark, change it and then benchmark again.
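A possible baseline run inside the VM could look like this (file path, size and job count are just examples); run it once before and once after changing the read policy and compare the results:

# Random 4k reads with a realistic queue depth, bypassing the guest page cache:
fio --name=baseline-randread --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
    --direct=1 --ioengine=libaio --size=4G --filename=/root/fio-testfile \
    --runtime=60 --time_based --group_reporting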

 