3-Server Ceph Cluster, 100 Gbit Backend / 10 Gbit Frontend

mcz

New Member
May 2, 2025
Hello,

My hyperconverged Proxmox cluster with Ceph (19.2.1) has 3 servers:
All have Threadripper Pro CPUs (Zen 3 / Zen 4, 16-32 cores) and 256 GB RAM, with initially 1 NVMe OSD (Kioxia CM7r) per server.
The frontend network has multiple redundant 10 Gbit NICs for VMs and clients.
The backend network is used only for Ceph: 100 Gbit DAC, directly attached without a switch, in broadcast mode.

I have tested the Ceph RBD pool with fio and get good read speeds, but writes are slow.
My backup of VMs to PBS (10 Gbit link, NVMe storage) takes a long time (the VMs are big: 1 TB, 3 TB). While it runs, Ceph shows a maximum read speed (in the Proxmox GUI) of about 900 MB/s, which I think is the limit of the 10 Gbit frontend connection of the VMs, and it makes access to the VMs slow. My PBS server shows a maximum transfer rate of 120 MB/s (about 1 Gbit), which is 10% of the physical speed of the PBS link. Everything is set to MTU 9000 and runs the newest firmware.

So I plan to buy some 100 Gbit cards and a MikroTik CRS520-4XS-16XQ-RW switch that can be connected via a 25 Gbit cable to the 10 Gbit Ubiquiti switch we currently use for the whole network.

My question is:
If the VMs were running in RAM and connected to a 100 Gbit frontend network through a 100 Gbit switch, they would communicate with each other at 100 Gbit speed, right?
And if I connect the PBS to the same switch with a 10 Gbit link I will get 10 Gbit transfer speed, but Ceph reads and writes would be better with a 100 Gbit frontend network, right?

Or is it better to invest first in 3 more OSDs (NVMe Kioxia CM7r) to get a total of 6 OSDs in the cluster and much better read and write speeds (the storage capacity would also grow)?

Should I first upgrade the frontend to 100 Gbit, or buy some more OSDs, to get faster Ceph reads and writes and faster backups to PBS?

Thanks
 
My backup of VMs to PBS (10 Gbit link, NVMe storage) takes a long time (the VMs are big: 1 TB, 3 TB). While it runs, Ceph shows a maximum read speed (in the Proxmox GUI) of about 900 MB/s, which I think is the limit of the 10 Gbit frontend connection of the VMs, and it makes access to the VMs slow. My PBS server shows a maximum transfer rate of 120 MB/s (about 1 Gbit), which is 10% of the physical speed of the PBS link. Everything is set to MTU 9000 and runs the newest firmware.
You're reading from the backend with 900 MB/s and only 120 MB/s are reaching the PBS? That sounds weird.
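One way to narrow that down could be a raw network test between a PVE node and the PBS, independent of Ceph and the backup job; the hostname pbs below is a placeholder:

# On the PBS server, start an iperf3 listener (assumed hostname: pbs):
iperf3 -s

# On a PVE node, measure throughput towards the PBS for 30 seconds:
iperf3 -c pbs -t 30

A healthy 10 Gbit path should report roughly 9.4 Gbit/s; a result near 1 Gbit/s would point at the network (bond hashing, a 1 Gbit hop, an MTU mismatch) rather than at Ceph or PBS.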

The frontend network has multiple redundant 10 Gbit NICs for VMs and clients.
LACP?

If the VMs were running in RAM and connected to a 100 Gbit frontend network through a 100 Gbit switch, they would communicate with each other at 100 Gbit speed, right?
If they are on different PVE nodes, that may be the case. Locally, they can communicate as fast as the whole stack is capable.

And if I connect the PBS to the same switch with a 10 Gbit link I will get 10 Gbit transfer speed, but Ceph reads and writes would be better with a 100 Gbit frontend network, right?
In my experience, in a three-node setup with local NVMe, I would recommend tweaking the read setting of Ceph so that you ALWAYS read from the local NVMe, which is faster than reading through Ceph from all the nodes that hold the data.
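A rough sketch of how this can look, assuming Ceph Octopus or newer and a replicated pool where every node holds a copy of the data (as in a 3-node cluster with size 3); please verify the option names against the current Ceph documentation before applying:

# Allow clients to read from non-primary OSD replicas (requires Octopus-era clients):
ceph osd set-require-min-compat-client octopus

# Prefer the "closest" replica for RBD reads instead of always reading from the primary OSD:
ceph config set client rbd_read_from_replica_policy localize

# For "closest" to mean the local node, each client needs a CRUSH location,
# set per node in /etc/ceph/ceph.conf (example for node0, adjust per host):
# [client]
# crush_location = host=node0

Writes are unaffected by this: they still go to the primary OSD and are replicated over the backend network as before.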
 
Thank you very much for the very nice idea of setting Ceph to use the local NVMe data before going to other nodes!
How can I configure this? Is there a tutorial that I can use?

The frontend NICs are in a bond (balance-alb).
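For reference, the negotiated mode and per-slave link speed of the bond can be checked like this (bond0 is an assumption, substitute your bond interface):

# Show bonding mode, MII status and the speed of each slave NIC:
cat /proc/net/bonding/bond0

Keep in mind that balance-alb, like LACP, does not aggregate a single TCP stream: one backup connection to PBS will still top out at one 10 Gbit NIC.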

These are the fio results on the Ceph RBD pool (1 OSD per node; the Kioxia CM7 hardware can do 2.7 million read IOPS and 300k write IOPS; all OSDs are encrypted). Are those numbers fine?


fio --rw=randread --name=IOPS-read --bs=4k --direct=1 --filesize=1G \
    --filename=/dev/RBD/CephPool/test --numjobs=1 --ioengine=libaio --iodepth=1 \
    --refill_buffers --group_reporting --runtime=60 --time_based
IOPS-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
^CJobs: 1 (f=1): [r(1)][13.3%][r=2259MiB/s][r=578k IOPS][eta 00m:52s]
fio: terminating on signal 2

IOPS-read: (groupid=0, jobs=1): err= 0: pid=2302862: Thu May 15 10:21:13 2025
read: IOPS=584k, BW=2280MiB/s (2391MB/s)(18.9GiB/8495msec)
slat (nsec): min=881, max=41913, avg=1139.85, stdev=191.08
clat (nsec): min=370, max=23806, avg=413.37, stdev=99.95
lat (nsec): min=1272, max=42464, avg=1553.22, stdev=220.54
clat percentiles (nsec):
| 1.00th=[ 390], 5.00th=[ 390], 10.00th=[ 402], 20.00th=[ 402],
| 30.00th=[ 402], 40.00th=[ 410], 50.00th=[ 410], 60.00th=[ 410],
| 70.00th=[ 422], 80.00th=[ 422], 90.00th=[ 430], 95.00th=[ 430],
| 99.00th=[ 442], 99.50th=[ 462], 99.90th=[ 812], 99.95th=[ 852],
| 99.99th=[ 4832]
bw ( MiB/s): min= 2254, max= 2310, per=100.00%, avg=2282.32, stdev=16.06, samples=16
iops : min=577060, max=591412, avg=584274.00, stdev=4110.21, samples=16
lat (nsec) : 500=99.65%, 750=0.20%, 1000=0.10%
lat (usec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
cpu : usr=21.34%, sys=78.63%, ctx=80, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4957858,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=2280MiB/s (2391MB/s), 2280MiB/s-2280MiB/s (2391MB/s-2391MB/s), io=18.9GiB (20.3GB), run=8495-8495msec
root@node0:~# fio --rw=randwrite --name=IOPS-write --bs=4k --direct=1 --filesize=1G \
    --filename=/dev/RBD/CephPool/test --numjobs=1 --ioengine=libaio --iodepth=1 \
    --refill_buffers --group_reporting --runtime=60 --time_based
IOPS-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
^CJobs: 1 (f=1): [w(1)][15.0%][w=1634MiB/s][w=418k IOPS][eta 00m:51s]
fio: terminating on signal 2

IOPS-write: (groupid=0, jobs=1): err= 0: pid=2303232: Thu May 15 10:21:32 2025
write: IOPS=417k, BW=1628MiB/s (1707MB/s)(14.4GiB/9061msec); 0 zone resets
slat (nsec): min=862, max=34762, avg=1256.55, stdev=206.35
clat (nsec): min=360, max=38228, avg=402.51, stdev=105.76
lat (nsec): min=1252, max=39480, avg=1659.06, stdev=239.87
clat percentiles (nsec):
| 1.00th=[ 370], 5.00th=[ 370], 10.00th=[ 382], 20.00th=[ 382],
| 30.00th=[ 382], 40.00th=[ 390], 50.00th=[ 390], 60.00th=[ 402],
| 70.00th=[ 422], 80.00th=[ 430], 90.00th=[ 430], 95.00th=[ 442],
| 99.00th=[ 450], 99.50th=[ 462], 99.90th=[ 772], 99.95th=[ 860],
| 99.99th=[ 4640]
bw ( MiB/s): min= 1569, max= 1658, per=100.00%, avg=1628.22, stdev=19.90, samples=18
iops : min=401822, max=424680, avg=416825.00, stdev=5094.48, samples=18
lat (nsec) : 500=99.74%, 750=0.14%, 1000=0.08%
lat (usec) : 2=0.01%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01%
cpu : usr=39.54%, sys=60.45%, ctx=65, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,3776096,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=1628MiB/s (1707MB/s), 1628MiB/s-1628MiB/s (1707MB/s-1707MB/s), io=14.4GiB (15.5GB), run=9061-9061msec
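As a cross-check at the pool level, independent of the RBD device path and any caching, a rados bench run can be compared against the fio numbers; the pool name CephPool is taken from the fio command above and may need adjusting:

# 30 seconds of 4 MiB object writes, keeping the objects for the read test:
rados bench -p CephPool 30 write --no-cleanup

# Sequential reads of the objects written above:
rados bench -p CephPool 30 seq

# Remove the benchmark objects afterwards:
rados -p CephPool cleanup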
 
How can I configure this? Is there a tutorial that I can use?
Try to establish a baseline in your VM with a benchmark, change it and then benchmark again.
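A possible baseline run inside the VM could look like this (file path, size and job count are just examples); run it once before and once after changing the read policy and compare the results:

# Random 4k reads with a realistic queue depth, bypassing the guest page cache:
fio --name=baseline-randread --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
    --direct=1 --ioengine=libaio --size=4G --filename=/root/fio-testfile \
    --runtime=60 --time_based --group_reporting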

 