Proxmox VE Ceph Benchmark 2023/12 - Fast SSDs and network speeds in a Proxmox VE Ceph Reef cluster

Does a benchmark exist with more than 3 nodes, for example in the area of 10 nodes?
Does the throughput scale accordingly for multi-client usage?
 
Here are some benchmarks for 100G LACP on a cluster with 7 nodes, 6 Kioxia DC NVMe OSDs per node, same bench as the one in the article. Sapphire Rapids Xeon Gold CPUs with 1 TB of RAM per node, no load (VMs) on any node.
Avg. bandwidth:
write 7239.68 MB/s
rand 7539.36 MB/s

No matter what I do, I can't seem to get much past 7.5 GB/s writes with a single client. If I launch 3 clients, I can get to ~3x4 GB/s, i.e. filling a single one of the dual 100 Gbps links. I need to launch all 7 at the same time to get a minimum of ~2 GB/s and an average of ~3 GB/s per client (168 Gbps across the cluster). I'm assuming that at that point I'm hitting a hard limit on my OSDs, as my average IOPS dropped from ~2000 to ~1000 with 4M writes; benchmarks with 3 nodes maintain 2000 IOPS @ 4M, same as a single node.
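As a sanity check on those figures (using only the numbers quoted above, no new measurements): rados bench reports bandwidth as IOPS times the 4M object size, so the single-client ceiling and the cluster-wide aggregate are consistent with each other:

```shell
# rados bench bandwidth = IOPS x object size, so ~1900 IOPS at 4 MB
# objects is roughly the 7.5 GB/s single-client ceiling, and 7 clients
# averaging ~3 GB/s each is 168 Gbit/s across the cluster.
awk 'BEGIN {
  printf "per-client: %.1f GB/s\n", 1900 * 4 / 1000
  printf "aggregate:  %d Gbit/s\n", 7 * 3 * 8
}'
```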

For reads, I can get to 4-6 GB/s on all nodes simultaneously. I'm assuming this is because reads from blocks stored on the node running the bench itself are going to be really fast (1/3 of the reads, if you're lucky). I also have 2 separate network switches with their backplanes connected over 400 Gbps, so again, if you're lucky and hit the "best" path you can get to ~300 Gbps over the 7 nodes. There is a lot of variability in that data, so that bench is probably not very useful. In most cases write is what really matters, as for most applications reads are significantly cached in memory.
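When running the bench from several clients at once, a quick way to get the cluster-wide number is to save each client's output to its own log and sum the bandwidth lines. A minimal sketch; the two sample logs are fabricated here purely so the snippet is self-contained, in practice you'd redirect each client's real bench output to bench-<host>.log:

```shell
# Sum the per-client "Bandwidth (MB/sec)" line from saved rados bench
# logs. Sample logs with made-up values, for illustration only:
printf 'Bandwidth (MB/sec): 3000\n' > bench-node1.log
printf 'Bandwidth (MB/sec): 2500\n' > bench-node2.log
awk -F': *' '/Bandwidth \(MB\/sec\)/ { sum += $2 }
             END { printf "Aggregate: %d MB/s\n", sum }' bench-node*.log
```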

CPU usage never went above 15% (48 cores), although I notice that for some reason neither Proxmox nor my Prometheus has data on the 100G network interfaces.

I believe that with the right hardware (more OSDs) you can probably fill more than 400G with Ceph with just a few nodes.
 
Hope someone can clarify this for me.

Proxmox VE 8.1.5 / Ceph Reef 18.2.1 cluster, with 4 Dell R630 hosts, 6 SSD drives each, Ceph network with dual 10GbE connectivity.

With rados bench I get what is expected: wirespeed performance for a single-host test:

rados bench -p ceph01 120 write -b 4M -t 16 --run-name `hostname` --no-cleanup

Total time run: 120.032
Total writes made: 35159
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1171.65
Stddev Bandwidth: 54.414
Max bandwidth (MB/sec): 1268
Min bandwidth (MB/sec): 1004
Average IOPS: 292
Stddev IOPS: 13.6035
Max IOPS: 317
Min IOPS: 251
Average Latency(s): 0.0546073
Stddev Latency(s): 0.019493
Max latency(s): 0.301827
Min latency(s): 0.0217319

rados bench -p ceph01 600 seq -t 16 --run-name `hostname`

Total time run: 88.336
Total reads made: 35159
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1592.06
Average IOPS: 398
Stddev IOPS: 28.5776
Max IOPS: 458
Min IOPS: 325
Average Latency(s): 0.0389745
Max latency(s): 0.450536
Min latency(s): 0.0121847

I think 1171 MB/s write, 1592 MB/s read is excellent, no complaints whatsoever!
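For context on why that write number is "wirespeed": a single 10GbE link tops out at 10 Gbit/s divided by 8, and the write bench lands right at that limit (the read bench goes higher, presumably because reads can be spread across both links and served partly from local OSDs):

```shell
# One 10GbE link carries at most 10 Gbit/s = 1250 MB/s, which is
# roughly the 1171 MB/s the write bench reports.
awk 'BEGIN { printf "%d MB/s\n", 10 * 1000 / 8 }'
```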

The odd thing is: if I do a performance test on a VM that has its disk on the Ceph pool of the cluster, I get the following results with fio tests:


fio --ioengine=psync --filename=/var/tmp/test_fio --size=5G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1

WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=75.5GiB (81.1GB), run=600017-600017msec

fio --ioengine=psync --filename=/var/tmp/test_fio --size=5G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=read --bs=4M --numjobs=1 --iodepth=1

READ: bw=368MiB/s (386MB/s), 368MiB/s-368MiB/s (386MB/s-386MB/s), io=216GiB (232GB), run=600009-600009msec

Using a bigger block size of 16M gives better results.

fio --ioengine=psync --filename=/var/tmp/test_fio --size=5G --time_based --name=fio --group_reporting --runtime=30 --direct=1 --sync=1 --rw=write --bs=16M --numjobs=1 --iodepth=1

WRITE: bw=317MiB/s (332MB/s), 317MiB/s-317MiB/s (332MB/s-332MB/s), io=9504MiB (9966MB), run=30002-30002msec

fio --ioengine=psync --filename=/var/tmp/test_fio --size=5G --time_based --name=fio --group_reporting --runtime=30 --direct=1 --sync=1 --rw=read --bs=16M --numjobs=1 --iodepth=1

READ: bw=845MiB/s (886MB/s), 845MiB/s-845MiB/s (886MB/s-886MB/s), io=24.8GiB (26.6GB), run=30005-30005msec

So: good read performance, especially with bigger block sizes, but why is the write performance so slow if the underlying Ceph can easily do over 1000 MB/s?
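One likely explanation, based on the fio flags used above rather than anything Ceph-specific: with --sync=1, --iodepth=1 and a single job, each write must fully commit (replicated across OSDs over the network) before the next one starts, so throughput is simply block size divided by per-operation latency. Plugging in the measured numbers from the runs above:

```shell
# With iodepth=1, bandwidth = block_size / per-op latency.
# 129 MiB/s at bs=4M implies ~31 ms per synchronous write;
# 317 MiB/s at bs=16M is ~50 ms per op, i.e. the fixed per-op cost
# is amortized over 4x the data, which is why bigger blocks (or more
# jobs / a higher iodepth) raise the bandwidth.
awk 'BEGIN {
  printf "4M:  %.0f ms/op\n", 4 / 129 * 1000
  printf "16M: %.0f ms/op\n", 16 / 317 * 1000
}'
```

By contrast, the rados bench runs above used -t 16, i.e. 16 writes in flight at once, so a single fio job at iodepth=1 was never going to match them; to compare like for like, try --numjobs=16, or an async engine such as libaio with a higher --iodepth.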

Test VM config:

agent: 1
boot: order=scsi0;ide2;net0
cores: 4
ide2: none,media=cdrom
memory: 2048
name: testvm
net0: virtio=B6:6B:4E:CA:91:16,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsi0: ceph01:vm-501-disk-0,iothread=1,size=32G
scsi1: ceph01:vm-501-disk-1,iothread=1,size=160G
scsihw: virtio-scsi-single
smbios1: uuid=807fc9e6-7b1d-4af1-a2c1-882b4a0c43b9
sockets: 1
tags:
vmgenid: 0f5e7ed6-e735-4f5b-9b80-8c2a71710a52
 
