Proxmox VE Ceph Benchmark 2018/02


Oct 8, 2019
I just ran a comparison with the benchmark running on just 1 node, and then the benchmark running on all 4 nodes to simulate heavy workloads across the entire cluster. Not only did the average IOPS drop as you'd expect, but the average latency jumped due to queueing.

1 x bench over 10GbE

Max bandwidth (MB/sec): 1476
Min bandwidth (MB/sec): 1280
Average IOPS:           344
Stddev IOPS:            9.61719
Max IOPS:               369
Min IOPS:               320
Average Latency(s):     0.0463861
4 x bench over 10GbE

Max bandwidth (MB/sec): 1412
Min bandwidth (MB/sec): 412
Average IOPS:           132
Stddev IOPS:            38.3574
Max IOPS:               353
Min IOPS:               103
Average Latency(s):     0.120387
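As a sanity check on these figures, here is a short Python sketch (assuming rados bench's 4 MiB default object size and its default queue depth of 16 concurrent operations): the max/min bandwidth numbers are just the IOPS numbers times 4, and Little's law shows both runs keeping roughly 16 writes in flight, which is exactly why IOPS and latency trade off against each other under load.

```python
# Cross-check the rados bench numbers above.
# Objects are 4 MiB, so bandwidth (MB/s) = IOPS * 4.
assert 1476 // 4 == 369 and 1280 // 4 == 320  # 1 x bench: max/min
assert 1412 // 4 == 353 and 412 // 4 == 103   # 4 x bench: max/min

def in_flight(avg_iops, avg_latency_s):
    """Little's law: in-flight operations = throughput * latency."""
    return avg_iops * avg_latency_s

# Both runs keep ~16 ops queued (the default -t 16), so when latency
# triples under contention, IOPS must fall by the same factor.
print(round(in_flight(344, 0.0463861)))  # 1 x bench -> 16
print(round(in_flight(132, 0.120387)))   # 4 x bench -> 16
```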
I hope that using 40G will give each node enough bandwidth that we don't see contention on the fabric. That should let each node run heavily loaded without degrading storage performance. I should have the hardware tomorrow, so I'll post the results early next week.



Oct 8, 2019
I thought I'd share some further results as I think they're interesting and they may be of use to someone else. These are the results of the benchmark running over a 40GbE switched network (OM3 fibre). This is the same equipment as my post on 15 Oct with the network moved from 10GbE to 40GbE so you can see the direct comparison (4 node cluster, 4 x 2TB Intel P4510 NVMe drives per node).

We've added 2 x 40GbE ports per server and we're running them in an active/standby bond for the public network. The cluster network is still running over 10GbE. During the write tests the cluster network ran at between 5 and 8 Gbps, so it shouldn't be impacting performance.

There is an improvement in both sets of numbers, but as Alwin mentioned, moving from 10 to 40 doesn't decrease latency so you don't see anything like 4 times the performance. Below are the numbers from running a single instance of the benchmark.

# rados -p ceph_1 bench 60 write -b 4M -t 16

Total time run:         60.0269
Total writes made:      24970
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1663.92
Stddev Bandwidth:       49.3114
Max bandwidth (MB/sec): 1752
Min bandwidth (MB/sec): 1496
Average IOPS:           415
Stddev IOPS:            12.3279
Max IOPS:               438
Min IOPS:               374
Average Latency(s):     0.0384613
Stddev Latency(s):      0.0142218
Max latency(s):         0.248035
Min latency(s):         0.0150251

# rados -p ceph_1 bench 180 rand -t 16

Total time run:       180.049
Total reads made:     93945
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2087.1
Average IOPS:         521
Stddev IOPS:          15.8213
Max IOPS:             554
Min IOPS:             445
Average Latency(s):   0.0299043
Max latency(s):       0.272099
Min latency(s):       0.00322598

I was already pretty happy with the 10GbE numbers from a performance perspective, but running multiple instances of the benchmark at the same time (to simulate lots of VMs generating load in parallel) reduced the numbers dramatically. A single benchmark instance using NVMe drives can saturate the 10GbE link, so any additional load just decreased overall performance.

Here are the results from running 3 random read benchmarks at the same time over the 40GbE fabric. The per-instance numbers are only about 15% lower than running just 1 benchmark process, so we see roughly 3 times the data volume at close to the same per-instance data rate. The network port sat at around 35 Gbps during the test. That's what I was hoping to prove with this testing: moving to 40GbE won't increase raw performance, but it'll let you run a lot more load before you see any significant degradation.
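To put numbers on that claim, a quick Python check using the bandwidth figures from the bench outputs:

```python
# Compare the single-instance read bench with the 3-way concurrent run.
single_mb_s = 2087.1         # 1 x read bench over 40GbE
per_instance_mb_s = 1747.84  # each of 3 concurrent read benches

drop = 1 - per_instance_mb_s / single_mb_s
aggregate_gbit_s = 3 * per_instance_mb_s * 8 / 1000  # MB/s -> Gbit/s

print(f"per-instance drop: {drop:.1%}")             # 16.3%
print(f"aggregate: {aggregate_gbit_s:.1f} Gbit/s")  # 41.9 Gbit/s
```

The aggregate offered load works out to roughly 42 Gbit/s, a little above the ~35 Gbps seen on the switch port; reads served by OSDs local to the benchmarking node presumably never cross the wire, which would account for part of the gap.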

# rados -p ceph_1 bench 180 rand -t 16

Total time run:       180.053
Total reads made:     78676
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1747.84
Average IOPS:         436
Stddev IOPS:          17.8279
Max IOPS:             470
Min IOPS:             379
Average Latency(s):   0.0356483
Max latency(s):       0.67529
Min latency(s):       0.00471686

And below is a graph of the switch port utilisation. The peaks from left to right are from 1 x write bench, 2 x write bench, 1 x read bench, 2 x read bench, and finally 3 x read bench running simultaneously on the node.

[Attachment: Screen Shot 2019-10-23 at 1.23.42 pm.png — switch port utilisation graph]




Proxmox Staff Member
Aug 1, 2017
"as Alwin mentioned, moving from 10 to 40 doesn't decrease latency"
Yes, it does. I was talking about 25 to 40 GbE; there you most likely won't see a latency change.

Please also keep in mind that the cluster network carries replication traffic, so a slow cluster network limits Ceph on that end.
Put the cluster network onto the 40 GbE as well; this will lower latency again and increase throughput. To get the best out of the network, you can use the standby link of the bond for this. Since you can create a bond on top of a VLAN, you can use both links actively. In a disaster case where one link is dead, the traffic would be put onto the remaining link but would still work. Both switches would need to trunk all VLANs between each other.

As an example:
eth0.10 --> bond0 (primary) --> Ceph public
eth1.10 --> bond0

eth0.20 --> bond1
eth1.20 --> bond1 (primary) --> Ceph cluster
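In interfaces(5) terms, the layout above might look something like the sketch below. This is an illustration only: interface names, VLAN IDs, and addresses are placeholders to adapt to your environment.

```
# /etc/network/interfaces sketch (ifupdown syntax, placeholder values)

auto bond0
iface bond0 inet static
    address 192.0.2.11/24            # Ceph public network
    bond-slaves eth0.10 eth1.10
    bond-mode active-backup
    bond-primary eth0.10             # public traffic prefers link 0

auto bond1
iface bond1 inet static
    address 198.51.100.11/24         # Ceph cluster network
    bond-slaves eth0.20 eth1.20
    bond-mode active-backup
    bond-primary eth1.20             # cluster traffic prefers link 1
```

Each ethX.Y VLAN interface would also need a matching `iface ethX.Y inet manual` stanza so ifupdown brings it up before the bond.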
Oct 17, 2019
Just stopping in to share something that might be useful.

I rolled the dice on 16 x 2TB Kingston DC500 SSDs for our new cluster at work. They claim to have full data path protection with capacitors on the board and were the lowest price of any drive in this class. No expectation of high-end performance here (not needed for this application); just hoping these would be able to handle being their own DB/journal device in a Ceph environment and give decent performance.

I disabled the write cache and ran what I believe is a valid 4K sync write performance test:
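The exact fio invocation wasn't posted; a sync-write test matching the reported parameters (psync engine, 4K blocks, iodepth 1, one job, 60 s run against /dev/sdb) would look something like the following. The flags here are an assumption, and note the command is destructive to whatever is on the target device.

```
# Hypothetical reconstruction -- the actual command wasn't shown.
# Write cache disabled beforehand, e.g.: hdparm -W 0 /dev/sdb
fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
```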


journal-test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=136MiB/s][w=34.7k IOPS][eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=27035: Mon Feb 17 16:22:56 2020
write: IOPS=34.7k, BW=135MiB/s (142MB/s)(8128MiB/60001msec); 0 zone resets
clat (usec): min=27, max=420, avg=28.44, stdev= 1.94
lat (usec): min=27, max=420, avg=28.50, stdev= 1.94
clat percentiles (nsec):
| 1.00th=[27776], 5.00th=[27776], 10.00th=[28032], 20.00th=[28032],
| 30.00th=[28032], 40.00th=[28032], 50.00th=[28288], 60.00th=[28288],
| 70.00th=[28288], 80.00th=[28544], 90.00th=[28800], 95.00th=[30080],
| 99.00th=[31104], 99.50th=[36608], 99.90th=[41216], 99.95th=[79360],
| 99.99th=[97792]
bw ( KiB/s): min=134104, max=139896, per=100.00%, avg=138709.74, stdev=744.84, samples=119
iops : min=33526, max=34974, avg=34677.46, stdev=186.27, samples=119
lat (usec) : 50=99.92%, 100=0.08%, 250=0.01%, 500=0.01%
cpu : usr=2.26%, sys=9.30%, ctx=2080674, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2080669,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=135MiB/s (142MB/s), 135MiB/s-135MiB/s (142MB/s-142MB/s), io=8128MiB (8522MB), run=60001-60001msec

Disk stats (read/write):
sdb: ios=58/2076904, merge=0/0, ticks=11/54915, in_queue=0, util=99.91%


If I understand the results correctly, that looks like 142MB/s and ~35K IOPS.

I re-enabled the write cache, and ran the test again... performance actually dropped to ~57MB/s. Does that make sense?

Either way, it's orders of magnitude better than my homelab cluster full of consumer drives (kilobytes per second on sync write performance). I think these drives may prove to be a decent value after all (they were ~$300 each).

