Proxmox VE Ceph Benchmark 2018/02

spirit

Famous Member
Do you think this is as expected (maybe 4k write could go faster?)? With 2000 more bucks we could do a 40 GBit mesh network. Do you think it's worth trying this?
From my last bench with fio on RBD, I'm able to reach around 150,000 IOPS of 4k randwrite; the limitation is the CPU of the Ceph nodes (3 nodes with 24 cores @ 3 GHz, all at 100% CPU).

For reads, I'm at around 700,000 IOPS of 4k randread, and CPU is the limit there too (the 3 Ceph nodes at 100% CPU, and the 2 client nodes, same CPU config, also at 100% CPU).
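For anyone wanting to reproduce that kind of test, a minimal sketch of a fio run using the rbd ioengine could look like the following (pool name, image name and client name are just placeholders for this example; the RBD image has to exist before the run):

Code:
# 4k random write against an existing RBD image, via fio's rbd ioengine
fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
    --rw=randwrite --bs=4k --iodepth=64 --numjobs=4 --group_reporting \
    --runtime=60 --time_based --direct=1 --name=rbd-4k-randwrite

# same idea for 4k random read
fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
    --rw=randread --bs=4k --iodepth=64 --numjobs=4 --group_reporting \
    --runtime=60 --time_based --direct=1 --name=rbd-4k-randread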
 

ozdjh

New Member
I just ran a comparison with the benchmark running on just 1 node, and then the benchmark running on all 4 nodes to simulate heavy workloads across the entire cluster. Not only did the average IOPS drop as you'd expect, but the average latency jumped due to queueing.

Code:
1 x bench over 10GbE

Max bandwidth (MB/sec): 1476
Min bandwidth (MB/sec): 1280
Average IOPS:           344
Stddev IOPS:            9.61719
Max IOPS:               369
Min IOPS:               320
Average Latency(s):     0.0463861
Code:
4 x bench over 10GbE

Max bandwidth (MB/sec): 1412
Min bandwidth (MB/sec): 412
Average IOPS:           132
Stddev IOPS:            38.3574
Max IOPS:               353
Min IOPS:               103
Average Latency(s):     0.120387
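For reference, the single-node numbers come from one rados bench instance; the 4-node run is just the same command started on all four nodes at roughly the same time. A rough sketch of kicking that off from one machine (the pool name, node names and the use of ssh as root are assumptions for the example):

Code:
# the benchmark itself, as run on a single node
rados -p ceph_1 bench 60 write -b 4M -t 16 --no-cleanup

# start the same bench on all four nodes at once and wait for them to finish
for n in node1 node2 node3 node4; do
    ssh root@$n "rados -p ceph_1 bench 60 write -b 4M -t 16 --no-cleanup" &
done
wait

# remove the benchmark objects afterwards
rados -p ceph_1 cleanup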
I hope that using 40G will provide each node with enough bandwidth so that we don't see contention on the fabric. That should let each node run heavily loaded without degrading storage performance. I should have the hardware tomorrow, so I'll post the results early next week.

David
...
 

ozdjh

New Member
I thought I'd share some further results as I think they're interesting and may be of use to someone else. These are the results of the benchmark running over a 40GbE switched network (OM3 fibre). This is the same equipment as in my post on 15 Oct, with the network moved from 10GbE to 40GbE, so you can see the direct comparison (4-node cluster, 4 x 2TB Intel P4510 NVMe drives per node).

We've added 2 x 40GbE ports per server and we're running them in an active/standby bond for the public network. The cluster network is still running over 10GbE. During the write tests the cluster network ran at between 5 and 8 Gbps, so it shouldn't have been impacting performance.

There is an improvement in both sets of numbers but, as Alwin mentioned, moving from 10GbE to 40GbE doesn't decrease latency, so you don't see anything like 4 times the performance. Below are the numbers from running a single instance of the benchmark.

Code:
# rados -p ceph_1 bench 60 write -b 4M -t 16

Total time run:         60.0269
Total writes made:      24970
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1663.92
Stddev Bandwidth:       49.3114
Max bandwidth (MB/sec): 1752
Min bandwidth (MB/sec): 1496
Average IOPS:           415
Stddev IOPS:            12.3279
Max IOPS:               438
Min IOPS:               374
Average Latency(s):     0.0384613
Stddev Latency(s):      0.0142218
Max latency(s):         0.248035
Min latency(s):         0.0150251


# rados -p ceph_1 bench 180 rand -t 16

Total time run:       180.049
Total reads made:     93945
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2087.1
Average IOPS:         521
Stddev IOPS:          15.8213
Max IOPS:             554
Min IOPS:             445
Average Latency(s):   0.0299043
Max latency(s):       0.272099
Min latency(s):       0.00322598

I was already pretty happy with the 10GbE numbers from a performance perspective. But running multiple instances of the benchmark at the same time (to simulate lots of VMs generating load in parallel) reduced the numbers dramatically. A single benchmark instance using NVMe drives can saturate the 10GbE link, so any additional load just decreased overall performance. Here are the results from running 3 random read benchmarks at the same time over the 40GbE fabric. The numbers are only about 15% lower than running just 1 benchmark process, so we see about 3 times the data volume at roughly the same data rate. The network port sat at around 35 Gbps during the test. That's what I was hoping to prove with this testing: moving to 40GbE won't increase raw performance, but it'll let you run a lot more load before you see any significant degradation in performance.

Code:
# rados -p ceph_1 bench 180 rand -t 16

Total time run:       180.053
Total reads made:     78676
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1747.84
Average IOPS:         436
Stddev IOPS:          17.8279
Max IOPS:             470
Min IOPS:             379
Average Latency(s):   0.0356483
Max latency(s):       0.67529
Min latency(s):       0.00471686
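The three parallel read benchmarks were simply three instances of the same command started together on the node; a quick sketch of doing that from one shell (assuming a prior write bench with --no-cleanup left objects in the pool to read back):

Code:
# launch three read benchmarks in parallel and wait for all of them
for i in 1 2 3; do
    rados -p ceph_1 bench 180 rand -t 16 > rand_bench_$i.log &
done
wait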

And below is a graph of the switch port utilisation. The peaks from left to right are from 1 x write bench, 2 x write bench, 1 x read bench, 2 x read bench, and finally 3 x read bench running simultaneously on the node.

[Attachment: Screen Shot 2019-10-23 at 1.23.42 pm.png (switch port utilisation during the benchmark runs)]



Thanks

David
...
 

Alwin

Proxmox Staff Member
as Alwin mentioned, moving from 10 to 40 doesn't decrease latency
Yes, it does. I was talking about 25 GbE to 40 GbE; there you most likely won't see a latency change.

Please also keep in mind that the cluster network is there for replication, so it is limiting Ceph on that end.
Put the cluster network onto the 40 GbE as well; this will lower latency further and increase throughput. To get the best out of the network, you can use the standby link of the bond for this. Since you can create a bond on top of a VLAN, you can use both links actively. In a disaster case, where one link is dead, the traffic of both networks would end up on the same link but would still work. Both switches would need to trunk all VLANs between each other.

As an example:
Code:
eth0.10 --> bond0 (primary) --> Ceph public
eth1.10 --> bond0

eth0.20 --> bond1
eth1.20 --> bond1 (primary) --> Ceph cluster
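On Proxmox VE with ifupdown2, a minimal /etc/network/interfaces sketch of that layout could look like the following (the NIC names, VLAN IDs and addresses are placeholders; adjust them to your environment):

Code:
auto eth0.10
iface eth0.10 inet manual

auto eth1.10
iface eth1.10 inet manual

# bond on top of VLAN 10, primary on eth0 --> Ceph public network
auto bond0
iface bond0 inet static
    address 10.10.10.11/24
    bond-slaves eth0.10 eth1.10
    bond-mode active-backup
    bond-primary eth0.10
    bond-miimon 100

auto eth0.20
iface eth0.20 inet manual

auto eth1.20
iface eth1.20 inet manual

# bond on top of VLAN 20, primary on eth1 --> Ceph cluster network
auto bond1
iface bond1 inet static
    address 10.10.20.11/24
    bond-slaves eth0.20 eth1.20
    bond-mode active-backup
    bond-primary eth1.20
    bond-miimon 100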
 
