Proxmox VE Ceph Benchmark 2018/02

Does that mean that the most IOPS you can attain on any 3-replica Ceph installation on 10GbE with these Samsung SSDs is at best 6500-9000, due to the latencies you just described?

Sorry for insisting on the same stuff, I am just trying to make sense of the results.
With our hardware, it's a yes. But you can certainly use different hardware or, depending on your hardware, different settings: e.g. fibre instead of copper cabling for Ethernet, NVMe instead of SSD, or network tuning. We are using DAC cables for our 100 GbE.
 
I hope you can share some benchmarks with us.

Working on it. As a point of order, the parent Ceph benchmark document describes the test methodology as "fio --ioengine=libaio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio --output-format=terse,json,normal --output=fio.log --bandwidth-log", but the results for this test are nowhere in the document (not that they would be of much use in this context, since it writes directly to the disk instead of through the file system).

stay tuned.
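In the meantime, for anyone who wants to benchmark through the file system instead, a file-based variant of the quoted fio test could look like the following; the mount point and file size here are illustrative assumptions, not from the paper.
Code:
fio --ioengine=libaio --filename=/mnt/test/fio.bin --size=4G --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio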
 
Testbed: 3 nodes, each consisting of:

CPU: 2x Intel Xeon E5-2673 v4
RAM: 8x32GB, DDR4, 2400MHz
NIC: ConnectX-4, dual port, 100GbE operating mode
OSD: 12x Hynix HFS960GD0MEE-5410A, FW 40033A00

Comments: 4M writes are not really a useful indicator for a hypervisor workload, but I understand people want to see the ooh-ahh MB/s numbers. So, without further ado:

rados bench -p rbd 60 write -b 4M -t 16 --no-cleanup
Code:
Total time run:         60.026194
Total writes made:      40743
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2715.01
Stddev Bandwidth:       58.9678
Max bandwidth (MB/sec): 2832
Min bandwidth (MB/sec): 2512
Average IOPS:           678
Stddev IOPS:            14
Max IOPS:               708
Min IOPS:               628
Average Latency(s):     0.023569
Stddev Latency(s):      0.00984879
Max latency(s):         0.242294
Min latency(s):         0.0117974

rados bench -p rbd -t 16 60 seq
Code:
Total time run:       34.092660
Total reads made:     40495
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   4751.17
Average IOPS:         1187
Stddev IOPS:          24
Max IOPS:             1224
Min IOPS:             1100
Average Latency(s):   0.0127535
Max latency(s):       0.230566
Min latency(s):       0.00275464

rados bench -p rbd -t 16 60 rand
Code:
Total time run:       60.015327
Total reads made:     76553
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   5102.23
Average IOPS:         1275
Stddev IOPS:          33
Max IOPS:             1334
Min IOPS:             1191
Average Latency(s):   0.0118615
Max latency(s):       0.27716
Min latency(s):       0.00185301
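As a footnote to the 4M comment above: a small-block variant that maps better to a hypervisor workload would be something like the following (just the obvious block-size swap of the same command; not run here).
Code:
rados bench -p rbd 60 write -b 4096 -t 16 --no-cleanup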
 
Working on it. As a point of order, the parent Ceph benchmark document describes the test methodology as "fio --ioengine=libaio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio --output-format=terse,json,normal --output=fio.log --bandwidth-log", but the results for this test are nowhere in the document (not that they would be of much use in this context, since it writes directly to the disk instead of through the file system).

stay tuned.
In the benchmark document, the chart above the fio command shows a subset of the results of the test.
 
OSD: 12x Hynix HFS960GD0MEE-5410A, FW 40033A00
What does the direct fio output look like (fio test taken from the benchmark paper)? This would give us a comparison of the NVMe drives, especially on latency.

And would you mind sharing a rados bench without DPDK?
Code:
Total time run:         60.028328
Total writes made:      25675
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1710.86
Stddev Bandwidth:       48.2301
Max bandwidth (MB/sec): 1768
Min bandwidth (MB/sec): 1492
Average IOPS:           427
Stddev IOPS:            12
Max IOPS:               442
Min IOPS:               373
Average Latency(s):     0.0374052
Stddev Latency(s):      0.00804253
Max latency(s):         0.104199
Min latency(s):         0.0119991
This is not really a comparable test, as we have only 3x Intel P3700 800GB (no DPDK), run with 'rados bench -p rbd 60 write -b 4M -t 16 --no-cleanup'. But I think the Hynix has a higher latency itself, and that's why the rados latency is close to ours. That's why I am interested in your results, both without DPDK and of the NVMe itself.
 
Hi,
because the SSD is such an essential part and, on the other hand, should be cost-efficient (well, at least for me), I did some benchmarking on several consumer SSDs.
All tests have been made with fio on the same system, with the write cache disabled.
Maybe it can be useful for somebody else too.
Command:
Code:
fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

Results:

Code:
SSD                             BW           IOPS
Transcend SSD370S 32GB           2143 KB/s     535
Samsung 750 EVO 500GB            2071 KB/s     517
Kingston SV300 240GB            93987 KB/s   23496
Sandisk SDSSDH3250G 250GB        8925 KB/s    2231
Toshiba TL100 240GB              3513 KB/s     878
Sandisk SDSSDA240G 240GB         6957 KB/s    1739
Transcend SSD320 256GB           5524 KB/s    1381
Intensio 240GB                   3445 KB/s     861
Teamgroup L5 240GB               5034 KB/s    1258
Toshiba TR200 240GB              3919 KB/s     979
Micron 1100 256GB                5195 KB/s    1298
Adata SX950 240GB                5917 KB/s    1479
Sandisk SD8SB8U2561122 256GB     6936 KB/s    1734
Kingston SUV400S37480G 480GB     2615 KB/s     653
Corsair Force LE200 240GB        2970 KB/s     742
PNY CS900 240GB                  3910 KB/s     977
Samsung 860 Pro 256GB            1883 KB/s     470
Crucial MX500 250GB              9878 KB/s    2469
Kingston SA400 240GB             2822 KB/s     705
 
I'm currently piecing together the parts for a 40GbE production Ceph setup. Since we have the rack space, I'm separating the Proxmox Ceph servers from the Proxmox VM servers. I think management will be easier for me that way, and the load will be split up. To be honest, I'm not sure how much of a CPU and RAM hog Ceph gets, so it's better to be safe than sorry. I'll post some benchmarks when it's complete.

I have a question though: does the Ceph backend really need a separate switch, or can the backend and frontend (with separate NICs) connect to the same switch and be separated with VLANs? I'm planning on having one Arista 7050QX-32S, which has plenty of 40GbE ports for the whole setup (plus two 1GbE switches for the cluster network).
 
If you use our default setup, the cluster and public network are on the same IP range. A separation only makes sense if you really separate them physically (it can be the same switch); a simple separation by VLAN on the same NIC port will not bring any benefit.
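For illustration, a physically separated setup would look something like this in ceph.conf; the subnets below are placeholders, not values from this thread.
Code:
[global]
    # front side: clients and monitors
    public network = 10.10.10.0/24
    # back side: OSD replication and recovery traffic, on its own NIC
    cluster network = 10.10.20.0/24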
 
@LightKnight
RAM = 1GB per 1TB of disk space
CPU cores = 1x 64-bit AMD64 core per OSD + 1x per MON + 1x per MDS

Why don't you take a dual-port 40Gb network card for the nodes? Then you can make a 2x40Gbit bond per node and separate the networks via VLAN. I would take two switches for failure safety and interconnect them with MLAG. That way you could use smaller switches and still get the same number of ports.
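A sketch of such a bond in /etc/network/interfaces (Debian/Proxmox style); the interface names, VLAN ID and addresses are placeholders:
Code:
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0 enp2s0         # the two 40Gb ports
    bond-miimon 100
    bond-mode 802.3ad                 # LACP towards the MLAG switch pair
    bond-xmit-hash-policy layer3+4

auto bond0.100
iface bond0.100 inet static           # Ceph network on VLAN 100
    address 10.10.20.11
    netmask 255.255.255.0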

@all
Here is my test.
Node: Supermicro 2028BT-HNR+ (4x node)

Per node:
CPU: 2x E5-2650 v4 @ 2.20GHz
RAM: 512GB, 2400MHz
Network: dual-port 40Gb network card
OSDs per node: 6x Intel SSD P4500 4TB

rados bench -p test 60 write -b 4M -t 16 --no-cleanup
Code:
Total time run:         60.019578
Total writes made:      53479
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     3564.1
Stddev Bandwidth:       77.0995
Max bandwidth (MB/sec): 3732
Min bandwidth (MB/sec): 3384
Average IOPS:           891
Stddev IOPS:            19
Max IOPS:               933
Min IOPS:               846
Average Latency(s):     0.0179543
Stddev Latency(s):      0.0118561
Max latency(s):         0.250379
Min latency(s):         0.00742251

rados bench -p test -t 16 60 seq
Code:
Total time run:       58.703606
Total reads made:     53479
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   3644
Average IOPS:         911
Stddev IOPS:          33
Max IOPS:             970
Min IOPS:             820
Average Latency(s):   0.0168323
Max latency(s):       0.425657
Min latency(s):       0.00464903

Best regards
Mario
 
@Alwin
Maybe I didn't explain it enough. The Ceph servers will have two 40GbE NICs: one for the Ceph network and one for the public VM network. The VM servers will have one 40GbE card for the VM network. Everything will be on one switch, but the Ceph traffic will be on its own VLAN, separate from the VM VLANs. That's probably going to change with the idea below, though. I'm guessing a separate VLAN on a bonded link would still be beneficial?

@Mario Hosse
I know the minimum requirements; I'm not sure of the real-world requirements. Take the recent Micron MySQL Ceph RBD article, for example (can't post links yet; look up "Micron Ceph MySQL"). I know it's a benchmark aimed at maxing out resources, but with only 5 MySQL client servers the 44-core storage servers were at 30% CPU. It made me consider using E5s for my storage nodes, but I think I'm going to save some of the budget and go with 8-core Xeon-Ds instead. We're running SATA SSDs, not NVMe, so the load should be smaller since they're slower. That limits me to one dual-port 40GbE NIC for the Ceph nodes, which is fine, I suppose. A dedicated dual 40GbE card for a handful of SATA SSDs is overkill anyway.

Thanks for the MLAG idea, it's so obvious I can't believe I didn't think of it myself!
 
@LightKnight
The RAM requirements of Ceph are pretty close to the real world; see this example of top output without load:
Code:
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   3256 ceph      20   0 5711404 4.664g  27048 S   5.9  0.9 170:31.73 ceph-osd
   3383 ceph      20   0 5602124 4.561g  26728 S   5.9  0.9 154:00.58 ceph-osd
   4041 ceph      20   0 5551536 4.499g  26732 S   5.9  0.9 144:38.02 ceph-osd
   3383 ceph      20   0 5602124 4.561g  26728 S   1.7  0.9 154:00.63 ceph-osd
   3256 ceph      20   0 5711404 4.664g  27048 S   1.3  0.9 170:31.77 ceph-osd
   3508 ceph      20   0 5545372 4.506g  26820 S   1.0  0.9 146:08.76 ceph-osd
   3630 ceph      20   0 5616528 4.567g  26592 S   1.0  0.9 140:29.78 ceph-osd
   3917 ceph      20   0 5518576 4.480g  26548 S   1.0  0.9 140:16.43 ceph-osd
   4041 ceph      20   0 5551536 4.499g  26732 S   1.0  0.9 144:38.05 ceph-osd
   3508 ceph      20   0 5545372 4.506g  26820 S   1.7  0.9 146:08.81 ceph-osd
   3630 ceph      20   0 5616528 4.567g  26592 S   1.7  0.9 140:29.83 ceph-osd
   3256 ceph      20   0 5711404 4.664g  27048 S   1.3  0.9 170:31.81 ceph-osd
   3383 ceph      20   0 5602124 4.561g  26728 S   1.3  0.9 154:00.67 ceph-osd
   4041 ceph      20   0 5551536 4.498g  26732 S   1.0  0.9 144:38.08 ceph-osd
   3917 ceph      20   0 5518576 4.480g  26548 S   0.7  0.9 140:16.45 ceph-osd
   2843 ceph      20   0 3240136 541728  22464 S   0.3  0.1  50:46.19 ceph-mon
In normal operation, the CPU load is between 3-15% on 48 cores.
One dual-port 40GbE NIC in MLAG mode is perfect.
 
My 2 cents ...
Run on a 56Gbit/s network; for the configuration, see my signature.
Code:
Total time run:         60.022982
Total writes made:      41366
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2756.68
Stddev Bandwidth:       174.339
Max bandwidth (MB/sec): 2976
Min bandwidth (MB/sec): 2228
Average IOPS:           689
Stddev IOPS:            43
Max IOPS:               744
Min IOPS:               557
Average Latency(s):     0.0232139
Stddev Latency(s):      0.00889217
Max latency(s):         0.246315
Min latency(s):         0.00900196
 
I wonder if forcing lz4 compression in BlueStore helps real-world performance like it does with ZFS. It would be interesting if someone could run a database benchmark with and without lz4 in a VM.
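If someone wants to try it: on BlueStore OSDs, compression can be forced per pool; a minimal sketch, using the 'rbd' pool as an example.
Code:
ceph osd pool set rbd compression_algorithm lz4
ceph osd pool set rbd compression_mode force   # compress all writes, not just hinted ones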
 
Is there a way to hack in support for direct-write EC pools in Ceph? I think the barrier presently is that we can't specify the data pool (since direct-write EC pools still need a replicated pool for metadata). I feel that for smaller networks this might help with throughput (halving the amount of data for replication in some cases). I'm using SSDs with Ceph only for database volumes where I need the IOPS (but not the throughput, really), and I'd love to be able to reduce network throughput a bit using EC while saving on disk space. The extra CPU isn't really a concern, because my hosts tend to run short on RAM well before CPU (but I may also have a very different use case than most).
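For what it's worth, upstream Ceph (Luminous and later) can place RBD data on an EC pool via --data-pool, with a replicated pool keeping the metadata; whether the Proxmox tooling exposes this is another question. A sketch, with pool and image names as placeholders:
Code:
# create an EC pool and allow partial overwrites (required for RBD)
ceph osd pool create ecpool 128 128 erasure
ceph osd pool set ecpool allow_ec_overwrites true
# metadata stays in the replicated 'rbd' pool, data goes to 'ecpool'
rbd create --size 100G --data-pool ecpool rbd/dbvol1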
 
What does the direct fio output look like (fio test taken from the benchmark paper)? This would give us a comparison of the NVMe drives, especially on latency.

bw=170912KB/s, iops=42728,
lat (usec): min=19, max=557, avg=23.09, stdev= 2.97

And would you mind sharing a rados bench without DPDK?
I'm not using DPDK yet. I've been busy putting out fires :p
 
The Aprox app has been constantly crashing on Android Oreo. I would appreciate it if you could fix it. Thanks.

This is off-topic; please contact the Aprox app developer for help.
 
Also, the introduction discusses the possibility of using a 3-node cluster, with the comment that in a three-node cluster "the data is still available after the loss of a node". While true, this is sorely incomplete and misleading. If you are going to make this statement, you really owe your readers at least a slightly more detailed treatment of failure modes, showing why it takes "replication count" + 1 nodes (four in this case) to maintain fully stable operation with a failed node, and some treatment of why odd numbers of nodes create more resilient outcomes.

Why is that? Can you explain? I was contemplating a 3-node cluster and started doing the necessary reading when I stumbled upon your post. Why do you need "replication count" + 1?
 
Why is that? Can you explain? I was contemplating a 3-node cluster and started doing the necessary reading when I stumbled upon your post. Why do you need "replication count" + 1?
Hi,
from my point of view, that's not true - you can build a 3-node Ceph cluster without issues.
One node can fail without data loss.

But the downtime of the failed node should not be too long, because on three nodes Ceph can't remap the data to OSDs on another host to reach the replica count of three again (the default CRUSH rule places each replica on a different host), so the cluster stays degraded until the node is back.
This also depends on the amount of data: often, in much bigger Ceph setups, it makes no real sense to remap all data to other nodes, because it is faster to bring the failed node back (spare server...). E.g. if one node has 10x 4TB OSDs, you need a long time to rebalance the data across the other nodes.
And you need the free space on the other nodes, of course!

But Ceph wins with more nodes (more speed, less trouble during rebalance).
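The relevant replica settings can be checked directly (pool 'rbd' as an example); with size=3 and min_size=2, I/O continues with one node down, but a 3-node cluster stays degraded until the node returns:
Code:
ceph osd pool get rbd size
ceph osd pool get rbd min_size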

Udo
 
