SSD CEPH and network planning

BelCloud

Renowned Member
Dec 15, 2015
I'm considering switching to Ceph in order to add HA, but I'm not yet sure about the network requirements.

How much does Ceph's performance degrade compared to RAID10?

Considering the following setup:
3 servers with 8x 480GB Intel S3500 SSDs (500 MB/s max seq. read, 400 MB/s max seq. write)

In theory, one such server would be able to reach 4 GB/s read and 3.2 GB/s write maximum (assuming no performance degradation).
At maximum usage (utopia), 7.2 GB/s combined would mean 57.6 Gbps (6x 10GbE cards).
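
A quick back-of-envelope sketch of that arithmetic, using only the quoted drive specs (it ignores replication traffic and protocol overhead, so treat it as an upper bound):

```python
# Back-of-envelope check of the numbers above. Per-drive figures are the
# quoted S3500 sequential specs; replication traffic and protocol overhead
# are not accounted for.
DRIVES_PER_NODE = 8
READ_MB_S, WRITE_MB_S = 500, 400

node_read = DRIVES_PER_NODE * READ_MB_S          # 4000 MB/s = 4 GB/s
node_write = DRIVES_PER_NODE * WRITE_MB_S        # 3200 MB/s = 3.2 GB/s
combined = node_read + node_write                # 7200 MB/s = 7.2 GB/s

# 1 MB/s of payload needs 8 Mbit/s of raw bandwidth
print(f"combined {combined} MB/s ≈ {combined * 8 / 1000:.1f} Gbit/s per node")
```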

However, in real life, how much would the disks' performance degrade? What network connection would I require per server?

Kind regards
 
How much does Ceph's performance degrade compared to RAID10?

That depends on the underlying replication used, the write patterns, and whether or not your dedicated Ceph network will be able to deal with it.


I suggest you have a look at this ceph documentation:
http://docs.ceph.com/docs/hammer/architecture/
specifically this diagram:
[diagram from the architecture docs: a client write going to the primary OSD and being replicated to the secondary OSDs before it is acknowledged]


When you write to a Ceph OSD, that write is only registered as complete by the client once it has been registered in all relevant journals of the OSDs involved (1/2/3).
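
To make that concrete, here is a deliberately simplified latency model of that write path; the timing numbers in it are invented placeholders, not measurements:

```python
# A deliberately simplified model of the write path described above: the
# client only sees the write as complete after the primary and the replica
# OSDs have it in their journals.

def replicated_write_latency_ms(client_rtt, journal_ms, replica_rtts):
    """Rough client-visible latency of one replicated write, in milliseconds."""
    # The primary forwards the write to all replicas in parallel and must wait
    # for the slowest of them (plus its own journal write) before it can ack.
    slowest_replica = max((rtt + journal_ms for rtt in replica_rtts), default=0)
    return client_rtt + max(journal_ms, slowest_replica)

# Example: 0.1 ms network round trips, 0.05 ms SSD journal write, 2 replicas
print(replicated_write_latency_ms(0.1, 0.05, [0.1, 0.1]))   # ~0.25 ms
```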




What this means for you, I'm going to try and dumb down a little, so bear with me:
  • Now in your example you have 24 SSDs.
    • Replication of 1 (no drive may fail): you write a single copy of each file ==> the usable capacity of all 24 drives
    • Replication of 2 (you can lose one drive): you essentially write 2 copies of each file ==> the usable capacity of 12 drives
    • Replication of 3 (you can lose 2 drives): you essentially write 3 copies of each file ==> the usable capacity of 8 drives (a quick calculation follows this list)
  • You have 3 nodes, so you probably want to replicate on the bucket type of "host" rather than the bucket type of "OSD".
    • Replication 1: the file gets written to the next OSD in line (if that is on the local node, then no network utilisation)
    • Replication 1: the file gets written to the next OSD in line (if that is not on the local node, then network utilisation occurs - including latency)
    • Replication 2: the file gets written to the next OSD in line (network utilisation likely), and once that is done it gets written to a second OSD on a second node (network utilisation on one link + added latency)
    • Replication 3: the file gets written to the next OSD in line (network utilisation likely), and once that is done it gets written in parallel to a second OSD on a second node and a third OSD on a third node (network utilisation on 2 links + added latency)
  • Now Ceph also has a journal for every OSD, which means each write effectively hits the disk twice (once in the journal, once as data).
  • But fear not, Ceph also has something called placement groups (PGs) - check http://docs.ceph.com/docs/master/rados/operations/placement-groups/#how-are-placement-groups-used for reference.
    • It is basically a way for Ceph to "know" where an object is located.
    • That number is configurable - but afaik the guidance is to aim for a total of around 100 PGs per OSD (see the sketch after this list).
    • With this it is statistically very likely that a large number of parallel writes of multiple objects end up in different PGs on different OSDs:
      • you use the space "evenly"
      • you get the benefit of writing objects in parallel
      • this is where your SSDs' ability to accept parallel writes comes in.
        • the more parallel writes they can handle, the more throughput and IOPS your Ceph cluster is able to deliver.
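
Here is a quick sketch of the capacity and PG numbers for this 24-SSD example; it only uses figures already mentioned in this thread plus the ~100-PGs-per-OSD rule of thumb from the linked docs:

```python
# Usable capacity per replication factor for 24x 480 GB SSDs, plus a PG count
# target of roughly 100 PGs per OSD rounded to a power of two. All figures
# come from this thread; nothing here is measured.
import math

OSDS = 24
SSD_GB = 480
raw_tb = OSDS * SSD_GB / 1000                     # ~11.5 TB raw

for replicas in (1, 2, 3):
    print(f"size={replicas}: usable ≈ {raw_tb / replicas:.1f} TB "
          f"(the capacity of {OSDS // replicas} drives)")

# Target PG count for a single pool with size=3: (OSDs * 100) / size,
# rounded to the nearest power of two.
target = OSDS * 100 / 3
pg_num = 2 ** round(math.log2(target))            # 800 -> 1024
print(f"suggested pg_num: {pg_num}")
```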

TL;DR
In short, for a replication 3 pool, expect a reduction by a factor of 3 in maximum write throughput under normal usage, and up to a factor of 6 during heavy usage (depending on your SSDs' ability to handle a large number of parallel writes). And factor in at least the effect of your network latency.
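
One plausible reading of those factors (the post's own derivation was cut, per the edit note below): the 3x comes from the three replicas, and the 6x from each replica also being written twice on its OSD (journal + data, as noted later in the thread). A rough sketch:

```python
# The TL;DR factors spelled out, under the assumption that 3x = three replicas
# and 6x = three replicas each written twice (journal + data) on their OSD.
replicas = 3              # pool size
journal_factor = 2        # journal write + data write per OSD

node_write_mb_s = 8 * 400                        # the 3.2 GB/s from the first post
best = node_write_mb_s / replicas                # ~1066 MB/s client-visible writes
worst = node_write_mb_s / (replicas * journal_factor)   # ~533 MB/s
print(f"expected client write throughput per node: {worst:.0f}-{best:.0f} MB/s")
```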

Edit: I just cut a bunch of this post because it was riddled with (early-morning) math mistakes.
 
One small quibble with the above. The default behavior is to acknowledge the write when it has been registered in a quorum of journals (more than n/2 of them, not all of them). Assuming you keep an odd number of replicas in the pool, this guarantees a "quorum" of replicas in case things need to be recovered. This means in a pool of size 3 that 2 of the 3 replicas must be confirmed in order to confirm the write.

So in a pool with 3 replicas the expected write "slowdown" would be something close to 2x nominally and 4x in exceptional cases (as opposed to the 3x/6x noted above). I believe it's actually a bit better than that, because the primary OSD does not have to wait for its own journal write before it can launch the writes to the replicas - but you have to review a complete ping-pong diagram of the transaction to get all the details right.

In order to improve write latency, at increased risk to resiliency, you can reduce the number of write replicas that must complete before the write is confirmed. The option "pool min size" can be set to determine the number of replicas that must be written before a write is acknowledged. 'Min size 1' acknowledges the write after 1 OSD is written, etc. 'Min size 0' restores the default behavior.

Probably not something to be messed with casually, however. If you move on after writing fewer than a quorum of replicas and something "bad" happens before the writes are propagated, you risk the integrity of your data.
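
For reference, a minimal sketch of inspecting and changing min_size on a pool with the standard `ceph osd pool get/set` commands (the pool name below is a placeholder):

```python
# Minimal sketch: read and change a pool's min_size via the standard ceph CLI.
import subprocess

POOL = "rbd"  # placeholder pool name - substitute your own

def pool_get(var):
    # "ceph osd pool get <pool> <var>" prints e.g. "min_size: 2"
    out = subprocess.run(["ceph", "osd", "pool", "get", POOL, var],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def pool_set(var, value):
    # "ceph osd pool set <pool> <var> <val>"
    subprocess.run(["ceph", "osd", "pool", "set", POOL, var, str(value)],
                   check=True)

print(pool_get("size"), pool_get("min_size"))
# Relax to a single acknowledged replica (risky, see the caution above),
# then restore:
# pool_set("min_size", 1)
# pool_set("min_size", 2)
```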
 
How much does Ceph's performance degrade compared to RAID10? [...] In real life, how much would the disks' performance degrade? What network connection would I require per server?

It mainly depends on the I/O pattern.
If you do big block reads, no problem, you'll reach max speed.

If you do lots of small I/Os (4k), be careful to choose high-frequency CPUs (Intel ~3 GHz).

With 3 nodes, 2x 10-core 3.1 GHz CPUs, replication x3 and 18 S3610 OSDs, I can reach 70,000 IOPS 4K write and 600,000 IOPS 4K read.



Note that for writes, you'll write the data twice on disk (1x journal, 1x data). This should change in the future (~1 year) with the new Ceph BlueStore implementation.
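
If you want to sanity-check numbers like these on your own cluster, here is a minimal sketch using the standard rados bench tool (the pool name is a placeholder; point it at a scratch pool, not a production one):

```python
# Minimal sketch: 4 KiB write and random-read benchmarks with "rados bench".
import subprocess

POOL = "testpool"  # placeholder - use a scratch pool, not a production one

# 60 seconds of 4 KiB writes with 16 concurrent ops, keeping the objects...
subprocess.run(["rados", "bench", "-p", POOL, "60", "write",
                "-b", "4096", "-t", "16", "--no-cleanup"], check=True)

# ...so a 4 KiB random-read pass has something to read, then clean up.
subprocess.run(["rados", "bench", "-p", POOL, "60", "rand", "-t", "16"],
               check=True)
subprocess.run(["rados", "-p", POOL, "cleanup"], check=True)
```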
 
