Understanding Ceph

jkirker

Member
Feb 1, 2016
Orange County, California
Hi All,

I'm currently evaluating Proxmox and love what I see so far.

I've got a couple of questions though.

Am I correct in my understanding that Ceph cannot be mounted like a normal drive or NFS share? And that it can't be rsync'd to or from directly on a host node? (Why would I want to? I like the ability to reach in and poke at files from anywhere if necessary.)

Second, I've set up a multi-server (5 currently) test cluster environment to learn the ins-and-outs, simulate disaster and recovery, etc. Ultimately I may roll this out and migrate our hosting environment to it for easier scalability.

In the test environment each of the host nodes has a combination of 2x 1TB SSD's and 2x2TB spinners with a PCIe SSD for boot and journals. SUPER LOCAL! ;)

And I'm wondering, if I put my VMs into an SSD Ceph pool for HA, whether Ceph will keep most of the content (read and/or write) on the local OSDs so as not to traverse the network, and then buffer/cache the writes to the other clustered nodes.

Again, sorry if these are newbie questions. I'm trying to do as rapid a braindump as I can while I figure out which platform to run with.

Thanks in advance for your thoughts and time.

John
 
Am I correct in my understanding that Ceph cannot be mounted like a normal drive or NFS share? And that it can't be rsync'd to or from directly on a host node? (Why would I want to? I like the ability to reach in and poke at files from anywhere if necessary.)

You "mount" ceph pools via (k)rdb.
Ceph is not a file-System, Its a Block Device / Object storage. You DO NOT want to poke inside it (unless its last effort rescue attempt, cause someone royally screwed up (inwhich case you use "rados")

There is a File-system available for CEPH, called CephFS, it requiers the use of Meta Data Server(s) aka MDS(s).
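
To make the object/block point concrete, here is a minimal sketch of how you talk to Ceph the way it is meant to be talked to - through the rados (object) and rbd (block) layers rather than a filesystem. It assumes the python3-rados / python3-rbd bindings are installed, a readable /etc/ceph/ceph.conf, and a pool called "rbd"; the pool, object and image names are just placeholders.

```python
# Sketch: "poking at" Ceph through the object (rados) and block (rbd) layers.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')          # handle to the placeholder pool
    try:
        # Object storage: write/read a raw object - no directories, no files.
        ioctx.write_full('hello-object', b'stored as an object, not a file')
        print(ioctx.read('hello-object'))

        # Block storage: create a 1 GiB RBD image; a VM disk is exactly this.
        rbd.RBD().create(ioctx, 'test-image', 1024 ** 3)
        print(rbd.RBD().list(ioctx))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```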

The reason you actually want to use Ceph is one of the following:
  1. You like to keep your data safe on multiple nodes at the same time (your data is VERY important). You could also use Gluster for this.
  2. You like to keep large amounts of data, but would not like the huge overhead of replication --> erasure-coded pools (see the sketch after this list).
  3. You like to be able to program the ability to overcome failure domains into your replicated / erasure-coded pools, thereby allowing you to use cheap, non-redundant, consumer-grade hardware, from servers to storage to network equipment --> bucket types are your friend.
  4. You like to better prioritise the strength of your available storage media for sizzling, hot, warm, cold and ultra-cold data.
  5. You have 30+ nodes with 1.5k disks per cluster, and the network capacity to use it.
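
To put a rough number on the replication-vs-erasure-coding overhead from point 2, here is a back-of-the-envelope sketch. Replica 3 and an EC 4+2 profile are only common example values, not a recommendation for your cluster.

```python
# Back-of-the-envelope usable capacity: replication vs erasure coding.
def replicated_usable(raw_tb, size):
    """Usable TB of a replicated pool keeping `size` full copies."""
    return raw_tb / size

def erasure_usable(raw_tb, k, m):
    """Usable TB of an erasure-coded pool with k data + m coding chunks."""
    return raw_tb * k / (k + m)

raw = 100.0  # TB of raw disk, arbitrary example
print(f"replica 3: {replicated_usable(raw, 3):.1f} TB usable (200% overhead, survives 2 failures)")
print(f"EC 4+2   : {erasure_usable(raw, 4, 2):.1f} TB usable (50% overhead, survives 2 failures)")
```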

And I'm wondering, if I put my VMs into an SSD Ceph pool for HA, whether Ceph will keep most of the content (read and/or write) on the local OSDs so as not to traverse the network, and then buffer/cache the writes to the other clustered nodes.

Read up on http://docs.ceph.com/docs/master/rados/operations/crush-map/
and devour all of these posts: http://www.sebastien-han.fr/blog/categories/ceph/

The only limitation here is your imagination and your ability to "code" a custom crush location hook script and define custom, non-run-of-the-mill pools.
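
For illustration only: a crush location hook is just any executable that prints the OSD's CRUSH location as key=value pairs when the OSD starts (you point ceph.conf at it with "osd crush location hook"). Below is a hypothetical Python sketch; the hostname-to-rack mapping is completely made up.

```python
#!/usr/bin/env python3
# Hypothetical crush location hook: Ceph calls it with --cluster/--id/--type
# and expects one line of key=value pairs describing where this OSD lives.
import argparse
import socket

parser = argparse.ArgumentParser()
parser.add_argument('--cluster')
parser.add_argument('--id')
parser.add_argument('--type')
args, _ = parser.parse_known_args()

host = socket.gethostname().split('.')[0]
# Made-up example mapping: node1..node3 live in rack1, everything else in rack2.
rack = 'rack1' if host in ('node1', 'node2', 'node3') else 'rack2'

print(f"host={host} rack={rack} root=default")
```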

In the test environment each of the host nodes has a combination of 2x 1TB SSD's and 2x2TB spinners with a PCIe SSD for boot and journals. SUPER LOCAL! ;)

You probably wanna read up on cache tiering; your setup would provide the perfect use case for it (assuming you use a small SSD as the OS drive, and not the PCIe SSD).

Here are some links:
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/ - take note of the different cache modes.
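
If it helps, wiring up a cache tier is just a handful of the "ceph osd tier" commands from that page. Here is a sketch that drives them from Python; the pool names "hdd-pool" and "ssd-cache" are placeholders and both pools have to exist already.

```python
# Sketch: put an SSD pool in front of an HDD pool as a writeback cache tier.
import subprocess

def ceph(*args):
    subprocess.run(['ceph', *args], check=True)

ceph('osd', 'tier', 'add', 'hdd-pool', 'ssd-cache')               # attach the cache pool
ceph('osd', 'tier', 'cache-mode', 'ssd-cache', 'writeback')       # one of the cache modes
ceph('osd', 'tier', 'set-overlay', 'hdd-pool', 'ssd-cache')       # route client IO through the cache
ceph('osd', 'pool', 'set', 'ssd-cache', 'hit_set_type', 'bloom')  # needed so the tier can track hits
```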

Here are some more links to read up on:
https://forum.proxmox.com/threads/ssd-ceph-and-network-planning.25687


Question: what types of nodes are these going to be (CPU/RAM), and what is the network capacity situation (NIC type + number) like?
Both make a big difference for Ceph.
 
Thanks Q... I haven't slept much over the last 6 days... Spent a few days with OpenStack but it was a bit too much to bite off for now. Let's hope my wife and kiddo don't find a new daddy over the next 5 while I digest the above. ;) Thanks again.

Curious your thoughts on this concept as well. I'm also toying with the idea of teaming Hot/Hot 2 node clusters direct on local storage with shared filesystems (likely splitting vm's based on space/load) while backing to Ceph for emergency portability.

Perhaps this might take some of pressure off the network? I'm close to pulling the trigger on 10GB but I'm still on the fence using nic teaming/bonding. I want to do it but I won't know if it's a justified necessity until I start pushing a Ceph model out into the wild for a bit of testing.

In all likelihood I'll start w/ 10-16 nodes but I have the spare equipment to spin up a full rack - but again, I'm not sure the networking equipment I have in place will be able to support it. :{
 
Curious your thoughts on this concept as well. I'm also toying with the idea of teaming Hot/Hot 2 node clusters direct on local storage with shared filesystems (likely splitting vm's based on space/load) while backing to Ceph for emergency portability.

We are operating around 105 Proxmox/Ceph nodes at work (5040x HDD + 840x NVMe Samsung 950 Pro), so it is not just my view, it is my business's view :p

Perhaps this might take some of pressure off the network? I'm close to pulling the trigger on 10GB but I'm still on the fence using nic teaming/bonding. I want to do it but I won't know if it's a justified necessity until I start pushing a Ceph model out into the wild for a bit of testing.

If you use SSD pools, 4x 1G is stretching it (a single link can only handle 125 MB/s - less than your SSD can produce). 1x 10G might be doable: 5 x 2 x 500 MB/s = 5 GB/s of incoming bandwidth (best-case scenario) from the 2x 1TB SSDs on each node alone.

With Ceph and Proxmox you want to go Open vSwitch and stick all your links (dedicated to Ceph) into a balance-tcp bond. Use Ceph with a private and a public network operating over that bond. With, say, 6 nodes at 5 disks per node, you get 5 x 2 x (6-1) = 50 potential outgoing/incoming IP/port combinations (see the sketch below), so you are utilising your links to the max, as OVS makes sure to rebalance the load on a link as flows appear/disappear. That works because each OSD runs on its own unique port.
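
For what it's worth, here is that arithmetic written out as a quick sanity check; the 500 MB/s per SSD and the 6-node / 5-disk example are just the rough figures used above.

```python
# The arithmetic from the two paragraphs above.
nodes, ssds_per_node, ssd_mb_s = 5, 2, 500       # the test cluster, ~500 MB/s per SATA SSD
link_1g_mb_s, link_10g_mb_s = 125, 1250          # rough usable payload per link

aggregate = nodes * ssds_per_node * ssd_mb_s     # 5 x 2 x 500 = 5000 MB/s best case
print(f"aggregate SSD bandwidth: {aggregate} MB/s "
      f"vs {link_1g_mb_s} MB/s per 1G link, {link_10g_mb_s} MB/s per 10G link")

# Why balance-tcp bonds spread well: every OSD listens on its own port, so with
# 6 nodes and 5 OSDs per node each host sees 5 x 2 x (6 - 1) = 50 distinct
# flows (public + cluster) that OVS can distribute across the bonded links.
osds_per_node, example_nodes = 5, 6
flows = osds_per_node * 2 * (example_nodes - 1)
print(f"potential flows per host: {flows}")
```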
 
Nice... Thanks for sharing your config and info. You find it pretty manageable?

And good points on the bottlenecking issues.

I'm probably going to blow my test cluster away (again) and tweak a couple things now that I'm getting deeper into the weeds.

Boy, I wish Ceph could use a partial drive out of the box so that I could split an SSD. I read somewhere about a guy adding an LVM partition to Ceph via the CrushMap but I don't want to get myself into trouble hot-rodding things.

You mentioned to use Ceph on both public and private networks? I was planning to run Ceph private-only and then cheap out by expanding my NICs per node from 2 to 4 or 6 and then teaming them. (LOTS OF CABLES) Either that or I'll just bite the bullet and go get a 10G switch and be done with it, which will likely be an eventual necessity as the network continues to grow.
 
PS. I've got a STACK of legacy SuperMicro 1u 1/2 depth 2x4 XEON web-servers w/ 32-64GB of RAM that I've been replacing here and there. I could spin them up for node leverage. My power isn't metered so what the heck - but it's more HW to manage..

The downside is that they only have room for 2x3.5's or 4x2.5's @ 3Gb/s if I stack them. I could take them to 6 with a 4-port SATAIII card and a ribbon riser in the case while adding an extra nic as well. Should I utilize these guys or... I don't necessarily need them but they may provide more leverage? (Seems that's what Ceph eats for breakfast. But again - Network issues?)
 
The downside is that they only have room for 2x3.5's or 4x2.5's @ 3Gb/s if I stack them. I could take them to 6 with a 4-port SATAIII card and a ribbon riser in the case while adding an extra nic as well. Should I utilize these guys or... I don't necessarily need them but they may provide more leverage? (Seems that's what Ceph eats for breakfast. But again - Network issues?)
Ceph just loves to do stuff in parallel. That's where it excels.

If performance is not even a secondary concern (which does not sound like your case), then a single 1G link per node will be challenging but doable. I have been told (never tried it in production) that 4x 1G (dedicated to Ceph, openvswitch balance-tcp) can already get bogged down by a 2x SSD + 2x HDD setup on a 4-node cluster during normal use (let alone benchmarking, backfilling, or large file writes).

I always recommend at least 10G for Ceph, better dual 10G, especially when SSD-based OSDs are involved and the cluster has more than 6 of those OSDs. And switch to openvswitch with balance-tcp to get the best out of a bond.

Personally we use 2x 10G + 2x 40G at work while waiting for 25G-based NICs to get cheaper. But we have a super huge number of disks per node, so that is not really representative.



Lemme ask: what are you after with Ceph? There are so many options; I just gotta ask this, although I normally despise the question.
 
...
You mentioned to use Ceph on both public and private networks? I was planning to run Ceph private-only and then cheap out by expanding my NICs per node from 2 to 4 or 6 and then teaming them. (LOTS OF CABLES) Either that or I'll just bite the bullet and go get a 10G switch and be done with it, which will likely be an eventual necessity as the network continues to grow.
Hi,
From the Ceph point of view, the public network is the face towards the clients (and the Ceph mons) - the traffic between OSDs and clients (qemu-kvm) runs there. This network must not (should not) be visible from the VM network!
The private network is the network between all OSD nodes, used to transfer the replication data (and "rebuild" data). It should be a separate network to provide enough bandwidth even when the clients do a lot of IO.
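
As a sketch (the subnets are made-up placeholders), this is roughly how those two networks end up being declared in the [global] section of ceph.conf:

```python
# Made-up subnets; "public network" / "cluster network" are the ceph.conf
# settings for the two networks described above.
PUBLIC_NET = "192.168.10.0/24"    # clients (qemu-kvm) and mons talk to the OSDs here
CLUSTER_NET = "192.168.20.0/24"   # OSD <-> OSD replication / rebuild traffic only

print(f"""[global]
public network  = {PUBLIC_NET}
cluster network = {CLUSTER_NET}""")
```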

One remark on 1G vs. 10G - with 10G you get the additional benefit of lower latency! And latency can be a problem...

Udo
 
Q, I pulled the trigger on a couple S4810's yesterday and a bunch of dual 10G nics to reduce my network concerns. ;)

So I've got a question for you or anyone else.

I've got a few 12x 3TB SAS systems here, into which I'll put 2x 2x10G NICs and team them.

I'm wondering if you have any experience or opinions on how to maximize performance for OSDs. Would you recommend that I keep them as JBOD, or form one or several RAID volumes on each?

This is my current planned go-out config.
3 monitor nodes (2xQuad/16GBR/2x 1GBN on Storage Network)
3 12x2-3TB OSD boxes (2xQuad/64GBR/2x2x10GBN on Storage Network)
2-3 3x2TB OSD boxes (2xQuad/64GBR/2x2x10GBN on Storage Network) (These are a bit slower)
8 front facing VM nodes (2xDeca/128-256GBR/1x10GBN on Storage Network) +1 or 2x2TB for OSD's*

For starters only 4 of the VM nodes would be hot/active and the other 4 would either be spare or just processing OSD's. As I introduce more OSD nodes I'd remove the OSD node drives from front facing equipment.

Does it seem reasonable? Does anything stick out as a no-no? Trying to get by with what I've got or recently acquired for starters.

PS. This is all mainly for hosting - some dedicated, some vm and some shared. So other than the dedicated or vm it might be a challenge to pool things based on process type/application type. So I'm looking for the best "blended" performance. I'm also studying this: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
 
I'm wondering if you have any experience or opinions on how to maximize performance for OSDs. Would you recommend that I keep them as JBOD, or form one or several RAID volumes on each?
Hi,
Normally you should use single disks and no RAID (Ceph limits the IO to one OSD...), but RAID can have some pros (some people use replica 2 with RAIDed OSDs). Google the Ceph mailing list for the subject "anti-cephalopod question".

Udo
 
The main advantages of RAID setups are:
- No rebalance during a single disk failure (you can get away with slower link speeds)
- You can reduce the replication count to achieve the same number of copies (again, lower link-speed requirements - see the sketch below)
- Less overhead on the mons due to fewer OSDs to maintain (we are talking 1000s of OSDs where this becomes interesting)

Disadvantages:
- Read speeds on RAIDed setups are slower than on non-RAIDed JBOD. For write speed it depends greatly on the replication count.
- You add a failure domain in the form of a (physical) RAID controller that you potentially need to account for when doing custom replication rules.

Those are my personal pros/cons from my experience with Ceph.
@udo's suggestion is spot on though. The Ceph mailing list is the way to go there.
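
To put rough numbers on the "reduce the replication count" point above, here is a small sketch comparing plain-disk OSDs with replica 3 against RAID1-backed OSDs with replica 2. RAID1 is just an illustrative choice and the 24 TB raw figure is arbitrary.

```python
# Copies, usable capacity and replication traffic for two example layouts.
def summary(label, disks_per_osd, replica, raw_tb):
    copies = disks_per_osd * replica      # physical copies of each object
    usable = raw_tb / copies              # usable capacity out of the raw TB
    net_writes = replica - 1              # replica writes crossing the cluster network
    print(f"{label:18} copies={copies}  usable={usable:.1f} TB  "
          f"network writes per client write={net_writes}")

raw = 24.0  # e.g. 12 x 2 TB in one chassis
summary("JBOD + replica 3", 1, 3, raw)
summary("RAID1 + replica 2", 2, 2, raw)
```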

This is my current planned go-out config.
3 monitor nodes (2xQuad/16GBR/2x 1GBN on Storage Network)
3 12x2-3TB OSD boxes (2xQuad/64GBR/2x2x10GBN on Storage Network)
2-3 3x2TB OSD boxes (2xQuad/64GBR/2x2x10GBN on Storage Network) (These are a bit slower)
8 front facing VM nodes (2xDeca/128-256GBR/1x10GBN on Storage Network) +1 or 2x2TB for OSD's*
Are these all Proxmox + Ceph nodes?
Or are only the front-facing nodes powered by Proxmox?


Regardless of the answers to the above questions:
Leave the OSDs off the VM nodes unless you increase the network link (currently 1x 10G); rather, stick those drives into the faster or slower OSD boxes. This is, of course, unless you put them there for use with a custom CRUSH rule you have not mentioned anything about.
You want to maintain the ability to read/write 10G worth of data per VM node that your VMs can share, especially should there be any VM <-> VM traffic involved that you do not have an extra 10G link for.


You have 4x 10G links on your storage network for all OSD hosts, right? Do the switch(es) tying this together have the ability to do QoS on a VLAN/subnet basis?
If so, stick them into a single Open vSwitch based VMBR in balance-tcp mode. Then tweak your QoS to give Ceph_Cluster (OSD to OSD) a minimum of 30G and a burst of 40G, and Ceph_Public (client to mon, client to OSD) a minimum of 10G and a burst of 40G (see the sketch below for where that 30/10 split comes from). This is incoming QoS to a Ceph OSD node and is based on replication-4-only pools. Adjust based on your own CRUSH rules, replication modes and pools used.
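
In case it helps, here is a rough sketch of where that 30G/10G split comes from for a replication-4 pool: each client write arrives once over the public network and the primary OSD then pushes (size - 1) copies over the cluster network; reads and recovery traffic are ignored in this model.

```python
# Rough model: incoming traffic to an OSD node splits between public and
# cluster networks roughly as 1 : (size - 1) for a replicated pool of `size`.
def qos_split(replica_size, bond_gbit):
    cluster_share = (replica_size - 1) / replica_size   # 3/4 for size = 4
    public_share = 1 / replica_size                     # 1/4 for size = 4
    return cluster_share * bond_gbit, public_share * bond_gbit

cluster_min, public_min = qos_split(replica_size=4, bond_gbit=40)
print(f"Ceph_Cluster minimum ~{cluster_min:.0f}G, Ceph_Public minimum ~{public_min:.0f}G")
# -> Ceph_Cluster minimum ~30G, Ceph_Public minimum ~10G
```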

The front-facing nodes' storage network I'd stick into an OVS-based balance-tcp VMBR as well (if you have, or decide to have, multiple 1G or multiple 10G links). Not much need for QoS there (keep OSDs off these).
 
Thanks for your continued advice. I appreciate it.

While I'd love to put PM on every server for continuity, it's possible but unlikely that I will on the monitoring and/or OSD nodes. Licensing costs are a concern and I'm not 100% certain that support for Ceph via PM is fully available at this point. So, if I do fully implement Ceph at the DC I may try to arrange support directly through RedHat or a third party consultant - or try to figure out some type of event based support with someone. Funds are an issue while I work towards profitability on the new model.

Regarding system/network config, I have no issues putting 2x10G cards in the VM nodes. I had originally planned to do that but figured that it may be overkill - but maybe it's not.

I'm pretty sure that the Force10s will do QoS on a VLAN - they are pretty feature-rich.

This would be a big leap for me adopting new technology and really learning it so that we can support it. So hopefully via PM support and the community we'll be able to build something great and make it happen!

Maybe one day I'll be able to buy you a beer for your help and advice here. Or more. ;) Thanks again.
 
