Few Ceph questions

adamb

We are looking at deploying a cloud to host our clients' systems. I have been playing with Ceph for a few weeks now, am starting to get a good grasp of it, and really like the project.

Here is a little background on what we are looking to do. I plan on deploying 2 independent Ceph clusters in 2 different locations. We will be using incremental snapshots to keep the 2nd Ceph cluster up to date; this will be our failover location in case of fire/emergency. We will have either 10Gb or 40Gb fiber between the two locations. I also plan on building the Ceph cluster itself on a basic CentOS load unless others here think Proxmox would be better (we are a CentOS shop for the most part and would like to keep Proxmox strictly for virtualization purposes). Then we will be using some Proxmox front ends to provide VMs.

I have a couple of questions on the Ceph setup itself.

1. Are you guys presenting a pool directly to Proxmox?
- Or are you creating a block device within the pool and presenting that to Proxmox?

2. If a block device, do you create an individual block device for each VM, or run all VMs on a single block device?

I'm still wrapping my head around Ceph as there is a lot to it. I appreciate any input on the subject!
 
Proxmox uses the CEPH pool directly.

I only have three nodes running CEPH on our small testing Proxmox cluster, so I am no expert on the subject :D

It was much easier to use Proxmox on the CEPH servers than some other OS.
Once configured, many CEPH tasks can be performed in the Proxmox web interface.
Upgrades will likely be easier and more predictable when set up with Proxmox.

If you must use CentOS, only use it in your VMs; use Proxmox for your infrastructure.
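
To picture what "uses the pool directly" means: for each VM disk, Proxmox creates its own RBD image inside the pool, so listing the pool shows one image per disk. Roughly like this (pool name and VM IDs here are just examples):
Code:
# list the RBD images Proxmox has created in the pool
rbd -p rbd ls
vm-100-disk-1
vm-101-disk-1
vm-101-disk-2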
 

Sounds good, although I must say setting up Ceph on CentOS 6.5 was quite simple.

Am I correct in thinking that Proxmox creates a block device for each disk instance within the pool?
 
Here is a little background on what we are looking to do. I plan on deploying 2 independent Ceph clusters in 2 different locations. We will be using incremental snapshots to keep the 2nd Ceph cluster up to date; this will be our failover location in case of fire/emergency.
What are you going to use to create those snapshots to keep both clusters in sync?

We will have either 10Gb or 40Gb fiber between the two locations. I also plan on building the Ceph cluster itself on a basic CentOS load unless others here think Proxmox would be better (we are a CentOS shop for the most part and would like to keep Proxmox strictly for virtualization purposes). Then we will be using some Proxmox front ends to provide VMs.
I have several Ceph deployments. Some are on Ubuntu and some are collocated with Proxmox nodes. I don't think there is much difference between running Ceph on CentOS or Ubuntu, but the Ceph developers use Ubuntu to build and test Ceph. During my Ceph learning I used Ubuntu and am now comfortable with it. Using Ceph with Proxmox has many advantages. The ability to manage both Proxmox and Ceph from the same GUI is a big plus, and there is no need for separate physical nodes for MONs or admin. In one of my 7-node Proxmox+Ceph clusters, I use 3 of the nodes for Ceph OSDs only and the other 4 only for VMs. That is just a personal preference so the Ceph OSDs and VMs do not share the same resources.
If you are going to have a 10Gb or 40Gb link between the two locations, why not treat the entire Ceph cluster as one, skip the snapshotting in between, and just increase the number of replicas?
Are the two locations physically separate sites or in the same building?
 

Really appreciate the input! I will definitely be doing some testing with building my Ceph cluster directly on Proxmox.

I plan on using the rbd export-diff ability within Ceph to perform the incremental snapshots. I know this isn't real time, but we are OK with that.

http://ceph.com/dev-notes/incremental-snapshots-with-rbd/
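
The rough flow I have in mind (pool, image, snapshot and host names below are just placeholders) looks like this:
Code:
# one-time: seed the remote cluster with a full copy and a matching base snapshot
rbd snap create rbd/vm-100-disk-1@base
rbd export rbd/vm-100-disk-1@base - | ssh remote-ceph rbd import - rbd/vm-100-disk-1
ssh remote-ceph rbd snap create rbd/vm-100-disk-1@base

# repeat on a schedule: ship only the blocks changed since the last snapshot
rbd snap create rbd/vm-100-disk-1@snap1
rbd export-diff --from-snap base rbd/vm-100-disk-1@snap1 - | ssh remote-ceph rbd import-diff - rbd/vm-100-disk-1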

The remote site is roughly 45 miles away; we have some dark fiber between the two locations and just need to light it up. We will probably be using 10Gb copper for the backend (other than the fiber between the two locations) as it's our preferred medium and we already use it in a number of locations. We also plan on going SSDs across the board; we might entertain a tiered setup, but one disk medium just seems simpler.

I never even thought about the idea of doing one large cluster. I'm trying to think of all the caveats that would come with this scenario. It would need to be built in such a way that either site could still function if the other was completely down. As it sits I am only looking to have 2 replicas per site. I am still quite new to Ceph and I haven't seen a way to specify which replicas get placed on which OSDs. So far I have been testing some simple 3-node clusters, and when I set replicas to 2, I don't quite understand which 2 OSDs those replicas actually reside on. I would want to go to 4 replicas and have 2 at each site, I'm just not sure how to force 2 replicas at each site. The more I think about this, the better it sounds; I really appreciate the input and brainstorming assistance!
 
http://ceph.com/docs/master/rados/operations/crush-map/

With a properly configured CRUSH map you can define what OSDs are in what datacenter and CEPH will ensure replication across the two datacenters.
They have a whole hierarchy already defined:
Code:
# types 
type 0 osd 
type 1 host 
type 2 chassis 
type 3 rack 
type 4 row 
type 5 pdu 
type 6 pod 
type 7 room 
type 8 datacenter 
type 9 region 
type 10 root
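
Roughly, you put each host under a datacenter bucket and write a rule that picks two hosts in each datacenter. Something along these lines (bucket names, IDs and weights are made up for illustration, and the host buckets holding the OSDs are assumed to be defined above this):
Code:
datacenter dc1 {
        id -10
        alg straw
        hash 0
        item ceph-node1 weight 1.000
        item ceph-node2 weight 1.000
}
datacenter dc2 {
        id -11
        alg straw
        hash 0
        item ceph-node3 weight 1.000
        item ceph-node4 weight 1.000
}
root default {
        id -1
        alg straw
        hash 0
        item dc1 weight 2.000
        item dc2 weight 2.000
}
rule replicated_two_sites {
        ruleset 1
        type replicated
        min_size 4
        max_size 4
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

The pool would then use that rule with 4 copies, e.g. #ceph osd pool set rbd crush_ruleset 1 and #ceph osd pool set rbd size 4.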

If the fiber were cut you would have one issue: quorum.
I don't know enough about CEPH to figure out how to ensure both sides can keep running without the other and not mangle the data.
 
I would suggest 40Gb for the local backend also; it will make a huge difference. If cost is an issue, 40Gb InfiniBand works great.
The great advantage of having one large cluster across a remote distance is that you can move a whole bunch of VMs between sites. If one site goes dark completely, simply move the VMs from that site to the functional site. Since Ceph is one cluster, this provides the least downtime.

Let's say you have a total of 8 nodes for Ceph, 4 in each site. If you use 6 replicas, it will ensure that your data stays intact even if one site goes down completely. I think the trick is to keep an equal number of OSD nodes in each site, which you will anyway. Another benefit of Ceph+Proxmox nodes is that you can temporarily move VMs onto them in an emergency, given that these nodes have enough RAM and cores to go around.
Do keep in mind, though, that during a site failure your Ceph cluster will go through massive rebalancing, and that's where the 40Gb fibre will come in handy. Also, as soon as you notice one site is down, running #ceph osd set nodown and #ceph osd set noout will hold off the rebalancing, since we know the site is only down temporarily. As soon as the site comes back, do #ceph osd unset nodown and #ceph osd unset noout to start writing out the backlog that accumulated during the outage. Since you mentioned you are new to Ceph, I would suggest setting up a small cluster, let's say 4 Proxmox+Ceph nodes all together, and trying out the concept. You will like what you see.
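
In other words, the sequence for a known-temporary outage is roughly this (these flags are cluster-wide, nothing pool-specific):
Code:
# as soon as the remote site drops: stop Ceph from marking its OSDs out/down and rebalancing
ceph osd set nodown
ceph osd set noout

# ... link/site is repaired ...

# clear the flags so the backlog gets written out to the returned OSDs
ceph osd unset nodown
ceph osd unset noout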

I would like to point out to everybody, though, that this concept is very much viable in this case because Adam has a 40Gb fiber hard link between the sites. This setup would be somewhat unstable on a standard WAN due to the much higher latency.
 
So it looks like I can control this based on the CRUSH maps. I will have to play with those to get a better understanding of how they work. If I have them set up properly, I think I could get away with 4 replicas altogether, but I am unsure.

The quorum issue is a good one; I'm trying to determine the easiest way around that. Looking at this line from the documentation almost makes me feel like this wouldn't be possible:

Summarizing, Ceph needs a majority of monitors to be running (and able to communicate with each other), but that majority can be achieved using a single monitor, or 2 out of 2 monitors, 2 out of 3, 3 out of 4, etc.

I really appreciate both of your inputs; this is probably the biggest project of my career and it has my wheels turning.

I am going to get on testing this idea tomorrow as soon as I come in; I also plan on digging into InfiniBand to see what I can do with that price-wise.
 
The quorum part cannot be solved fully automatically since you "only" have two separate locations ;-) but I would guess this is not an issue, since a breakdown in a data center will likely require operator intervention anyway. Quorum, in a situation where one location is down, can be reestablished manually in the other data center. This does, however, require an uneven number of Ceph nodes in each location, e.g. each location hosts 5 Ceph nodes, which can then form a quorum on their own when the other data center is down.
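
As a rough sketch of that manual step (monitor IDs here are placeholders, and the exact service start/stop commands depend on the distro), you basically trim the dead site's monitors out of the monmap on a surviving monitor and inject it back:
Code:
# on a surviving monitor, with its daemon stopped:
ceph-mon -i mon-a --extract-monmap /tmp/monmap
# remove the monitors that live in the failed data center
monmaptool /tmp/monmap --rm mon-c --rm mon-d
# inject the trimmed map and start the monitor again
ceph-mon -i mon-a --inject-monmap /tmp/monmap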
 
I agree, this is not going to be an automatic transition.
The following is just one way this could work:
Taking the 8-node scenario, we would have one additional node in each location, powered off and not part of the cluster. As soon as a site-down situation is recognized, fire up the additional node and join it to the cluster to form quorum. After the other site is back up, simply remove it from the cluster and power it down. This way quorum will be restored and staff will keep working while the fallen site is being worked on.

Or,

Use an additional node at each site, but put it on a separate LAN/WAN and power supply purely for the purpose of quorum. The node can be very underpowered and thus cheap. That way, even if the main cluster in one site goes down, quorum will remain intact.
 
Starting to dig back into this today as I got sidetracked with a project yesterday.

Looking over the Ceph Server Proxmox wiki, it looks like I have to create a working Proxmox cluster before setting up the Ceph part of things. This part concerns me because I understand the cluster will be limited to 16 nodes. No doubt we would be starting with 4 nodes at each location and possibly one monitor node per location. Right off the bat we would be using 10 of the 16, and that is just for storage; it doesn't seem like it will scale well with our needs. We will never run VMs on the Proxmox Ceph storage nodes because they will be limited in resources; not only that, but I plan on keeping the front ends serving up the VMs as a completely different cluster, otherwise we would hit that 16-node limit as soon as we set it up. If I am wrong on this, please someone correct me, because I do like the sound of using Proxmox as my Ceph server OS, but a 16-node limit would instantly kill the idea.

The storage nodes we will be running are just some Supermicro-based 12-bay enclosures with an Intel E5 CPU and 32GB of RAM. The front ends will be HP DL380p Gen9s with 2 E5 CPUs and at least 768GB of RAM; we might even go with DL580s with 4 E5s and 1.5TB of RAM, but that is still up in the air.

I also want to avoid using 6 replicas; that seems like a huge waste of space. I am hoping I can stick with 4 replicas and get the CRUSH maps set up properly to do this.

Lots to test out, hoping to pin some of this down today.
 
The cluster is not limited to 16 nodes; I have a 20-node cluster running without doing anything special.
This is also not a limitation of Proxmox; it's a limitation of the underlying cluster system, corosync.

It is also possible to modify the cluster.conf file to accommodate more nodes.
I've read various numbers from 32 to 64 being the limit, but I'm not really sure since I've not gone past 20 yet.

Someone correct me if I am wrong, but I don't think it is necessary for a Ceph node to be part of the Proxmox cluster that is utilizing the Ceph storage.
I also do not think that a Ceph node needs to be part of a Proxmox cluster if it's only used for CEPH.
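
If the Ceph cluster is external, the Proxmox nodes just need the monitor addresses and a keyring to use it as RBD storage. A rough sketch of the entry in /etc/pve/storage.cfg (storage ID, pool name and monitor IPs are placeholders):
Code:
rbd: external-ceph
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        pool rbd
        username admin
        content images

The matching keyring then goes in /etc/pve/priv/ceph/external-ceph.keyring so the VM nodes can authenticate against the external cluster.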
 
If Ceph is separated from the VM nodes, how does it communicate? On the cluster network or the Ceph network?
 

This is good to hear. I knew it was a limitation of corosync and should have worded that better. I think 32 nodes would be enough for what we are looking to do. I will be reporting back shortly with my findings.
 
As a side note: don't go above size=4 (replication count), as doing so introduces a major slowdown on writes, because a write is only acknowledged after all copies have been written.

Also, you actually CAN have a fully automatic setup with just 2 locations for Ceph. This is a little complicated, but what you'd do is pick one of the two locations to be your primary. In this location you run a monitor in a VM. There will be an identical VM in your other location as well. Both VMs sync the mon's datastore via DRBD. Pacemaker on top of that will automatically start the mon on the secondary node (which has an up-to-date mon store) if the link should go down, providing you with a majority of Ceph monitors up in your secondary location.

Full disclosure: this is not actually my idea. It has been mentioned here: http://www.sebastien-han.fr/blog/2013/01/28/ceph-geo-replication-sort-of/
 

This sounds great, but our management has a terrible taste in their mouths from DRBD. We used it in our previous HA setup before we moved to Proxmox with central storage, and honestly it was a nightmare for our support people. From what I am reading, I should have no issue doing one large cluster without DRBD in the mix, as we will have an enormous pipe between the two locations. I like DRBD, but I would get sent right back to the drawing board if I tried presenting it. I really appreciate the input though!
 
Not the first time I've heard of people not liking DRBD; can't blame them.

The speed of the connection between your datacenters is completely irrelevant for this issue, though. You could either have an equal number of monitors in each datacenter, or have 1 more mon than that in one location. Either way, if the connection between both locations fails, you either completely lose quorum or one of the two sides becomes inoperable. Ideally you REALLY REALLY want a third location to run one monitor. This "location" can be a VM hosted by some VPS provider or whatever, just so you have an external view if you lose the direct connection between both your datacenters.
 

Yeah, DRBD is a tough one. I personally like it, but because we have 100+ support people who need to understand it, it becomes far more difficult.

I would like to avoid the 3rd site if possible; I would actually prefer to make the failover happen manually if one of the sites were to go down (I will need to bring up my router at the other site anyway so it can start doing its BGP advertisement). It seems I should be able to have 1 extra monitor at each site which I would bring up in the case of the other site going down. I want the one site to be the primary for the most part and to function properly if an issue came up with the connection between the two.
 
