Few Ceph questions

Much appreciated. I'm actually currently looking at implementing such a metro cluster for two separate customers of mine who have asked for this to be done.

The required crushmap alterations should be as simple as introducing datacenter buckets to separate the hosts. It is also possible to define CRUSH rules that place the primary OSD for all PGs at your main site.
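Roughly speaking, the bucket section of the decompiled crushmap would grow entries along these lines (dc-1/dc-2, the host names, the ids and the weights are just placeholders for illustration, adjust them to your own layout):

Code:
datacenter dc-1 {
    id -10          # unique negative id
    alg straw
    hash 0          # rjenkins1
    item cephnode1 weight 1.000
    item cephnode2 weight 1.000
    item cephnode3 weight 1.000
}
datacenter dc-2 {
    id -11
    alg straw
    hash 0          # rjenkins1
    item cephnode4 weight 1.000
    item cephnode5 weight 1.000
    item cephnode6 weight 1.000
}
root default {
    id -1
    alg straw
    hash 0          # rjenkins1
    item dc-1 weight 3.000
    item dc-2 weight 3.000
}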

No problem.

Yeah, I've got the datacenter split part worked out; I just need to pin down how to make my main site the preferred location. I was running into all kinds of issues when splitting into datacenters until I set this:

ceph osd crush tunables optimal
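
For the "preferred location" part, one knob that might be worth testing (assuming a recent enough Ceph release) is OSD primary affinity: lowering it on the OSDs at the remote site should make CRUSH prefer the main-site OSDs as primaries. A rough sketch, with the OSD ids as placeholders:

Code:
# older releases need "mon osd allow primary affinity = true" in ceph.conf first
ceph osd primary-affinity osd.3 0
ceph osd primary-affinity osd.4 0
ceph osd primary-affinity osd.5 0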
 
Yep, everything syncs up without any issues. Obviously I had to set "pvecm expected 1" so I could get write access, but other than that, as soon as the other nodes come up, everything syncs and they all match. From what I can tell, there should be no issue keeping a good copy of the monmap in /etc/pve.

Moving on to the crush maps!
 
Well, now that I've figured out how to see which objects belong to which OSDs, I am making some solid progress. I am running with 2 replicas, one copy at datacenter-1 and one copy at datacenter-2. The rule is quite simple and keeps a copy at each datacenter. I tried the examples in that link and no matter what, I ended up with both copies at datacenter-1. I will say, though, that my test configuration is quite simple: 6 nodes with only 1 OSD per node. I am thinking that if I had more OSDs per node, this current rule might not work the same.

I'm also looking for input on using only 2 replicas. When pitching this to management, their largest concern is the number of copies; they can't see a reason to keep 3 or 4 copies as it eats up space. Any thoughts on keeping only 2 replicas?

# rules
rule dc {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type datacenter
step emit
}

The above rule with my current setup is placing a replica at each datacenter.
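
In case it helps anyone following along, the usual round trip for getting an edited rule like this into the cluster looks like this (file names are arbitrary):

Code:
ceph osd getcrushmap -o crushmap.bin        # grab the current compiled map
crushtool -d crushmap.bin -o crushmap.txt   # decompile it to text
# edit crushmap.txt (buckets/rules), then recompile and inject it:
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new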

I've said it a few times, but I really appreciate everyone's input!
 
It's a matter of risk analysis. If you want to be able to mitigate the failure of a disk while one of your datacenters is down then you need 3 copies.

BTW, the remainder of the thread I linked talks about dry-testing the CRUSH ruleset with crushtool. Also, the rule Sage posted only works for 3 copies:
The pool size (replication factor) is 3, so RADOS will just use the first three (2 hosts in first rack, 1 host in second rack).
 
Good point! I think it would make sense to keep 2 at our main location and only 1 at the off-site location. This way, if our off-site location is down for whatever reason and we lose a disk/OSD, we will still have a good copy. Our off-site is really for extreme issues, e.g. fire or tornado.
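
If we go that route, the pool itself would just need size 3 (and presumably min_size 2, so I/O keeps flowing as long as two of the three copies are up). Something like this, using the "Ceph" pool from my earlier tests:

Code:
ceph osd pool set Ceph size 3       # 3 replicas total: 2 at the main site, 1 off site
ceph osd pool set Ceph min_size 2   # require 2 active replicas before accepting I/O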
 
I have been toying around with the Ceph crush map rules. When trying to use the below rule, all my monitors die and the cluster loses quorum. This rule is similar to the one in the post provided by mo_, except that I am trying to do it by datacenter and not by rack.

rule dc {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type host
step emit
}

After lots of trial and error, I came up with this rule, which is a bit closer but still not quite right. I end up with 2 copies in one datacenter, but I would prefer that it be the 1st datacenter. I'm unsure how to work around this.

# rules
rule dc {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type datacenter
step chooseleaf firstn -1 type host
step emit
}

I'm still trying to comprehend how these rules work, but I am having a heck of a time wrapping my head around the concept.

Another thing I find interesting is determining which objects are located on which OSDs. This is what I am doing.

Determine the object name.

root@cephnode1:/etc/pve# rados -p Ceph ls | grep vm
rbd_id.vm-101-disk-1
rbd_id.vm-100-disk-1

As you can see, I have two VM disks. I can then determine the location of those by doing the following:

root@cephnode1:/etc/pve# ceph osd map Ceph rbd_id.vm-100-disk-1
osdmap e359 pool 'Ceph' (5) object 'rbd_id.vm-100-disk-1' -> pg 5.2ef8a3ea (5.ea) -> up ([2,1,3], p2) acting ([2,1,3], p2)
root@cephnode1:/etc/pve# ceph osd map Ceph rbd_id.vm-101-disk-1
osdmap e359 pool 'Ceph' (5) object 'rbd_id.vm-101-disk-1' -> pg 5.512a6f54 (5.54) -> up ([1,0,3], p1) acting ([1,0,3], p1)
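
I can also cross-check the same placement from the PG side, e.g. for the 5.ea placement group from the first mapping above:

Code:
ceph pg map 5.ea    # shows the up/acting OSD set for that placement group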

What throws me off is the fact that I can do the same for things which are non-existent, and it still provides output as if they exist.

root@cephnode1:/etc/pve# ceph osd map Ceph rbd_id.vm-104-disk-5
osdmap e359 pool 'Ceph' (5) object 'rbd_id.vm-104-disk-5' -> pg 5.63f06384 (5.84) -> up ([3,5,1], p3) acting ([3,5,1], p3)

vm-104-disk-5 doesn't even exist, yet it provides a mapping as if it does. Just odd.
 
Whoa, wait a minute. You're confusing RBDs and objects. Objects in Ceph are 4 MB in size by default; one RBD therefore consists of many objects.

Even though this is fairly pointless since an RBD consists of so many objects that it is almost guaranteed to make use of all the placement groups (effectively making "ceph pg dump" your distribution display), what you'd have to do to see the distribution of an RBD is:

rbd -p poolname info rbdname

This will say (amongst other things) something like: block_name_prefix: rb.0.e

You can then do:

rados -p poolname ls | grep ^rb.0.e

to see all the object names, and THEN you could use "ceph osd map poolname objectname" to see each one's placement.
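
Put together, and using the "Ceph" pool from earlier in the thread as an example, that would look roughly like:

Code:
rbd -p Ceph info vm-100-disk-1                     # note the block_name_prefix, e.g. rb.0.e
rados -p Ceph ls | grep '^rb.0.e' | while read obj; do
    ceph osd map Ceph "$obj"                       # PG and OSD set for each object of the RBD
done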

Also, the reason "ceph osd map" works on non-existent objects is that that's just the way Ceph works. The CRUSH algorithm takes the object name and the crush map as inputs, and the output is a placement group. That's why Ceph doesn't need any lookup tables; object placement is determined by a deterministic function.
 
Ahhh ok I see what you are saying. Added to my notes as this is good info.
 
Woot finally got a rule working (I think).

rule dc {
ruleset 0
type replicated
min_size 1
max_size 10
step take dc-1
step chooseleaf firstn 2 type host
step emit
step take dc-2
step chooseleaf firstn 1 type host
step emit
}

I haven't had a chance to see how it handles failover yet, but so far it's keeping 2 copies in my first datacenter and 1 copy in my other. Still waiting for my brain to just "click" and understand these rules. I am getting there, but man, the documentation is missing a lot of details, IMO.

I don't think this rule will work for my production setup at all, as I will have more than 1 OSD per host and I want to prevent the 2 copies at dc-1 from being located on the same host.
 
It'd probably make sense to follow up on the mailing list thread I linked and ask there why it's not working for you, since I can't see a reason why the mons should die from an altered crush map unless it's got syntax errors. Also, as I mentioned, a later post in that thread suggests testing the crush map/rules with
Code:
crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep 3
so that's most likely a much faster way to test things than seeing how the cluster acts with it in place.
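
The crushmap file there is the compiled map (ceph osd getcrushmap -o crushmap, or the output of crushtool -c on your edited text map), and --rule should match your ruleset number, which is 0 in your rules. --show-mappings is also handy, since it prints the actual OSD sets chosen for each sample instead of just utilization:

Code:
crushtool -i crushmap --test --show-mappings --rule 0 --min-x 1 --max-x 10 --num-rep 3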
 
I agree; I am going to get a post going over there today. I am pretty sure it's because I only have 1 OSD per node; if I had 2 or more OSDs per node, I think this rule would work.
 
