Lemme try to break this down (if I make a mistake, someone please correct me).
You probably know what placement groups are, right?
If not, read here for in-depth info:
http://docs.ceph.com/docs/master/rados/operations/placement-groups/
- They store objects based on your CRUSH algorithm.
- They are spread over the OSDs in your cluster according to the CRUSH rule that applies to the pool.
- The more objects you read/write at the same time, the higher the chance they are evenly spread over the PGs defined by CRUSH.
Examples, based on a 4-node cluster with 4 OSDs each:
- CRUSH rule host: 256 PGs - 64 per host, 16 per OSD.
- CRUSH rule OSD: 256 PGs - spread over all 16 OSDs, 16 per OSD.
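For picking the PG count in the first place, the rule of thumb from the Ceph docs is (number of OSDs x 100) / pool size, rounded up to the next power of two - the 256 above is just to keep the per-host/per-OSD math simple. A minimal sketch ("mypool" is a made-up name):
Code:
# rule of thumb: (16 OSDs * 100) / size 4 = 400 -> round up to 512 PGs
ceph osd pool create mypool 512 512 replicated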
Regarding speed:
Let's say you do high amounts of reads/writes - the fewer PGs that share the same OSD, and the more OSDs that serve PGs from your pool, the more speed you get.
There are basically 3 types of pools:
- Replicated (simplified: think RAID1 with X drives)
- size (think the number of times an object is stored - including the original - i.e. the replica count)
- min_size (think the minimum number of replicas that need to be available to still serve I/O)
- Erasure coded (simplified: think RAID5 with Y parity chunks)
- k = the number of data chunks each object is split into (think the number of OSDs the data will be striped across)
- m = the number of parity chunks calculated on top (think how many OSDs can be lost)
- Cache pool / cache tier (simplified: think of the 2 GB of DDR2 RAM on a RAID controller sitting in front of your RAID1 or RAID5)
Examples, based on a 4-node cluster with 4 OSDs (1 TB) each:
- Replicated
- CRUSH rule host, size=4, min_size=1 - data is replicated 4 times on 4 hosts, 3 hosts can fail - 4 TB usable space - 400% overhead
- CRUSH rule host, size=4, min_size=2 - data is replicated 4 times on 4 hosts, 2 hosts can fail (before I/O pauses) - 4 TB usable space - 400% overhead
- CRUSH rule host, size=3, min_size=1 - data is replicated 3 times on 3 hosts, 2 hosts can fail - 5.3 TB usable space - 300% overhead
- CRUSH rule OSD, size=4, min_size=1 - data is replicated 4 times on 4 OSDs, 3 OSDs can fail - 4 TB usable space - 400% overhead
- CRUSH rule OSD, size=3, min_size=1 - data is replicated 3 times on 3 OSDs, 2 OSDs can fail - 5.3 TB usable space - 300% overhead
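Setting those knobs on a real pool is straightforward. A sketch (I'm reusing the pool name from my benchmark script at the end; 256 PGs as in the examples above):
Code:
ceph osd pool create HDD-Replication3 256 256 replicated
ceph osd pool set HDD-Replication3 size 3
ceph osd pool set HDD-Replication3 min_size 1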
- Erasure coded - slower by nature, since objects need to be split into k pieces and then m parity parts calculated.
- CRUSH rule host, k=1, m=3 - object is spread on 4 hosts, 3 hosts can fail - 4 TB available - 400% overhead
- CRUSH rule host, k=2, m=2 - object is spread on 4 hosts, 2 hosts can fail - 8 TB available - 200% overhead
- CRUSH rule OSD, k=14, m=2 - object is spread on 16 OSDs, 2 OSDs can fail - 14 TB available - 114% overhead
- CRUSH rule OSD, k=12, m=4 - object is spread on 16 OSDs, 4 OSDs can fail - 12 TB available - 133% overhead
- CRUSH rule OSD, k=11, m=5 - object is spread on 16 OSDs, 5 OSDs can fail - 11 TB available - 145% overhead
- CRUSH rule OSD, k=9, m=7 - object is spread on 16 OSDs, 7 OSDs can fail - 9 TB available - 177.8% overhead
- CRUSH rule OSD, k=4, m=12 - object is spread on 16 OSDs, 12 OSDs can fail - 4 TB available - 400% overhead
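For an EC pool, k and m live in an erasure-code profile that you create before the pool. A rough sketch (profile and pool names are made up; on older releases the failure-domain key is called ruleset-failure-domain instead of crush-failure-domain):
Code:
# profile for the k=11/m=5 example, failure domain = osd
ceph osd erasure-code-profile set ec-11-5 k=11 m=5 crush-failure-domain=osd
ceph osd pool create my-ec-pool 256 256 erasure ec-11-5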
Now, that said:
You can run multiple types of pools in tandem on the same cluster; CRUSH will take care of the placement (a command sketch follows below).
- So you could do a replicated pool with CRUSH rule host and size=4.
- And at the same time an erasure coded pool with CRUSH rule OSD, k=11 and m=5.
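Creating the rules and hanging a pool off one of them looks roughly like this (rule and pool names are mine; EC pools get their rule from the erasure-code profile instead):
Code:
# create-simple takes: rule name, CRUSH root, failure-domain bucket type
ceph osd crush rule create-simple rule-host default host
ceph osd crush rule create-simple rule-osd default osd
ceph osd pool create pool-repl 256 256 replicated rule-host
ceph osd pool set pool-repl size 4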
My recommendation:
There is something else which may be useful for your cluster. Your SSHD benches from before with --numjobs=2 tell me that 4x SSHD with the journal on the same disk are as fast as or faster than 4x SSHD with the journal on an SSD. And since udo said that your type of SSD is probably not the best journal device because of the high wear (when 4x the amount of data gets written), I would suggest splitting your spinning disks from your non-spinning disks.
You'd then basically have the following:
4x Nodes like this:
4x SSHD-OSD
1x SSD-OSD
Or in Total:
16x SSHD-OSD
4x SSD-OSD
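To make CRUSH aware of the SSD/SSHD split: on a release with device classes (Luminous or newer) it's a few commands; on older releases you'd edit the CRUSH map by hand and give the SSDs their own root. A sketch using the rule names from my recommendation below (the SSHDs report as class hdd):
Code:
# replicated rules restricted to one device class
ceph osd crush rule create-replicated SSD-OSD default osd ssd
ceph osd crush rule create-replicated HDD-host default host hdd
ceph osd crush rule create-replicated HDD-OSD default osd hdd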
If you do this, then based on your requirement to be able to lose at minimum 4 disks or 1 node at the same time, I'd do the following pools if I were you (ranked in order of speed):
- Cache pool, for the EC pool: CRUSH rule SSD-OSD, replicated, size=4 (tiering commands sketched below)
- Fast pool, e.g. for OS disks: CRUSH rule SSD-OSD, replicated, size=4
- Medium speed pool: CRUSH rule HDD-host, replicated, size=4 - 400% overhead, for a medium amount of data
- Slow pool: CRUSH rule HDD-OSD, erasure coded, k=11, m=5 - 145% overhead, for all the big files that do not get accessed that often, e.g. images/videos/documents. m=5 because it will handle 1 node + 1 OSD going down, or 5 OSDs.
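Attaching the cache pool to the EC pool is its own set of tier commands. A minimal writeback sketch, assuming the pools are named SSD-Cache and HDD-EC-Pool (you'd still want to set target_max_bytes and friends, see the cache tiering docs):
Code:
ceph osd tier add HDD-EC-Pool SSD-Cache
ceph osd tier cache-mode SSD-Cache writeback
ceph osd tier set-overlay HDD-EC-Pool SSD-Cache
# a cache tier needs hit set tracking to decide what stays hot
ceph osd pool set SSD-Cache hit_set_type bloom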
Side note: I am not sure about the "size" value for your SSD pools. I put it at 4 (as in: 3 OSDs can fail and you still have the data). What I do not know is what happens on an EC pool with a cache-tier pool attached: if the data is already on your EC pool, gets promoted to the cache pool, and you then lose all the SSD-OSDs assigned to it. Someone with more experience would need to tell me. I know for a fact that if you have e.g. size=1 on an initial write and you lose the SSD-OSD before the data gets flushed, it is lost.
Based on my experience with my single-node cluster with 2x SSD and 16 HDD, and taking your use of SSHDs instead of HDDs into account, I estimate your cached EC pool speeds at somewhere around 50% of your 4x replicated pools.
This would basically give you the best bang for your buck.
As soon as your management approves more funds, you can scale that cluster by adding machines like the ones above. In addition to the obvious advantage of more capacity, you also get more fault tolerance and, in the case of the EC pools, less overhead (the overhead factor is (k+m)/k = 1 + m/k, and more OSDs let you raise k relative to m).
You'd also get more speed advantages from your SSD cache pools with rising SSD numbers (compare
http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-scalable-cache/ - look at the local hit explanation and the comment section regarding local hits).
Oh, and by the way: benchmark, benchmark, benchmark.
Code:
# write for 300s and keep the objects so the read test has data
rados bench -p SSD-Replication1 300 write --no-cleanup
# sequential read of the objects written above
rados bench -p SSD-Replication1 300 seq
# remove the benchmark objects; run cleanup twice with a pause to catch stragglers
rados -p SSD-Replication1 cleanup --prefix bench
sleep 60
rados -p SSD-Replication1 cleanup --prefix bench
# let the cluster settle before the next pool
sleep 120
# same procedure for the replicated HDD pool
rados bench -p HDD-Replication3 300 write --no-cleanup
rados bench -p HDD-Replication3 300 seq
rados -p HDD-Replication3 cleanup --prefix bench
sleep 60
rados -p HDD-Replication3 cleanup --prefix bench
sleep 120
# and for the erasure coded pool
rados bench -p HDD-EC_Failure-3 300 write --no-cleanup
rados bench -p HDD-EC_Failure-3 300 seq
rados -p HDD-EC_Failure-3 cleanup --prefix bench
sleep 60
rados -p HDD-EC_Failure-3 cleanup --prefix bench
Just my 2 dollars, 22 cents.