Newbie needs your input

hi guys, thanks for your great explanation and input. This definitely helps.

To avoid data loss, a consumer SSD with its short service life might not be an option now.
But then again, how difficult is it to replace or rebuild a faulty SSD that is used as a journal?

Replication will be on a host basis instead of an OSD basis; yes, just like Q-wulf mentioned, data could reside on the same host, which shares the same journal and OSD pool.
3 replicas across 4 nodes is, I think, much better than 2 replicas.
I think standard storage should be sufficient for a newbie like me. If I go for erasure coding, k and m should both be 3, splitting data into 3 parts with a maximum allowable loss of 3. I have no idea about the cache tier that is required here.

The Seagate Laptop SSHD 1TB measured just ~100MB/s read/write. I think it is good enough to use as backing storage with an SSD journal in front.
But I am not sure it is suitable for hosting multiple VMs like those hosting companies out there offering Proxmox cloud VMs. Since this is in-house, we have total control over the setup.

So, assuming the SSD will be used as journal, with 16 OSDs of 1TB SSHD each (4 OSDs per node), the effective storage left is only about 5.33TB...

(16 * 1TB) / 3 replicas = 5.33TB

Please correct me if I am wrong.

Again, is it worth it?
 
Lemme try and break this down (if I make a mistake, someone please correct me)


You probably know what placement groups are, right?
If not, read here for in-depth info: http://docs.ceph.com/docs/master/rados/operations/placement-groups/
  • They store objects based on your CRUSH algorithm.
  • They are spread over the OSDs in your system that apply to the PG, based on CRUSH.
  • The more objects you read/write at the same time, the higher the chance they are evenly split over the PGs defined by CRUSH.

Examples based on a 4-node cluster with 4 OSDs each (see the commands below for how to check this):
  1. CRUSH rule host: 256 PGs - 64 per host, 16 per OSD.
  2. CRUSH rule OSD: 256 PGs spread over all 16 OSDs in the cluster - 16 per OSD.
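
If you want to see how that plays out on a real cluster, a minimal sketch (the pool name is made up; the PGS column of ceph osd df needs a reasonably recent release, Hammer or later):

Code:
# create a test pool with 256 placement groups (pool name is just an example)
ceph osd pool create testpool 256 256 replicated
# show how many PGs ended up on each OSD (see the PGS column)
ceph osd df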

Regarding speed:
Let's say you do high amounts of reads/writes; you get more speed the fewer PGs reside on the same OSD and the more OSDs service the PGs of your pool.


There are basically 3 types of pools (example commands for the first two follow below):
  • Replication (simplified: think RAID1 with X amount of drives)
    • You have a size (think the number of times an object is stored, including the original, i.e. the replicas)
    • and a min_size (think the minimum number of replicas that need to be available to still provide I/O).
  • Erasure Code (simplified: think RAID5 with Y amount of parity bits)
    • k = number of data chunks the object is split into (think the number of OSDs the data will be stored on)
    • m = number of parity chunks provided (think how many OSDs can be lost).
  • Cache Pool / Cache tier (simplified: think of the 2 GB DDR2 RAM on a RAID controller in front of your RAID1 or RAID5).
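
To make the parameter names concrete, a minimal sketch of how they map to the ceph CLI (pool and profile names are made up):

Code:
# replicated pool: size = copies kept, min_size = copies needed to keep serving I/O
ceph osd pool create repl-demo 128 128 replicated
ceph osd pool set repl-demo size 3
ceph osd pool set repl-demo min_size 1

# erasure code: k and m live in a profile that the pool is then created from
ceph osd erasure-code-profile set demo-profile k=2 m=2
ceph osd erasure-code-profile get demo-profile
ceph osd pool create ec-demo 128 128 erasure demo-profile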


Examples, based on a 4-node cluster with 4 OSDs (1TB) each (a quick way to recompute these numbers is sketched below):
  1. Replicated
    1. CRUSH rule host. Size=4, min_size=1 - data is replicated 4 times on 4 hosts, 3 hosts can fail - 4 TB available space - 400% overhead
    2. CRUSH rule host. Size=4, min_size=2 - data is replicated 4 times on 4 hosts, 2 hosts can fail - 4 TB available space - 400% overhead
    3. CRUSH rule host. Size=3, min_size=1 - data is replicated 3 times on 3 hosts, 2 hosts can fail - 5.33 TB available space - 300% overhead
    4. CRUSH rule OSD. Size=4, min_size=1 - data is replicated 4 times on 4 OSDs, 3 OSDs can fail - 4 TB available space - 400% overhead
    5. CRUSH rule OSD. Size=3, min_size=1 - data is replicated 3 times on 3 OSDs, 2 OSDs can fail - 5.33 TB available space - 300% overhead
  2. Erasure Code - slower by nature, since objects need to be split into k pieces and then m parity parts calculated.
    1. CRUSH rule host. k=1, m=3 - object is spread over 4 hosts, 3 hosts can fail - 4 TB available - 400% overhead
    2. CRUSH rule host. k=2, m=2 - object is spread over 4 hosts, 2 hosts can fail - 8 TB available - 200% overhead
    3. CRUSH rule OSD. k=14, m=2 - object is spread over 16 OSDs, 2 OSDs can fail - 14 TB available - 114% overhead
    4. CRUSH rule OSD. k=12, m=4 - object is spread over 16 OSDs, 4 OSDs can fail - 12 TB available - 133% overhead
    5. CRUSH rule OSD. k=11, m=5 - object is spread over 16 OSDs, 5 OSDs can fail - 11 TB available - 145% overhead
    6. CRUSH rule OSD. k=9, m=7 - object is spread over 16 OSDs, 7 OSDs can fail - 9 TB available - 177.7% overhead
    7. CRUSH rule OSD. k=4, m=12 - object is spread over 16 OSDs, 12 OSDs can fail - 4 TB available - 400% overhead
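
A quick way to recheck the available-space and overhead numbers above on the shell (plain arithmetic, nothing Ceph-specific; 16x 1TB OSDs assumed as in the examples):

Code:
#!/bin/sh
RAW=16                                  # 16x 1TB OSDs = 16 TB raw
# replicated pool: usable = raw / size
echo "replicated, size=3: $(echo "$RAW / 3" | bc -l) TB usable"
# erasure coded pool: usable = raw * k / (k + m), overhead = (k + m) / k
K=11; M=5
echo "EC k=$K m=$M: $(echo "$RAW * $K / ($K + $M)" | bc -l) TB usable"
echo "EC k=$K m=$M: $(echo "100 * ($K + $M) / $K" | bc -l) % overhead"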


Now, that said:
You can run multiple types of pools in tandem on the same cluster; CRUSH will take care of the placement. For example (see the commands below):

  1. You could do a CRUSH rule host, replicated, size=4 pool.
  2. You could do a CRUSH rule OSD, erasure-coded, k=11 / m=5 pool.
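
A sketch of what those two pools could look like side by side (pool and profile names are made up; the EC profile option is called ruleset-failure-domain on Hammer-era releases and crush-failure-domain on newer ones):

Code:
# 1. replicated pool, default host-level crush rule, 4 copies
ceph osd pool create repl-4 256 256 replicated
ceph osd pool set repl-4 size 4

# 2. erasure-coded pool, failure domain per OSD, k=11 / m=5
ceph osd erasure-code-profile set ec-11-5 k=11 m=5 ruleset-failure-domain=osd
ceph osd pool create ec-11-5-pool 256 256 erasure ec-11-5

# both live on the same cluster; check their settings
ceph osd dump | grep pool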




My recommendation:
There is something else which may be useful for your cluster. Your SSHD benchmarks from before with --numjobs=2 tell me that 4x SSHD with the journal on the same disk are about as fast as, or faster than, 4x SSHD with the journal on a shared SSD. And udo said that your type of SSD is probably not the best journal device because of high wear (when 4x the amount of data gets written). So I would suggest splitting your spinning disks from your non-spinning disks.

You'd then basically have the following:

4x Nodes like this:
4x SSHD-OSD
1x SSD-OSD

Or in Total:
16x SSHD-OSD
4x SSD-OSD

If you do this, and if I were you, based on your requirement to make sure you can lose at minimum 4 disks or 1 node at the same time, I'd do the following pools (ranked in order of speed; a rough creation sketch follows below):



  1. Cache pool for the EC pool: CRUSH rule SSD-OSD, replicated, size=4
  2. Fast pool, e.g. for OS disks: CRUSH rule SSD-OSD, replicated, size=4
  3. Medium-speed pool: CRUSH rule HDD-host, replicated, size=4 - 400% overhead, for a medium amount of data
  4. Slow pool: CRUSH rule HDD-OSD, erasure-coded, k=11, m=5 - 145% overhead, for all the big files that do not get accessed that often, e.g. images/videos/documents. m=5 because it will handle 1 node + 1 OSD going down, or 5 OSDs.
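
Only as a rough sketch: assuming the SSD-OSDs and SSHD-OSDs have already been split into separate CRUSH hierarchies with their own rules (called ssd-rule and hdd-host-rule here, both made-up names, plus a made-up "hdd" root bucket), the four pools could be created along these lines; option names follow the Hammer-era CLI (ruleset-failure-domain / ruleset-root, called crush-* on newer releases):

Code:
# 1 + 2: SSD-backed replicated pools (cache pool and fast pool)
ceph osd pool create cache-pool 128 128 replicated ssd-rule
ceph osd pool create fast-pool 128 128 replicated ssd-rule
ceph osd pool set cache-pool size 4
ceph osd pool set fast-pool size 4

# 3: medium-speed pool on the SSHDs, replicated per host
ceph osd pool create medium-pool 256 256 replicated hdd-host-rule
ceph osd pool set medium-pool size 4

# 4: slow EC pool on the SSHDs, failure domain per OSD; keeping it on the SSHD
#    branch is done via the profile (ruleset-root, with "hdd" as a made-up bucket name)
ceph osd erasure-code-profile set ec-11-5-hdd k=11 m=5 ruleset-failure-domain=osd ruleset-root=hdd
ceph osd pool create slow-pool 256 256 erasure ec-11-5-hdd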


Side note: I am not sure about the "size" value for your SSD pools. I put it at 4 (as in: 3 OSDs can fail and you still have the data). However, I do not know what happens on an EC pool with a cache-tier pool attached to it if the data is already on your EC pool, gets moved to the cache pool, and then you lose all the SSD-OSDs assigned to it. Someone with more experience would need to tell me. I know for a fact that if you have, e.g., size=1 on an initial write and you lose the SSD-OSD before the data gets written out, it is lost.

Based on my experience with my single-node cluster with 2x SSD and 16 HDD, and taking your use of SSHDs instead of HDDs into account, I estimate your cached EC pool speeds at somewhere around 50% of your 4x replicated pools.


This would basically give you the best bang for your buck.

As soon as your management approves more funds, you can scale that cluster by adding machines like the one above. In addition to the obvious advantage of more capacity, you also get more fault tolerance and, in the case of the EC pools, less overhead (up to a factor of 1+m/k).
You'd also get more speed out of your SSD cache pools with rising SSD numbers (compare http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-scalable-cache/ - look at the local-hit explanation and the comment section regarding local hits).

oh and btw - Benchmark, benchmark, benchmark :p

Code:
# 300-second write benchmark; --no-cleanup keeps the objects around for the read test
rados bench -p SSD-Replication1 300 write --no-cleanup
# sequential read benchmark, using the objects left behind by the write test
rados bench -p SSD-Replication1 300 seq
# remove the benchmark objects again (object names start with "bench")
rados -p SSD-Replication1 cleanup --prefix bench
sleep 60
rados -p SSD-Replication1 cleanup --prefix bench
sleep 120
rados bench -p HDD-Replication3 300 write --no-cleanup
rados bench -p HDD-Replication3 300 seq
rados -p HDD-Replication3 cleanup --prefix bench
sleep 60
rados -p HDD-Replication3 cleanup --prefix bench
sleep 120
rados bench -p HDD-EC_Failure-3 300 write --no-cleanup
rados bench -p HDD-EC_Failure-3 300 seq
rados -p HDD-EC_Failure-3 cleanup --prefix bench
sleep 60
rados -p HDD-EC_Failure-3 cleanup --prefix bench

just my 2 Dollar, 22 Cents.
 
Hi Q-wulf, thanks for your detailed breakdown and explanation, much appreciated!
Let me take down some of the nodes to do a test run with your recommendation and come back with some results.
 
Something I forgot to mention regarding erasure-coded pools:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041442.html
You can't use erasure coded pools directly with RBD. They're only suitable
for use with RGW or as the base pool for a replicated cache pool, and you
need to be very careful/specific with the configuration. I believe this is
well-documented, so check it out! :


If you wanna plug your EC pool straight into the (k)rbd storage plugin, you have to add a replicated "cache pool" in front of your "storage pool" (the commands are sketched below).
^^It can be SSD or HDD or SSHD - as long as you stick one in front of it.
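
The cache-tier wiring itself is only a handful of commands; a minimal sketch with made-up pool names (ec-pool being the EC pool, cache-pool the replicated pool in front of it):

Code:
# attach the replicated cache pool in front of the EC pool and switch it to writeback
ceph osd tier add ec-pool cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay ec-pool cache-pool
# the cache tier needs a hit set so it can track which objects are being used
ceph osd pool set cache-pool hit_set_type bloom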

I just ran into the same issue (as I broke my SSD and wanted to run my EC pool without a cache - never had that scenario before).


PS: let me know how your benchmarks turn out, I'm rather curious about it.
 
[...]
Update:

Single SSHD without hardware RAID

--numjobs=1 bw=448914 B/s, iops=109
--numjobs=2 bw=2347.7KB/s, iops=586
--numjobs=3 bw=3246.3KB/s, iops=811
--numjobs=4 bw=4262.3KB/s, iops=1065
--numjobs=5 bw=4028.7KB/s, iops=1007
--numjobs=6 bw=4242.5KB/s, iops=1060
--numjobs=7 bw=4472.1KB/s, iops=1118

3.) Is the speed and IOPS of "--numjobs=4" on your SSD greater than the aggregated speed and IOPS of "--numjobs=2" on your 4x SSHD?
^^This tells you whether it makes sense to use the SSD for the 4x SSHD, or whether you'd bottleneck your SSHDs.
>>>> I just did a quick test on 4x SSHD with an LSI MegaRAID 9260 in a RAID10 setup. Maybe it is not relevant or doesn't make sense, but I'll just share some info here:

--numjobs=1 bw=63462KB/s, iops=15865
--numjobs=2 bw=50160KB/s, iops=12540
--numjobs=3 bw=84969KB/s, iops=21242
--numjobs=4 bw=124261KB/s, iops=31065
--numjobs=5 bw=151194KB/s, iops=37798
--numjobs=6 bw=163376KB/s, iops=40843


Let's see what you can interpret from these results (the fio command pattern behind them is sketched below).
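
For reference, the kind of fio run that produces this bw/iops output looks roughly like the following (test file path, size and job count are placeholders, not the exact command used):

Code:
# hypothetical fio invocation - adjust filename, block size and numjobs to match your test
fio --name=randwrite-test --filename=/path/to/testfile --size=1G \
    --rw=randwrite --bs=4k --direct=1 --ioengine=libaio \
    --numjobs=4 --runtime=60 --time_based --group_reporting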

Regarding 3.):
You, my friend, just benchmarked the 512MB 800MHz DDR2 SDRAM cache on your RAID controller :)



Regarding the single SSHD benchmark:
If I am not mistaken, the initial question was whether you'd gain a speed advantage by going with 4 SSHDs backed by a single journal on the SSD, or by running a journal for each SSHD directly on the SSHD.

From the benchmark I'd say you're better off with the journal on the SSD.
Lemme put it into numbers:

relevant benchmark of your SSD:
--numjobs=4 bw=66876KB/s, iops=16719

Relevant benchmark of your SSHD:
--numjobs=2 bw=2347.7KB/s, iops=586

Compare the numbers:
16719 IOPS vs 4x586 = 2344 IOPS.
in my book there is no question which way to go.
 
