Considering a Ceph Cluster

greavette

Hello,

We have two standalone Proxmox servers with local storage to run our VMs. I currently do nightly backups to our local NAS for recovery. This obviously leaves a lot of holes if we lose a server and need to rebuild from a backup. I'm looking to improve our situation, but cost is a limiting factor.

I've been able to repurpose three older servers, each with an 8-bay SATA chassis. Two of the bays sit behind a hardware RAID controller (I'll put the OS on those in RAID 1); the other six can run as JBOD off the motherboard. My plan is to build a 5-node cluster: use my two existing Proxmox servers to host the VMs (they have the CPU and RAM) and put the VM storage on the three 8-bay servers running Ceph. We have a flat gigabit network and, due to funds, no plans to upgrade it at this time.

The three servers are Supermicro boxes with 2x quad-core Intel Xeon CPUs and 32 GB of RAM each, so they should have the horsepower to run Ceph on Proxmox nicely.

Each has eight 1 TB SATA drives (desktop, not enterprise), and each of the three Ceph storage servers will use six of those drives for storage.

I've been reading up on Ceph and Proxmox and it seems dead easy to set up, so I'm about to begin this endeavor and see how it goes. But one concern I've seen here on the forums and while Googling around (Sébastien Han's "Good, Bad and Ugly" write-up on Ceph) is that performance could be an issue. We currently have no performance issues running off local storage, so I'd like the benefit of distributed storage that keeps my VMs safe, but I don't want to impact performance so much that my users notice and complain.

I'd like to get this community's thoughts on whether or not I'm setting myself up for failure with the network and hardware I have at hand. Are there things I can do with the setup I'm planning that would help improve performance? Money is tight at the moment, so purchasing a bigger solution isn't an option at this point in time.

I look forward to any thoughts or suggestions you can provide.

Thank you.
 
Hi,
it looks like you already have everything you need to test your config, so you should test the performance and see if you are happy with Ceph.

Thanks to storage live migration it's easy to test from one node. But after testing, you must join both single nodes to a cluster (the joining node must be empty)!

Udo
 
Hello udo,

Thank you for the reply. I will make sure that whichever of my two host nodes I add to this cluster is empty before it joins. My plan is to move all VMs onto one of my standalone Proxmox servers, leaving the other empty. Then I will create a cluster from the empty standalone node and the three repurposed Supermicro servers. Once it is up and running I will migrate a VM from the remaining standalone node to the new cluster and continue my testing. Once I've confirmed I'm happy, I will move all VMs to the cluster, emptying my first standalone node, and then add that now-empty node to the cluster as well.
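For my own notes, I'm expecting the cluster creation/join steps to look roughly like this from the CLI (the cluster name and IP are just placeholders):

Code:
# on the first (empty) node: create the new cluster
pvecm create mycluster

# on each additional empty node: join it to the existing cluster
pvecm add 192.168.1.10    # IP of the first cluster node

# check quorum and membership
pvecm status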

One further question I have: I'm considering using Seagate desktop (not enterprise) SATA drives for my Ceph cluster - the ST1000DM003. Would anyone care to comment on this type of drive, good or bad? Should I use something else that would be more reliable?

Thank you.
 
Wow! Thanks for the heads up, redmop. I'm glad I'm able to review the Backblaze findings before putting a significant investment into drives that may fail quickly on us.

So what are others recommending as the best drives for the price and warranty period available?

I see the WD WD1002F9YZ offers a 5-year warranty and has a larger buffer (128 MB), but costs more. The WD WD10EZEX (Blue) has the same seek time but a smaller buffer and a 3-year warranty, and is quite a bit cheaper. I also found a Toshiba DT01ACA100 for a decent price, but it has an even smaller buffer and a 2-year warranty.

Oi Vey...the choices can make your head spin!

Any ideas what this community recommends? From what I've read, desktop drives are fine for Ceph, so I shouldn't have to pay for the high-priced enterprise drives in my Ceph cluster.

I see in this post that mir had good things to say about the WD10EZEX -
http://forum.proxmox.com/threads/20655-HDD-for-Ceph-OSD. It seems like a good compromise, a middle-of-the-road desktop drive for Ceph.
 
I would suggest investigating a few things cost wise before you dive in -

As I've learned from personal experience, write speed isn't fantastic with just spinners. With 18 drives it will be OK, but don't expect stellar performance if you use 3 replicas. If you want better write performance you'll need to look at SSD journal drives. I just got my enterprise SSDs in, so I haven't been able to test them yet.

The next thing to think about is reads - with that many drives you will probably saturate your network on reads quite easily. If you can afford it, get some multi-port Ethernet adapters and do some load balancing to get better throughput when multiple nodes are communicating.

You'll also want to modify how Ceph is set up on Proxmox a bit. By default it puts the cluster and the public network on the same LAN, which is a lot of traffic for that small pipe to handle - both client requests and cluster/OSD operations. Having a separate network for the cluster really improved my Ceph performance.
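Splitting them comes down to the public/cluster network settings in ceph.conf; a minimal sketch, with made-up example subnets:

Code:
[global]
    # client traffic: Proxmox hosts talking to monitors/OSDs
    public network = 192.168.10.0/24
    # OSD replication, heartbeat and recovery traffic
    cluster network = 192.168.20.0/24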

In my case I have three Ceph-related networks (cluster, public, and Proxmox management), and all additional Ethernet ports are on other VLANs for the VMs.

I'd suggest setting up a small lab to test it out to get a feel for it before jumping in. At least, that's what I did
 
Hello Nethfel, I appreciate you replying to my questions. What drives are you using in your Ceph cluster? Are they all SSDs, or do you have SATA drives for your OSDs? If SATA, what make have you decided on?

I should have mentioned that my 3 Ceph servers each have 4 NICs, so I will use those to separate the network duties of Ceph talking to my Proxmox hosts and to each other. I was hoping to keep my network flat, but if need be I can look at adding VLANs to our managed switches.

These 3 Ceph servers I'll be using do not have the option (at least I don't believe they do) to add any SSD drives, so for now I'll be forced to keep my journals on my SATA drives. Where should I put these journals then? On my 2 SATA OS drives or on the 6 OSDs? I'm hoping to one day get the funds to upgrade to new servers, where I will look at using SSDs for the journals instead.

Thanks!
 
Hello Nethfel, I appreciate you replying to my questions. What drives are you using in your Ceph cluster? Are they all SSDs, or do you have SATA drives for your OSDs? If SATA, what make have you decided on?

Originally I was going to do all spinners, but after enough time and testing, I decided to go with SSDs for journals and spinners for the OSD data. My SSD selection is a set of Intel DC S3700 100GB drives at ~165 USD each. For now I purchased 1 SSD journal per 3 spinners; the person I corresponded with uses 1 SSD per 2 spinners, but I don't have the kind of budget that could afford that. The Ceph docs talk about 4-5 spinners per SSD, but from the Ceph list, the performance gain isn't enough to warrant the cost at that ratio, plus it allows too many OSDs to fail at once should the SSD die.

My spinner selection is REALLY limited - I work for a school, so as I'm sure you can imagine, my budget is non-existent. I've had to use drives I have plenty of spares of (we made some system purchases about 4 years ago that each came with 2 HDDs, but we never utilized the second HDD, so I've been retasking them), so I'm using 500GB Seagates (ugh - both for size and drive type). The cluster, once rebuilt with the SSDs, will be 18 OSDs, with enough room to grow to 21 OSDs on our 3 Ceph nodes before we have to expand to a 4th or more.

I should have mentioned that my 3 Ceph servers each have 4 NICs, so I will use those to separate the network duties of Ceph talking to my Proxmox hosts and to each other. I was hoping to keep my network flat, but if need be I can look at adding VLANs to our managed switches.

Flat is easy (that was my first test of Ceph through Proxmox), but flat isn't really effective, unfortunately - you will find the network gets saturated if you have active nodes, and heaven forbid you start a rebalance. The nice thing is that two of the VLANs don't need external access, so they can be completely isolated: the cluster network (isolated), the client (aka public) network (isolated), Proxmox management (on its own VLAN by personal preference, but with the ability to reach the outside for management and updates), and the VM networks (however you want to do them).

I don't have all of my numbers here (I'm at home kinda sick atm; I only have write data for 9, 12 and 15 OSDs with 3 replicas, and for 15 OSDs with 2 replicas, 1024 PGs, no SSDs), but I do have some - these tests were done with rados bench:

3 Replicas:
Code:
Write:
9 osds:
Total time run:         61.030614
Total writes made:      1204
Write size:             4194304
Bandwidth (MB/sec):     78.911

12 OSDs:
Total time run:         61.222597
Total writes made:      1212
Write size:             4194304
Bandwidth (MB/sec):     79.186 




15 OSDs:
Total time run:         61.835128
Total writes made:      1532
Write size:             4194304
Bandwidth (MB/sec):     99.102


2 Replicas, 15 OSDs:
Code:
Total time run:         60.639501
Total writes made:      2283
Write size:             4194304
Bandwidth (MB/sec):     150.595

As you can imagine, the write speed increased dramatically with just 2 replicas instead of 3, but 3 is what we will use for most of our pools, as most of our servers are too critical to risk with only 2 replicas. Low-priority VMs will go in the 2x replica pools.
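For what it's worth, splitting pools by replica count is just a per-pool setting; a rough sketch of how a 2x pool for low-priority VMs could be created (the pool name and PG count here are made-up examples):

Code:
# create a pool for low-priority VMs with 512 placement groups
ceph osd pool create lowprio-vms 512
# keep 2 replicas, and keep serving I/O as long as 1 copy is alive
ceph osd pool set lowprio-vms size 2
ceph osd pool set lowprio-vms min_size 1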

Read from 2 replicas, 15 OSDs:
Code:
Stddev Bandwidth:       33.0239
Max bandwidth (MB/sec): 216
Min bandwidth (MB/sec): 0
Average Latency:        0.424944
Stddev Latency:         0.315361
Max latency:            2.38753
Min latency:            0.080402
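For reference, write numbers like the above come from the rados bench write test, and the read numbers from its sequential read test - roughly along these lines (the pool name is just an example, and --no-cleanup keeps the written objects around so the read test has something to read):

Code:
# 60-second write benchmark against a pool named "test"
rados -p test bench 60 write --no-cleanup

# sequential read benchmark against the objects written above
rados -p test bench 60 seq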





These 3 Ceph servers I'll be using do not have the option (at least I don't believe they do) to add any SSD drives, so for now I'll be forced to keep my journals on my SATA drives. Where should I put these journals then? On my 2 SATA OS drives or on the 6 OSDs? I'm hoping to one day get the funds to upgrade to new servers, where I will look at using SSDs for the journals instead.

Thanks!

No, do not do ANY of your data operations (OSDs and related journals) on your OS drives - some of those operations can create a lot of thrashing, which could reduce performance and will definitely shorten the life of the boot drive. Leave the boot drive(s) to do their thing and the OSDs (and potential journal drives) to do theirs. I don't know that there would be any benefit to storing the journal separately, like you're describing, without putting it on an SSD; at that point you'd probably just store it on the same drive being used as the OSD (the default behaviour when Proxmox builds an OSD, unless you change it in the popup or create the OSD manually via the command line). In your case I'd probably just do 6 OSDs per node with the journals stored on the drives and try it out - you may very well get significantly better performance than I do, given the age and type of drives I'm using. Udo may have a better suggestion - I value his input as he has a lot of experience with Ceph.
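A rough sketch of what the default (journal on the same disk) looks like from the command line - the device names are placeholders, so double-check against your own disks:

Code:
# on each Ceph node, create one OSD per data disk; with no separate
# journal device specified, the journal lives on the same disk
pveceph createosd /dev/sdc
pveceph createosd /dev/sdd
# ...and so on for the remaining data disks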

If you have the time, I'd definitely do a trial run with your spare machines and run some benchmarks to see where you sit with different network configurations. Then add a test Proxmox VM node (it could be an antique desktop), set up a VM, and run some benchmarks from within the VM with the OS you'll use most to see where you stand.
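If that test VM is Linux, something like fio gives a quick feel for what the guest actually sees from the storage - the parameters below are just illustrative, not tuned:

Code:
# 60-second 4k random-write test from inside the VM
fio --name=vm-test --filename=/tmp/fio.test --size=2G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting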
 
Thanks for the comments. I will consider using VLANs for our small network now that you've explained this to me.

I'm still getting up to speed on how Ceph works. For now I'll spin up some VMs on my Proxmox host and mimic a Ceph cluster. Would it be possible to have two Proxmox VMs running under Proxmox to mimic my two standalone host machines, and connect my 3 physical Ceph servers to these two VMs in a cluster to mimic what I would do in production? For now I only have VirtualBox running on my Windows 8 machine, so I don't have a physical test Proxmox server atm. If I can do my testing virtually, that would help me make sure I have the setup correct. It wouldn't let me do much benchmarking or real-world testing, but it would help with my comfort level for the setup.

Something else I'm considering, now that SSD drives and journalling are clearly important, is to take one of the SATA bays that was going to house an OSD and use something like an ICY DOCK EZConvert MB882SP-1S-1B 2.5" to 3.5" SATA 6Gbps SSD & HDD adapter to add an SSD drive to the server. That way I'd have my OS on two drives (RAID 1), 4 drives for OSDs, one drive spare, and the final bay used for the SSD and journalling. Does this community think this would work?

Thanks!
 
I had a chance to do my first test with the SSDs - since they only affect write speed, that's all I'm going to post. This is on a 3-node, 9-OSD, 3-SSD cluster with a 3-replica pool, running:

rados -p test bench 60 write

Code:
Total time run:         60.421890
Total writes made:      2068
Write size:             4194304
Bandwidth (MB/sec):     136.904

If you take a look at my previous post - pre-SSDs, on a 9 OSD cluster with 3 replicas:
Code:
Write:
9 osds:
Total time run:         61.030614
Total writes made:      1204
Write size:             4194304
Bandwidth (MB/sec):     78.911


The difference with the SSD journals in place is astounding - almost double the write speed compared to 9 OSDs with no SSDs. I'm about to expand to our planned starting maximum of 18 OSDs to see how that performs; I'll post back the results when I have them.

Thanks for the comments. I will consider using VLANs for our small network now that you've explained this to me.

I'm still getting up to speed on how Ceph works. For now I'll spin up some VMs on my Proxmox host and mimic a Ceph cluster. Would it be possible to have two Proxmox VMs running under Proxmox to mimic my two standalone host machines, and connect my 3 physical Ceph servers to these two VMs in a cluster to mimic what I would do in production? For now I only have VirtualBox running on my Windows 8 machine, so I don't have a physical test Proxmox server atm. If I can do my testing virtually, that would help me make sure I have the setup correct. It wouldn't let me do much benchmarking or real-world testing, but it would help with my comfort level for the setup.

I think it's a viable way to test, but I haven't tried it myself. I did my very first testing with Proxmox installed in VirtualBox before I moved to real hardware, but I didn't try Ceph until I was on real hardware.

Something else I'm considering, now that SSD drives and journalling are clearly important, is to take one of the SATA bays that was going to house an OSD and use something like an ICY DOCK EZConvert MB882SP-1S-1B 2.5" to 3.5" SATA 6Gbps SSD & HDD adapter to add an SSD drive to the server. That way I'd have my OS on two drives (RAID 1), 4 drives for OSDs, one drive spare, and the final bay used for the SSD and journalling. Does this community think this would work?

Thanks!

As long as you have the SATA ports, and the RAID isn't being done by fakeraid (so either a dedicated RAID card or software RAID), I don't see any issues with it personally. When I eventually spin up a 4th node in the Ceph cluster, it will be a repurposed 4U case that I'll have to get some 3.5" to 2.5" hot-swap adapter bays for, so AFAIK it should be fine as long as the hardware isn't flaky.
 
As another note - how you handle load balancing, i.e. which bonding algorithm you choose, will potentially affect performance.

What I mean (these numbers are only from benchmark tests; the real world - multiple VM hosts interacting with the cluster - could produce different results):

Here's an example (all tests done with pool replica size 3, 1024 PGs, 18 OSDs on 500GB spinners, and 6 SSD journals):

Public/client network - Balance-ALB (2x gig ethernet)
Cluster Network - Balance-ALB (3x gig ethernet)

Write Speed: 147 MB/s
Read Speed: 166 MB/s

Public/client network - Balance-RR (2x gig ethernet)
Cluster Network - Balance-ALB (3x gig ethernet)

Write Speed: 135 MB/s
Read Speed: 307 MB/s


Public/client network - Balance-RR (2x gig ethernet)
Cluster Network - Balance-RR (3x gig ethernet)

Write Speed: 218 MB/s
Read Speed: 254 MB/s

Public/client network - Balance-RR (2x gig ethernet)
Cluster Network - Balance-RR (2x gig ethernet)

Write Speed: 212 MB/s
Read Speed: 242 MB/s


Public/client network - Balance-ALB (2x gig ethernet)
Cluster Network - Balance-RR (3x gig ethernet)

Write Speed: 213 MB/s
Read Speed: 165 MB/s
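In case it's useful, the bonds themselves are plain Debian/Proxmox ifupdown bonding definitions; a sketch of what the cluster-network bond might look like (interface names and addresses are placeholders, and bond_mode is the knob being compared above):

Code:
iface eth2 inet manual
iface eth3 inet manual
iface eth4 inet manual

auto bond0
iface bond0 inet static
    address 192.168.20.2
    netmask 255.255.255.0
    slaves eth2 eth3 eth4
    bond_miimon 100
    bond_mode balance-alb    # or balance-rr, as in the tests above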


Balance-RR may look absolutely awesome and an ideal selection for the back end, but there are some real risks and drawbacks with it. What I mean:

* Balance-RR suffers from diminishing returns. The more links you add, the less you gain per link: 2x 1Gb Ethernet will show ~1.79-1.85 Gbps, and adding a 3rd 1Gb link gains proportionally less again.

* Balance-RR becomes more unreliable if you have partial failures. Since it round-robins between interfaces, if you lose a single interface on a single box, that box will frequently have to re-request packets, because it will never receive the packets that would have been sent over that particular VLAN or switch via the lost interface.

* When using 3 links in Balance-RR (2 links showed periods of low bandwidth, but I hadn't come across any 0s like I have with 3), there were several points during the write phase where cur MB/s dropped below 30 MB/s, and there were even patches of 0 MB/s. I wasn't 100% sure what that meant - whether it was potential data loss, or just enough added latency from packet re-ordering and re-requests to stop progress for those few seconds - SEE THE UPDATE BELOW THE SNIPPETS. See this snippet:

Code:
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   153      16      8906      8890   232.388       104   1.14794   0.27323
   154      16      8921      8905   231.268        60  0.062714  0.272937
   155      16      8921      8905   229.776         0         -  0.272937
   156      16      8921      8905   228.303         0         -  0.272937
   157      16      8964      8948   227.944   57.3333  0.129661  0.280456

and here was a bad one from a different run:

Code:
   189      16     10619     10603   224.366       116  0.098249  0.283019
   190      16     10629     10613   223.396        40   0.08789  0.282832
   191      16     10629     10613   222.226         0         -  0.282832
   192      16     10629     10613   221.069         0         -  0.282832
   193      16     10629     10613   219.923         0         -  0.282832
   194      16     10629     10613    218.79         0         -  0.282832
   195      16     10629     10613   217.668         0         -  0.282832
   196      16     10629     10613   216.557         0         -  0.282832
   197      16     10677     10661   216.432   27.4286  0.166645  0.295532

UPDATE ON THE 0's: it appears to be a journal sizing issue. If the journal is too small and fills up before it can flush the writes out to the OSDs, you lose throughput until there is space again. The default journal size through Proxmox (and, I believe, Ceph if you were to set it up manually) is 5120 MB (5 GB). Upping this took care of the issue (in my case I went to 10 GB), and I'm no longer seeing periods of 0 on long write runs. (I'm doing tests of 300 seconds each, which makes it easier to catch issues like this; it was more pronounced when I was only using spinners, since the drives were so slow I could end up with long periods of 0 throughput until the journal cleared.)

I never saw those kinds of results using ALB or LACP (I'm leaning toward ALB because my switches only support so many LAG groups, and I'm running out quickly). I now suspect I never saw them with ALB or LACP because I could never write data fast enough to fill the journal faster than it could flush to the OSDs, so the journals never filled up.

I used the default journal size that Proxmox creates, which is smaller than some of the recommendations I saw on Ceph's site or in blogs I've read. This is an easy fix: just change the journal size from 5120 to 10240, for a 10 GB journal, prior to creating your OSDs. It only really becomes an issue if your network and journal writes are fast enough, and your spinners slow enough, to fill the journal, at which point everything grinds to a halt until there is space to work with again.
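If anyone wants to do the same, this is the osd journal size setting in ceph.conf (in MB), set before the OSDs are created - a minimal sketch:

Code:
[osd]
    # journal size in MB; the default here was 5120 (5 GB),
    # bumped to 10 GB so long writes don't stall waiting on a flush
    osd journal size = 10240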
 