Ceph with 1 server, then adding a second and third

Sorry, I should have elaborated further. Since Proxmox only handles full disks, i.e. no partitions, 1 OSD = 1 disk in Proxmox. Correct?
 
@Time, it is somewhat unclear how many nodes in total (Proxmox and CEPH) you are going to have in your network. You said you wanted to start with 1 CEPH node; does that mean you have planned other nodes for Proxmox usage only? With Proxmox VE 3.2 both Proxmox and CEPH can coexist on the same hardware. So if you start with only 3 nodes, that gives you quorum and redundancy for both Proxmox and CEPH. Based on this, your node setup would look like this:
Node 1
======
Platform : AMD / Intel
SSD: 1 (For Proxmox Installation)
HDD: 2 (For CEPH OSD+Journal)

Node 2
======
Platform : AMD / Intel
SSD: 1 (For Proxmox Installation)
HDD: 2 (For CEPH OSD+Journal)

Node 3
======
Platform : AMD / Intel
SSD: 1 (For Proxmox Installation)
HDD: 2 (For CEPH OSD+Journal)

No matter how many nodes you want to set up for CEPH, you still need more than one node for Proxmox, unless you were planning to have a 1-node Proxmox cluster. With this setup you let both Proxmox and CEPH do what they do best without spending lots of money.
Sorry I didn't point that out, but yes, I want to use the Ceph that comes with Proxmox, i.e. run both on the same node. Initially I was looking for the best way to start with just a single node and add 2 more nodes later, turning them into a Proxmox & Ceph cluster with HA fencing. It was unclear to me whether that is possible and what the process would look like, i.e. whether I could use Ceph with just a single node, or whether I should use local storage and later add new HDDs for Ceph OSDs and migrate the VMs to them.

However, you guys have pretty much convinced me that it's better to start with 3 nodes - I just wanted to keep the initial investment as low as possible, as I'm not that certain yet how my idea will work out, which makes the risk of losing the investment higher than I would usually dare. So back on topic - now that I will most likely go with 3 nodes from the start, the following questions came up:

1.) Isn't it bad to have both the OSD and journal on the same single hard drive, i.e. low I/O? The servers I have in mind only come with 2 drive bays by default, so if I go with your suggested setup, would that result in worse performance than if I had, let's say, 2 or 3 OSDs/HDDs per node?

2.) If I go for your suggested setup, will it be possible to simply "remove" the journal from the disks as soon as Proxmox supports the Ceph firefly release?

3.) The racks I would place the servers in only have dual 10G NICs for the connectivity of the whole rack, so I wouldn't be able to use 10G NICs for the internal network, as I won't be able to plug them directly into the switch with SFP modules or similar. With your suggested setup, would it throttle I/O too much if I only use 1GigE NICs for internal networking (Ceph syncing)? I suppose not that much, because I believe running only one OSD per node with the journal on the same disk wouldn't result in great performance anyway, but I'm not sure how much network traffic Ceph generates when writing/reading data, how much the journal "buffer" helps, etc.

4.) Is there any ETA for the Proxmox release that would have Ceph firefly integrated? Just in case it's not easy to migrate from journal-based Ceph to non-journal Ceph.

So with "ceph osd pool set" it should be possible to adjust the pool settings from shell, right? I just wonder why there isn't a simple "Edit Pool" feature in the Proxmox GUI, if it actually works that simple.

Again I'm sorry about all these probably dumb questions, but jumping from working with your average RAID-10 array for years right into Ceph can be confusing without having read the whole documentation and without real-life experience with it. Thanks to all of you for your replies.

PS: Looks like the forum's WYSIWYG editor isn't working properly in Chromium - it messed up the whole format of my post, had to edit it in Iceweasel.
 
1.) Isn't it bad to have both the OSD and journal on the same single hard drive, i.e. low I/O? The servers I have in mind only come with 2 drive bays by default, so if I go with your suggested setup, would that result in worse performance than if I had, let's say, 2 or 3 OSDs/HDDs per node?
With just one HDD used as an OSD per node, either way you look at it you are going to face performance issues, since there just aren't enough HDDs to share the load. Using an SSD for the journal in your case will certainly provide higher performance, but do not expect lightning-fast speed just because the journal sits on an SSD.
For a small CEPH node with fewer than 8 OSDs, using an SSD for the journal makes sense, but once you go beyond 8 OSDs per node, putting the journal on the same OSD will save you a world of headache without hurting performance much. The more OSDs you have, the less you need to depend on a separate SSD for the journal.
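If you do end up with an SSD to spare for the journal, the choice is made when the OSD is created. If I remember the Proxmox wiki correctly, pveceph lets you pick the journal device at that point; something along these lines (a sketch only, check "pveceph help createosd" on your version, device names are examples):

# journal co-located on the same disk (default)
pveceph createosd /dev/sdb
# journal on a separate SSD device/partition
pveceph createosd /dev/sdb -journal_dev /dev/sdc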

2.) If I go for your suggested setup, will it be possible to simply "remove" the journal from the disks as soon as Proxmox supports the Ceph firefly release?
I am not fully familiar with the new Firefly, so I do not know whether they are taking away the need for a journal entirely or just doing it differently. Using the OSD and journal on the same HDD keeps things simpler, though. But regardless of that, it is possible to move a journal around.
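Roughly, moving a journal looks like this (a from-memory sketch, so double-check against the Ceph docs before trying it on live data; osd.0 and the partition path are just examples):

# stop the OSD and flush its current journal to the data disk
service ceph stop osd.0
ceph-osd -i 0 --flush-journal
# point the OSD at the new journal location, then initialize it
ln -sf /dev/disk/by-partuuid/<new-journal-partition> /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal
service ceph start osd.0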

3.) The racks I would place the servers in only have dual 10G NICs for the connectivity of the whole rack, so I wouldn't be able to use 10G NICs for the internal network, as I won't be able to plug them directly into the switch with SFP modules or similar. With your suggested setup, would it throttle I/O too much if I only use 1GigE NICs for internal networking (Ceph syncing)? I suppose not that much, because I believe running only one OSD per node with the journal on the same disk wouldn't result in great performance anyway, but I'm not sure how much network traffic Ceph generates when writing/reading data, how much the journal "buffer" helps, etc.
You are right, running one OSD per node will have no issue with network bandwidth. Higher bandwidth allows a larger amount of data to be transferred between OSDs during self-healing/rebalancing after an HDD failure, adding an OSD, changing the PG number, etc. You will notice these tasks get done faster with higher bandwidth in a cluster with many OSDs. But in your case, a 1GbE NIC will be more than enough for now.
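As a rough sanity check (back-of-the-envelope numbers, not a benchmark): a single 1GbE link tops out at roughly 110-120 MB/s, which is in the same ballpark as the sequential throughput of one 7200rpm SATA disk. With one OSD per node the disk and the NIC therefore saturate at about the same point; the network only becomes the bottleneck once several OSDs (or SSDs) per node can push more than one link can carry.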

4.) Is there any ETA for the Proxmox release that would have Ceph firefly integrated? Just in case it's not easy to migrate from journal-based Ceph to non-journal Ceph.
I am not aware of any release date for CEPH Firefly in Proxmox yet. Honestly, I hope it does not get added the moment it comes out. I would like to see Proxmox wait a bit before implementing it, so that the first batch of bugs gets worked out in Firefly. It looks like a drastic change is about to take place in the CEPH arena.

So with "ceph osd pool set" it should be possible to adjust the pool settings from shell, right? I just wonder why there isn't a simple "Edit Pool" feature in the Proxmox GUI, if it actually works that simple.
Yes, it is just a matter of a simple command line to change the PG count. You have to keep in mind that the CEPH integration in Proxmox is a technology preview. We are very fortunate to have the feature control we already have for CEPH through the Proxmox GUI. I foresee Proxmox adding more and more features in upcoming versions, so we will be able to do more of CEPH in the GUI.
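For example, the usual pool knobs look like this on the shell (a sketch assuming the default pool name "rbd"; adjust the name and numbers to your pool):

# replica count and the minimum replicas needed to keep serving I/O
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
# placement groups; raise pgp_num to match pg_num afterwards
ceph osd pool set rbd pg_num 128
ceph osd pool set rbd pgp_num 128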

Again I'm sorry about all these probably dumb questions, but jumping from working with your average RAID-10 array for years right into Ceph can be confusing without having read the whole documentation and without real-life experience with it. Thanks to all of you for your replies.
You are very right on this one! People coming from the RAID era have a hard time grasping the concept of CEPH. To them it is just a nightmare to let go of RAID. :)
[[ RAID users, do not raise protest against this comment of mine. :) ]]
 
Thank you for your detailed reply, symmcom, it was really helpful! Every time I think about it, I run into new questions unfortunately.

1.) With the 1 SSD + 1 HDD per server scenario, instead of using the HDD for journal, wouldn't it result in better performance to run both the Ceph journal and Proxmox from the SSD? Or wouldn't Proxmox and the Ceph journal work well together on one SSD?

2.) Different sources about Ceph only mention that the more OSDs one has, the better the I/O performance will be. But where/how exactly would that "RAID-0 effect" begin? I suppose a setup with 3 nodes with only 1 HDD/OSD per node would perform similar to a single HDD or RAID-1, am I right? Now if I would have 2 HDDs/OSDs per node and 3 nodes in total, would that already increase the I/O compared to a single SATA?
 
Different sources about Ceph only mention that the more OSDs one has, the better the I/O performance will be. But where/how exactly would that "RAID-0 effect" begin? I suppose a setup with 3 nodes with only 1 HDD/OSD per node would perform similar to a single HDD or RAID-1, am I right? Now if I would have 2 HDDs/OSDs per node and 3 nodes in total, would that already increase the I/O compared to a single SATA?

OSD = Object Storage Daemon. The deployments I have seen use one daemon per spindle and some number of daemons per node. Barring a poorly engineered node, the "RAID-0 effect" is typically constrained by NIC bandwidth. For a Proxmox host that does not use 10GigE or InfiniBand NICs, OSDs on the same host should have a significant bandwidth advantage over those on other machines. I find this quite interesting in that it might carry some benefits over a local RAID array (in particular, replication to other nodes). The project I am currently working on does not have significant CPU load and has the potential to benefit from this type of architecture.
 
1.) With the 1 SSD + 1 HDD per server scenario, instead of using the HDD for journal, wouldn't it result in better performance to run both the Ceph journal and Proxmox from the SSD? Or wouldn't Proxmox and the Ceph journal work well together on one SSD?
Both Proxmox and the CEPH journal would definitely work on the same SSD; the journal only needs a separate partition. Since Proxmox and CEPH have to share the SSD's I/O, I don't believe you will get "wow" performance, but it should still be faster than keeping the journal on the HDD itself.
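I have not run that exact layout myself, but mechanically it is just a matter of leaving (or carving out) a spare partition on the SSD and pointing the journal at it when the OSD is created. A rough sketch, assuming /dev/sda is the SSD; the exact parted syntax depends on how the installer partitioned the disk:

# see what the Proxmox installer left on the SSD and where the free space is
parted /dev/sda unit GB print free
# create a partition in the unused space for the journal (adjust start/end)
parted /dev/sda mkpart primary 230GB 250GB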

2.) Different sources about Ceph only mention that the more OSDs one has, the better the I/O performance will be. But where/how exactly would that "RAID-0 effect" begin? I suppose a setup with 3 nodes with only 1 HDD/OSD per node would perform similar to a single HDD or RAID-1, am I right? Now if I would have 2 HDDs/OSDs per node and 3 nodes in total, would that already increase the I/O compared to a single SATA?
By "RAID-0 Effect" i think you meant mirror operation? Not clear what you meant there.

3 Nodes with 2 HDDs(OSDs) in each will have better performance than 3 nodes with 1 HDD(OSD) in each. Also,
2 Nodes with 2 HDDs(OSDs) in each will have better performance than 3 nodes with 1 HDD(OSD) in each.
 
Again, thanks for your replies! If I want to have two partitions on one SSD, one for Proxmox and one for the Ceph journal, can I achieve that with "maxroot", i.e. will it leave the remaining space empty? For instance, if I have a 256GB SSD and set "maxroot=128", would it leave the remaining space empty and available as a Ceph journal? Or would it make more sense to use a Debian ISO and then install Proxmox on top of it?

About the "RAID-0 effect": I meant mirror operation, yes, and I wanted to know with how many OSDs exactly that would be possible, but you have answered my question well enough already - I will just add a second disk to each node if I run into I/O issues.

One further question that I already asked, but that hasn't been answered yet I think: Is the default Ceph pool "RBD" only for a specific number of nodes and OSDs, or can I use it for the 3-node, 1-OSD-per-node setup that symmcom suggested and then, for instance, add a fourth node *without* adjusting the pool settings? Or is it *required* to change, for instance, the "Size" and "Min. Size" settings as well as "pg_num"? Is the default setting of the "RBD" pool fine, or should I rather use pg_num 150 for the suggested setup instead of the default, which is 64?
 
One further question that I already asked, but that hasn't been answered yet I think: Is the default Ceph pool "RBD" only for a specific number of nodes and OSDs, or can I use it for the 3-node, 1-OSD-per-node setup that symmcom suggested and then, for instance, add a fourth node *without* adjusting the pool settings? Or is it *required* to change, for instance, the "Size" and "Min. Size" settings as well as "pg_num"? Is the default setting of the "RBD" pool fine, or should I rather use pg_num 150 for the suggested setup instead of the default, which is 64?
Realistically, you will always have to adjust your PG count based on the number of OSDs you have in your cluster. A good formula to calculate the number of PGs is this (http://ceph.com/docs/master/rados/operations/placement-groups/):
Total PGs = (# of OSDs x 100) / Replicas, rounded up to the nearest power of 2

So let's say you have 2 nodes with 1 OSD in each. Your ideal PG count would be: Total PGs = (2 x 100) / 2 = 100, rounded up to 128 (nearest power of 2).

If you added an additional OSD in each node: Total PGs = (4 x 100) / 2 = 200, rounded up to 256 (nearest power of 2).

The default of 64 PGs is just there as a starting point.
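For anyone who wants to script that rule of thumb, here is a tiny sketch (plain bash; the numbers are just an example):

#!/bin/bash
# rough PG estimate: (OSDs * 100) / replicas, rounded up to a power of 2
osds=4
replicas=2
pg=$(( osds * 100 / replicas ))
pow=1
while [ "$pow" -lt "$pg" ]; do pow=$(( pow * 2 )); done
echo "suggested pg_num: $pow"

With osds=4 and replicas=2 it prints 256, matching the example above.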
 
Realistically, you will always have to adjust your PG count based on the number of OSDs you have in your cluster. A good formula to calculate the number of PGs is this (http://ceph.com/docs/master/rados/operations/placement-groups/): Total PGs = (# of OSDs x 100) / Replicas, rounded up to the nearest power of 2. So let's say you have 2 nodes with 1 OSD in each. Your ideal PG count would be: Total PGs = (2 x 100) / 2 = 100, rounded up to 128 (nearest power of 2). If you added an additional OSD in each node: Total PGs = (4 x 100) / 2 = 200, rounded up to 256 (nearest power of 2). The default of 64 PGs is just there as a starting point.
Thanks, now I understand that too! So the only question left is how to split the SSD into two partitions, one for Proxmox and one for the Ceph journal. Would "maxroot=128" during the Proxmox install leave the rest of the space unassigned and theoretically free for a Ceph journal, assuming I have a 256GB SSD?
 
Thanks, now I understand that too! So the only question left is how to split the SSD into two partitions, one for Proxmox and one for the Ceph journal. Would "maxroot=128" during the Proxmox install leave the rest of the space unassigned and theoretically free for a Ceph journal, assuming I have a 256GB SSD?
With SSDs so cheap nowadays, you could very well throw in a small 60GB SSD for the journal. That saves you the hassle of partitioning.
 
With SSDs so cheap nowadays, you could very well throw in a small 60GB SSD for the journal. That saves you the hassle of partitioning.
True, but the problem is that the servers I want to use only have 2 drive bays, which is why it isn't that easy unfortunately.
 
Firefly is available now... As mentioned many times, Firefly does not need journals anymore. BUT:

"Key/value OSD backend (experimental): An alternative storage backend for Ceph OSD processes that puts all data in a key/value database like leveldb. This provides better performance for workloads dominated by key/value operations (like radosgw bucket indices)." - See more at: http://ceph.com/releases/v0-80-firefly-released/

So the guys say it's still experimental... Is it a good idea to use it in production?
 
Hello,

I am a newbie to this subject.

I have 3 nodes setup per suggestion of ProxMox's wiki. Each has:

Boot= 1 x 250GB
Journal = 1 x 250GB
OSDs = 4 x 2TB

Based on the Proxmox article (http://pve.proxmox.com/wiki/Ceph_Server), I understand my maximum usable storage capacity to be 8TB (100%).

QUESTIONS:

- Why is it that I have a total raw storage capacity of 24TB (12 x 2TB) but only 8TB can be used?

- What would happen if I store data more than 8TB?

- If I want more storage capacity than 8TB, can I just add another server with same exact hardware configuration and instantly get another 8TB?

- What if I add a 4th, 5th, and 6th server to the cluster with same hardware configuration? Is there an advantage to storage and performance?

- Since future versions of CEPH are not going to use a journal, can I just put the journal on the 3rd OSD drive for now? My servers only have 6 bays and I just don't want to go through the hassle of having to add in another 2TB drive later and deal with the OSD renaming, etc. I would rather just add another 2TB drive now and make it an OSD. I am trying to prepare ahead for the future CEPH release (no need for a dedicated journal disk).

Your help and guidance is very much appreciated.

Andrew
 
Although you did not mention it, I am assuming you are using 3 replicas? That means any data you store gets replicated across 3 nodes. For example, if you store 100GB of data, it will take 300GB to store (3 x 100GB) when 3 replicas are set. If you had 2 replicas, your usable storage would have been 12TB instead of 8TB. Replicas can be adjusted at any time, but make sure you give the cluster enough time to go through the rebalancing of data after changing replicas. This also applies when changing PG numbers.

You can put the journals on the same OSDs; it will work. A journal on SSD provides a performance benefit up to about 8 OSDs per node. Beyond that, co-locating the journal and OSD on the same HDD is a good idea.
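To see how raw versus usable capacity works out on your own cluster, and to watch a rebalance while it runs, the standard status commands are enough (nothing Proxmox-specific here):

# global raw capacity plus per-pool usage
ceph df
# one-shot cluster status, including recovery/backfill progress
ceph -s
# the same, but streaming updates while the cluster rebalances
ceph -w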
 
Although you did not mention it, I am assuming you are using 3 replicas? That means any data you store gets replicated across 3 nodes. For example, if you store 100GB of data, it will take 300GB to store (3 x 100GB) when 3 replicas are set. If you had 2 replicas, your usable storage would have been 12TB instead of 8TB. Replicas can be adjusted at any time, but make sure you give the cluster enough time to go through the rebalancing of data after changing replicas. This also applies when changing PG numbers.

You can put the journals on the same OSDs; it will work. A journal on SSD provides a performance benefit up to about 8 OSDs per node. Beyond that, co-locating the journal and OSD on the same HDD is a good idea.

Thank you very much for the quick reply. I am grateful.

I have not set up any replicas yet. So in summary, can I set up as many replicas as I would like?

What is the advantage to setting up 3 replicas versus 2? I mean a copy is a copy. So is there a technical advantage to having 3 or even 4 replicas?

In the event of having 2 replicas, I get 12TB. What would happen if 2 out of 3 servers died? Since the physical capacity of each server is 8TB, does this mean I will lose data if I store more than 8TB?

You mentioned that a journal on SSD provides a performance benefit up to 8 OSDs per node. Does this mean I may have performance issues, since I have only 5 OSDs per node and am co-locating the journal on the same HDDs?

When you mentioned "give enough time" for the data to rebalance, does this mean I just wait and do nothing? What if I have live users on the system while the data is rebalancing? What will happen?

After changing replicas and/or PGs, how long is the typical wait for the cluster to go through the rebalancing of data? Is there a way to see its progress?

Thank you very much for your help.

Andrew
 
True, but the problem is that the servers I want to use only have 2 drive bays, which is why it isn't that easy unfortunately.

What server/chassis is it? There are plenty of dual 2.5" adapters for 3.5" bays out there. We did this with several of our older Supermicro servers with 8 x 3.5" bays and two non-hotswap bays, where we loaded up 4 SSD drives as a ZIL/L2ARC for another project.

These are some of the better ones that we've used: Thermaltake. We've used others, but unfortunately, the mounting position for the screws does not adapt well in certain situations. I think it was this Silverstone version where the lower mount against the bottom plate would hit the screw heads they provided. We ended up using those for single 2.5" setups, where the upper slot was more convenient.

Carlos.
 
