Sorry, I should have elaborated further. Since Proxmox only handles full disks, i.e. no partitions, 1 OSD = 1 disk in Proxmox. Correct?
Yep, that is correct.
Sorry I didn't point that out, but yes, I want to use the Ceph that comes with Proxmox, i.e. run both on the same node. Initially I was looking for the best way to start with just a single node and add 2 more nodes later, turning them into a Proxmox & Ceph cluster with HA fencing. It was unclear to me whether that's possible and what the process would look like, i.e. whether I could use Ceph with just a single node, or whether I should use local storage first and later add new HDDs for Ceph OSDs and migrate the VMs to them. However, you guys have pretty much convinced me that it's better to start with 3 nodes - I just wanted to keep the initial investment as low as possible, as I'm not yet certain how my idea will work out, which makes the risk of losing the investment higher than I would usually dare. So back on topic - now that I will most likely go with 3 nodes from the start, the following questions appeared:

@Time, it is somewhat unclear how many nodes in total (Proxmox and CEPH) you are going to have in your network. You said you wanted to start with 1 CEPH node; does that mean you have planned other nodes for Proxmox usage only? With Proxmox VE 3.2, both Proxmox and CEPH can coexist on the same hardware. So if you start with only 3 nodes, that gives you quorum and redundancy for both Proxmox and CEPH. Based on this, your node setup will look like this:
Node 1
======
Platform : AMD / Intel
SSD: 1 (For Proxmox Installation)
HDD: 2 (For CEPH OSD+Journal)
Node 2
======
Platform : AMD / Intel
SSD: 1 (For Proxmox Installation)
HDD: 2 (For CEPH OSD+Journal)
Node 3
======
Platform : AMD / Intel
SSD: 1 (For Proxmox Installation)
HDD: 2 (For CEPH OSD+Journal)
No matter how many nodes you want to set up for CEPH, you still need more than one node for Proxmox, unless you were planning to have a 1-node Proxmox cluster. With this setup you let both Proxmox and CEPH do what they do best without spending lots of money.
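To give a rough idea, the per-node setup from the shell could look something like the sketch below. The subnet and the disk names /dev/sdb and /dev/sdc are only placeholders for your two HDDs; check the pveceph man page for the exact options in your version.

    pveceph install                        # pull in the Ceph packages on each node
    pveceph init --network 10.10.10.0/24   # once per cluster; placeholder Ceph subnet
    pveceph createmon                      # one monitor per node for quorum
    pveceph createosd /dev/sdb             # one OSD per HDD, journal on the same disk by default
    pveceph createosd /dev/sdc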
So with "ceph osd pool set" it should be possible to adjust the pool settings from shell, right? I just wonder why there isn't a simple "Edit Pool" feature in the Proxmox GUI, if it actually works that simple.
1.) Isn't it bad to have both the OSD and journal on the same single hard drive, i.e. low I/O? The servers I have in mind only come with 2 drive bays by default, so if I go with your suggested setup, would that also result in worse performance than if I had, let's say, 2 or 3 OSDs/HDDs per node?

With just one HDD used as an OSD per node, either way you look at it you are going to face performance issues, since there are simply not enough HDDs to share the load. Using an SSD for the journal in your case will certainly provide higher performance, but do not expect lightning-fast speed even though you have used an SSD for the journal.
2.) If I go for your suggested setup, will it be possible to simply "remove" the journal from the disks as soon as Proxmox supports the Ceph Firefly release?

I am not fully familiar with the new Firefly, so I do not know if they are fully taking away the need for a journal or just doing it differently. Using the OSD and journal on the same HDD keeps things simpler, though. But regardless of that, it is possible to move the journal around.
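Just as an illustration of what moving a journal involves, a rough sketch, assuming OSD ID 0 and a new journal partition /dev/sda4; stop the OSD first and double-check the steps against the Ceph documentation for your release:

    service ceph stop osd.0                            # stop the OSD (init syntax may differ)
    ceph-osd -i 0 --flush-journal                      # flush whatever is still in the old journal
    rm /var/lib/ceph/osd/ceph-0/journal                # the journal is a file or symlink in the OSD dir
    ln -s /dev/sda4 /var/lib/ceph/osd/ceph-0/journal   # point it at the new partition (assumed name)
    ceph-osd -i 0 --mkjournal                          # initialise the new journal
    service ceph start osd.0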
3.) The racks I would place the servers in only have dual 10G NICs for the connectivity of the whole rack, so I wouldn't be able to use 10G NICs for the internal network, as I won't be able to plug them directly into the switch with SFP modules or something. With your suggested setup, would it throttle I/O too much if I only use 1GigE NICs for internal networking (Ceph syncing)? I suppose not that much, because I believe running only one OSD per node with the journal on the same disk wouldn't result in great performance anyway, but I'm not sure how much network traffic Ceph generates when writing/reading data, how much the journal "buffer" helps, etc.

You are right, running one OSD per node will have no issue with network bandwidth. Higher bandwidth allows a larger amount of data to be transferred between OSDs during self-healing/rebalancing due to HDD failure, adding OSDs, changing the PG number, etc. You will notice these tasks get done faster with higher bandwidth in a cluster with a high number of OSDs. But in your case, a 1Gb NIC will be more than enough for now.
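If you later add a faster internal link, Ceph replication traffic can be kept off the client-facing network with a separate cluster network in the [global] section of ceph.conf (the subnets below are only placeholders):

    [global]
        public network  = 192.168.1.0/24   # monitors and clients (placeholder subnet)
        cluster network = 10.10.10.0/24    # OSD replication and heartbeat traffic (placeholder subnet)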
4.) Is there any ETA for the Proxmox release that will have Ceph Firefly integrated? Just in case it's not easy to migrate from journal-based Ceph to non-journal Ceph.

I am not aware of any release date for CEPH Firefly in Proxmox yet. Honestly, I hope it does not get added the moment it comes out. I would like to see Proxmox wait a bit before implementing it, so the first batch of bugs in Firefly can be worked out. It seems like a drastic change is about to take place in the CEPH arena.
So with "ceph osd pool set" it should be possible to adjust the pool settings from the shell, right? I just wonder why there isn't a simple "Edit Pool" feature in the Proxmox GUI, if it really is that simple.

Yes, it is just a matter of a simple command line to change the PGs. You have to keep in mind that the CEPH integration in Proxmox is a technology preview. We are very fortunate to have the feature control we have now for CEPH through the Proxmox GUI. I really foresee Proxmox adding more and more features in upcoming versions, when we will be able to do more of CEPH in the GUI.
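For example, the usual pool parameters can all be changed from the shell; the pool name rbd and the numbers below are only examples:

    ceph osd pool set rbd size 3        # number of replicas
    ceph osd pool set rbd min_size 1    # minimum replicas required to serve I/O
    ceph osd pool set rbd pg_num 128    # placement groups
    ceph osd pool set rbd pgp_num 128   # PGs used for placement; keep in sync with pg_num
    ceph osd pool get rbd pg_num        # verify the change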
Again I'm sorry about all these probably dumb questions, but jumping from working with your average RAID-10 array for years right into Ceph can be confusing without having read the whole documentation and without real-life experience with it. Thanks to all of you for your replies.

You are very right on this one! People coming from the RAID era have a hard time grasping the concept of CEPH. To them, leaving RAID behind is just a nightmare.
1.) With the 1 SSD + 1 HDD per server scenario, instead of using the HDD for the journal, wouldn't it result in better performance to run both the Ceph journal and Proxmox from the SSD? Or wouldn't Proxmox and the Ceph journal work well together on one SSD?

Both Proxmox and the CEPH journal would definitely work on the same SSD. The journal only needs a separate partition. Since Proxmox and CEPH have to share the SSD's I/O, I don't believe you will get "wow" performance, but it should still be faster than having the journal on the HDD itself.
By "RAID-0 Effect" i think you meant mirror operation? Not clear what you meant there.2.) Different sources about Ceph only mention that the more OSDs one has, the better the I/O performance will be. But where/how exactly would that "RAID-0 effect" begin? I suppose a setup with 3 nodes with only 1 HDD/OSD per node would perform similar to a single HDD or RAID-1, am I right? Now if I would have 2 HDDs/OSDs per node and 3 nodes in total, would that already increase the I/O compared to a single SATA?
One further question that I already asked but that hasn't been answered yet, I think: Is the default Ceph pool "RBD" only for a specific number of nodes and OSDs, or can I use it for the 3-node, 1-OSD-per-node setup that symmcom suggested and then, for instance, add a fourth node *without* adjusting the pool settings? Or is it *required* to change, for instance, the "Size" and "Min. Size" settings as well as the "pg_num"? Is the default setting of the "RBD" pool fine, or should I rather use pg_num 150 for the suggested setup instead of the default, which is 64?

Realistically, you will always have to adjust your PGs based on the number of OSDs you have in your cluster. A good formula to calculate the number of PGs is this (http://ceph.com/docs/master/rados/operations/placement-groups/):

Total PG = (# of OSDs x 100) / Replicas, rounded up to the nearest power of 2

So let's say you have 2 nodes with 1 OSD each. Your ideal PG count would be: Total PG = (2 x 100) / 2 = 100 = 128 (nearest power of 2). If you added an additional OSD in each node: Total PG = (4 x 100) / 2 = 200 = 256 (nearest power of 2). The default of 64 PGs is just there as a starting point.
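A quick way to do that calculation on the shell (a throwaway sketch; OSDS and REPLICAS are just example values):

    OSDS=4; REPLICAS=2
    RAW=$(( OSDS * 100 / REPLICAS ))
    PG=1
    while [ "$PG" -lt "$RAW" ]; do PG=$(( PG * 2 )); done   # round up to the next power of 2
    echo "pg_num: $PG"                                      # prints 256 for 4 OSDs with 2 replicas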
Thanks, now I understand that too! So the only question left would be how to split the SSD into two partitions, one for Proxmox and one for the Ceph journal. Would "maxroot=128" during the Proxmox install leave the rest of the space unassigned and theoretically free for a Ceph journal, assuming I have a 256GB SSD?
With SSDs so cheap nowadays, you could very well throw in a small 60GB SSD for the journal. That saves you the hassle of partitioning.
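If you would rather split the existing SSD after all, the leftover space can be carved into a journal partition by hand, roughly like this. A sketch only: /dev/sda as the Proxmox SSD, partition number 4 and the sizes are assumptions, and the journal option of pveceph createosd should be verified against your PVE version:

    parted /dev/sda print                               # see where the Proxmox partitions end
    parted /dev/sda mkpart primary 130GB 150GB          # carve ~20GB of the free space for the journal
    pveceph createosd /dev/sdb -journal_dev /dev/sda4   # OSD on the HDD, journal on the SSD partition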
True, but the problem is that the servers I want to use only have 2 drive bays, which is why it isn't that easy unfortunately.
So the guys say it's still experimental... is it a good idea to use it in production?
Although you did not mention it, I am assuming you are using 3 replicas? That means any data you store gets replicated across 3 nodes. For example, if you store 100GB of data, it will take 300GB to store it (3 x 100GB) when 3 replicas are set. If you had 2 replicas, then your usable storage would have been 12TB instead of 8TB. Replicas can be adjusted at any time. But make sure you give the cluster enough time to go through the rebalancing of data after changing replicas. This also applies when you change PG numbers.
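Changing the replica count and then watching the rebalance is again one command per setting (pool name rbd is assumed here):

    ceph osd pool set rbd size 2       # go from 3 replicas to 2
    ceph osd pool set rbd min_size 1   # minimum replicas required before I/O blocks
    ceph -w                            # watch the cluster until it reports HEALTH_OK again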
You can put the journals on the same disks as the OSDs; it will work. A journal on SSD provides a performance benefit for up to about 8 OSDs per node. Beyond that, co-locating journal and OSD on the same HDD is a good idea.