Enterprise HW Config w/ Ceph

unknownvariable

New Member
Jul 13, 2024
Hi

We are planning to drop vSphere in a (smallish) medium-size enterprise, which for us means an all-new HW config and way of doing things. At least for now we are looking at using Ceph to get close to parity with VMware (snapshots, thin provisioning, etc.). For those that have done something similar at scale with Ceph, what sort of HW config did you find optimal?

For VMware we tend to keep cluster sizes around 10 hosts (give or take) for OS licensing purposes and to minimize the blast radius should things go sideways. I imagine a similar approach for PVE would make sense but am curious about number of OSDs/drives per host and number of hosts. Workloads are "average", so to avoid an "it depends" answer, we are basically looking for what is optimal from a cost / performance perspective with an enterprise budget.

Thanks in advance.
 
Odd number of nodes, usually fewer than 10 OSDs per node, and that should be it for smallish clusters (<15 nodes). Depending on HDD, SSD or NVMe, 10G is the minimum, with 25G or even 40G in mind.
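To put rough numbers on why 10G gets tight quickly, here is a back-of-envelope sketch (Python; the drive count and per-drive throughput are assumptions for illustration, not benchmarks):

Code:
# Back-of-envelope: can one node's NIC keep up with its OSDs?
# All throughput numbers below are illustrative assumptions, not benchmarks.
drives_per_node = 8        # planned OSDs per node
drive_mb_s = 3000          # assumed sequential MB/s per NVMe (a SATA SSD is closer to ~500)

node_disk_gbit = drives_per_node * drive_mb_s * 8 / 1000
print(f"Aggregate disk bandwidth per node: ~{node_disk_gbit:.0f} Gbit/s")

# Replication, backfill and recovery traffic rides the Ceph network too,
# which is why a 10G link saturates long before the drives do.
for nic_gbit in (10, 25, 40, 100):
    print(f"{nic_gbit}G link: the drives can oversubscribe it ~{node_disk_gbit / nic_gbit:.0f}x")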
 
As we need CPU and RAM for the workloads anyway, we dropped the concept of hyperconverged and separate the storage (Ceph, NFS) from the workload clusters.

And we don't go further than 20 nodes per cluster (the only disadvantage of PVE imo: scalability).
Does splitting Ceph out onto separate storage nodes increase or decrease operational complexity? I would imagine managing capacity independently makes things easier from a planning / provisioning perspective (we do this for vSphere for that reason). Curious about upgrades and patching.
 
Imo it decreases it a lot.

Upgrading a large hyperconverged production cluster is... challenging :)

And yes, I too think that scaling is much easier that way.

It depends on your situation. You need more hardware and more rackspace, but we don't ever do hyperconverged clusters. We also have more than one storage "tier", which is not easily possible otherwise (SSD Ceph, HDD Ceph, DAS NVMe, SSD NFS, HDD NFS, ...).
 
What do your CPU / mem specs look like for the PVE hosts and the CPU/mem/SSD for the dedicated Ceph cluster?

Right now I am thinking about the "building blocks" needed, and then we'd scale horizontally for capacity as needed. For vSphere all our hosts are the same (for each gen of HW), and clusters have access to 2-3 different NetApp AFF SANs.

Thanks btw.. this is helpful.. also @ness1602. And yes we are thinking 40Gb is probably where we want to go unless we keep the clusters small.
 
We have 2 clusters running hyperconverged - one is 3 nodes and the other is 4 nodes. Ideally I agree, odd numbers are good but alas it was like this when I got here. We haven't had any issues with our 3-node cluster as far as patching and maintenance goes. Same with our 4-node. Our prod cluster is on 100G networking and there are times when shuffling stuff around we are pushing it pretty hard. If you have the budget and go hyperconverged... I would suggest 100G, or perhaps splitting Ceph traffic. We already have 100G networking for all our other stuff, so for us it isn't a huge deal. Both clusters were NOT 100G originally and we had piles of issues, as we could easily saturate 10G/25G. Ceph traffic was split back then, but we collapsed to one network instead of two since we have fat pipes, for "simplicity's" sake. There are both pros and cons to this of course.
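If you do split the Ceph traffic, it is just two subnets in the [global] section of ceph.conf; a minimal sketch, with made-up example subnets:

Code:
[global]
    # client/VM-facing Ceph traffic
    public_network = 10.10.10.0/24
    # OSD replication, backfill and recovery traffic
    cluster_network = 10.10.20.0/24

Collapsing everything onto one fat pipe like we ended up doing just means leaving cluster_network unset (it then defaults to the public network).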

Ceph has been very solid, but as some of the other folks here have noted... there certainly can be dragons, and I like to keep my storage/compute separate. I haven't had issues, but I always treat Ceph... very carefully. Same disk sizes/types across both clusters. We thankfully rarely touch it and it "just works", but we have simple needs.

In a previous life I ran several Nutanix AHV/VMware clusters. It was mostly solid, but I had more issues than I would care to think about (bugs, bugs, bugs). I guess I'm not sold on the simplicity of hyperconverged... but again this is only my opinion. Ceph has been around a long time and certainly is battle tested, but it's a big kitty to run standalone. Proxmox does a good job of simplifying it. Most current-gen storage arrays are quite easy to manage, so I can't buy the Nutanix/(insert hyperconverged provider here) BS of "oh it simplifies storage blah blah". It can... but as always it depends on your needs and the skill sets of you and your team.

We're in the midst of planning a refresh and the current selection is chunky AMD boxes (single socket) with lots of RAM (same make/model), since our lab cluster is growing exponentially with more use cases and test use cases. We have talked about expanding the cluster with whatever HW we have lying around and I have shot that down hard. It's imperative to keep everything the same (esp. with Ceph) unless you love a good challenge haha.

Since there is no licensing tax I'm trying to get it nicely specc'ed out. Likely end up with 3 or 5 nodes depending on the budget but... probably only three which will be okay. Likely going to have to be hyperconverged... which is fine.

If I eventually have my way (budget!) we will drop Ceph for a Pure array (probably an X20) and just run NFS. That's my dream, anyway. I terribly miss my Pure arrays.

Good luck!
 
Ideally I agree, odd numbers are good but alas it was like this when I got here.
The number/"oddness" of nodes isn't relevant in and of itself. What you want is:
3x monitors
(edit: a minimum of) r+1 OSD nodes, where r = the number of shards in a placement group, which in most cases with a replicated CRUSH rule is 3.
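A quick sketch of what that works out to (Python; the node count, OSD count and drive size are example numbers only):

Code:
# Minimal sizing sketch for a replicated Ceph pool (example numbers only).
replicas = 3                # "r" above: copies kept per placement group
osd_nodes = replicas + 1    # r+1 nodes so the cluster can self-heal after losing a node
osds_per_node = 8
osd_size_tb = 7.68          # assumed capacity per OSD

raw_tb = osd_nodes * osds_per_node * osd_size_tb
usable_tb = raw_tb / replicas
print(f"Nodes: {osd_nodes}, raw: {raw_tb:.0f} TB, usable at size={replicas}: ~{usable_tb:.0f} TB")
# Plan to stay well under ~80% full so a failed node's data can be re-replicated.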
we are looking at using Ceph to get close to parity with VMware (snapshots, thin provisioning, etc.).
That's not a valid comparison; Ceph is analogous to vSAN, not vSphere. PVE supports snapshots and thin provisioning on other forms of storage.

For VMware we tend to keep cluster sizes around 10 hosts (give or take) for OS licensing purposes and to minimize the blast radius should things go sideways. I imagine a similar approach for PVE would make sense
It does, your instincts are spot on.

but am curious about number of OSDs/drives per host and number of hosts.
As others pointed out, it all depends on your workload, individual guest IOPS requirements, TOTAL IOPS requirement, back-end network availability, etc. Cost optimization would be secondary to meeting performance requirements, but in case you have a hard budget limit you can aim to maximize the overall solution as long as you understand and accept the inevitable pitfalls of the compromises required.
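To make the "TOTAL IOPS requirement" point concrete, a rough demand-vs-supply sketch (Python; guest count, write mix and per-OSD IOPS are assumptions, not measurements):

Code:
# Rough IOPS budget for a replicated pool (all numbers are illustrative assumptions).
guests = 300
avg_guest_iops = 150            # steady-state per guest
write_fraction = 0.3
replicas = 3                    # each client write becomes 'replicas' backend writes

backend_iops = (guests * avg_guest_iops * (1 - write_fraction)
                + guests * avg_guest_iops * write_fraction * replicas)

per_osd_iops = 15000            # assumed sustained 4k IOPS for one enterprise NVMe OSD
print(f"Backend IOPS needed: ~{backend_iops:,.0f}")
print(f"OSDs needed at {per_osd_iops} IOPS each: ~{backend_iops / per_osd_iops:.1f} (before recovery headroom)")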
 
That's not a valid comparison; Ceph is analogous to vSAN, not vSphere. PVE supports snapshots and thin provisioning on other forms of storage.
We wanted to check off as many "feature" boxes as we have for VMware: thin provisioning, snapshots, resilient storage (surviving host failures without a large crater / significant data loss). According to the PVE storage matrix, Ceph seems to be closest: https://pve.proxmox.com/wiki/Storage. Not many options state "yes" across the board, or am I missing something?

I want to avoid engineering one-off solutions and would think striving for simplicity and "closest to out of the box" would be key to longer-term success. At the least, select more popular solutions so that we aren't the first ones encountering issues or creating weird/original problems for ourselves.

In a perfect world it would be great to be able to continue to use iSCSI with our NetApp SANs, but the loss of snapshots is a deal breaker.
 
According to the PVE storage matrix, Ceph seems to be closest
Again, apples and oranges. The featureset available to the hypervisor is a function of the underlying storage, the exception being that VMware can do more with iSCSI storage than PVE. When designing your solution, consider the disparate goals that you have in terms of functionality and performance, and you can design analogous solutions using either environment. The Proxmox environment rolls in multiple software solutions to provide various services, including Ceph, which would be analogous to vSAN.

In a perfect world it would be great to be able to continue to use iSCSI with our NetApp SANs, but the loss of snapshots is a deal breaker.
If your NetApp can provide NFS functionality instead, you can continue to use it with snapshots and thin provisioning intact.
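For reference, a minimal sketch of what that could look like in /etc/pve/storage.cfg (the storage name, server and export path are made up):

Code:
nfs: netapp-nfs
    export /vol/pve_datastore
    server 10.0.0.50
    path /mnt/pve/netapp-nfs
    content images,iso
    options vers=4.1

VM disks created there as qcow2 are thin provisioned and snapshot-capable; raw files on NFS would lose snapshots again.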
 
If your NetApp can provide NFS functionality instead, you can continue to use it with snapshots and thin provisioning intact.
Our older SANs are AFF and can do NFS, but all of our new ones are ASAs and are iSCSI only. By the time we finish the migration off of VMware, our AFFs will be EoS.
 
Alternatively, you can look at solutions such as Blockbridge to make your NetApp work. Paging @bbgeek17
I did look at them. Financially they will be a headache. Our company goes to significant lengths to avoid operating expenses (subscriptions) and prefers capital expenses. Finance groans over NetApp maintenance and support, but we have been able to work with NetApp to front-load purchases to shift costs to the capital side.

Our company provides critical infrastructure and historically we have gone with big vendors and always had support agreements. The OpEx pressure over the years has forced adoption of open-source solutions, but only on the edges in non-critical areas (monitoring, etc.). This move away from VMware will be a fundamental shift.
 
I only had one client who split Ceph and Proxmox; it also worked well. But the reason I push small hyperconverged clusters is that they are easier to maintain, and it's OK if one crashes, because you usually build several small clusters.
 
This move away from VMware will be a fundamental shift.
I hear that. a lot.

Our company provides critical infrastructure and historically we have gone with big vendors and always had support agreements.
This is somewhat of a challenge with Proxmox, especially if you're not on European time. Unless you have some in-house talent (read: a competent Linux sysadmin) you may want to consider some third-party help.

So it comes back to the beginning. Size and scope will determine the hardware and infrastructure; you'll either pay to buy hardware and infra or to support your existing investment. I don't envy you the work of proving TCO and calculating ROI.
 
The number/"oddness" of nodes isn't relevant in and of itself. What you want is:
3x monitors
(edit: a minimum of) r+1 OSD nodes, where r = the number of shards in a placement group, which in most cases with a replicated CRUSH rule is 3.
<snip>
Thank you for the correction!
 
This is somewhat of a challenge with Proxmox, especially if you're not on European time. Unless you have some in-house talent (read: a competent Linux sysadmin) you may want to consider some third-party help.

So it comes back to the beginning. Size and scope will determine the hardware and infrastructure; you'll either pay to buy hardware and infra or to support your existing investment. I don't envy you the work of proving TCO and calculating ROI.
We are in North America, so yeah...

And while I think it's strange, the decision to leave VMware, with Proxmox defined as the solution, has been made outright by management. We have history with Broadcom so the former makes sense, but the latter is a bit of a head scratcher. Less paperwork for me, and I won't have to justify the consequences of going with a solution that has less-than-ideal enterprise-level support. I'll just focus on mitigating those shortcomings.
 
Our prod cluster is on 100G networking and there are times when shuffling stuff around we are pushing it pretty hard.
<snip>
Out of curiosity, how many drives per node do you have that are able to pressure a 100Gb network?

For your dedicated Ceph infra, what do those nodes look like?
 
At least for now we are looking at using Ceph to get close to parity with VMware (snapshots, thin provisioning, etc.).
<snip>

Have you looked at Linbit? Not a user but seems easier to grok than Ceph IMHO...
 