Sizing for an HA cluster with Ceph

tholderbaumRTS

New Member
Mar 22, 2023
Hello,

I am in the process of sizing a Proxmox HA cluster as a private-label cloud solution that we will resell. My design considerations are to allow for growth by adding nodes without downtime, and to ensure no single points of failure in the new environment. I was looking at Dell VxRail, but with VMware under Broadcom being a question mark, I am looking at alternatives.

Currently I have 100 VMs totaling 540 vCPUs and 1100 GB of actively committed RAM, with 58 TB of used storage, all on Hyper-V. We intend to grow this; our target is to double it within a year.

What I think I need is a Proxmox cluster using Ceph as shared storage.

According to the Ceph calculators, I am looking at 7 hosts, each with five 7.68 TB NVMe drives, which should give me 110 TB+ of safe storage with 2 replicas.

My plan is also to have redundant LAG'd 100 Gb NICs to redundant 100 Gb switches for Ceph storage, and redundant 10 Gb switches for VM traffic.

The idea is that once I have the networking and storage set up, I can essentially scale this solution as needed by adding nodes. The question is: is there a limit to the number of VMs a Ceph server can support? In VMware or Hyper-V we can calculate the vCPU and memory load and go from there, but I can't seem to find information that would tell me how many VMs I can fit into a given solution while also accounting for the load Ceph would place on the hosts.

Does it make sense to separate the VM and storage roles into separate systems?
 
Currently I have 100 VMs totaling 540 vCPUs and 1100 GB of actively committed RAM

Remember to leave room for failover; if you're thinking of 7 hosts, make sure the workload fits in 6. If you intend to do a hyperconverged setup, this needs to include Ceph overhead (roughly 1 core + 4 GB RAM per OSD and per monitor). The good news is that you can probably overprovision CPU to a wide extent; I'd measure your actual usage over a period of time to get an idea of how much.
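As a rough back-of-the-envelope check, a sketch along these lines shows whether the committed workload still fits after a node failure once the Ceph overhead is subtracted. The per-node core and RAM figures are assumptions for illustration only: 128 cores per host comes up later in the thread, and 512 GB RAM is a placeholder.

# Rough failover-headroom check for a hyperconverged Proxmox/Ceph cluster (sketch only).
NODES           = 7
SURVIVING_NODES = NODES - 1        # size the workload to fit with one node down
CORES_PER_NODE  = 128              # assumption
RAM_PER_NODE_GB = 512              # assumption; adjust to the real hardware
OSDS_PER_NODE   = 5
MONS_PER_NODE   = 1                # worst case; only a few nodes actually run monitors

# rule of thumb from above: ~1 core and ~4 GB RAM per OSD and per monitor
ceph_cores  = OSDS_PER_NODE + MONS_PER_NODE
ceph_ram_gb = 4 * (OSDS_PER_NODE + MONS_PER_NODE)

usable_cores  = SURVIVING_NODES * (CORES_PER_NODE - ceph_cores)
usable_ram_gb = SURVIVING_NODES * (RAM_PER_NODE_GB - ceph_ram_gb)

committed_vcpu, committed_ram_gb = 540, 1100   # current workload from the original post
print(f"cores left after failover: {usable_cores} (vCPU committed: {committed_vcpu})")
print(f"RAM left after failover:   {usable_ram_gb} GB (RAM committed: {committed_ram_gb} GB)")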

with 58 TB of used storage based on Hyper-V. We intend to grow this; our target is to double it within a year.
5 x 7.68 TB x 7 = 268.8 TB raw, or 89.6 TB usable with triple replication (it's actually a little less once you account for DB/WAL data), so you'll have enough to start. I would actually advise going with 10x 3.84 TB per node instead for more performance and resiliency, and maybe consider write-optimized disks depending on your VMs' workload.

My plan is also to have redundant LAG'd 100 Gb NICs to redundant 100 Gb switches for Ceph storage, and redundant 10 Gb switches for VM traffic.
Good.

The question is: is there a limit to the number of VMs a Ceph server can support?
No hard limit, but experience shows that you really don't want more than ~80 VMs or 150 containers per host; what they are actually doing will also play a part. 540 across 6 hosts yields 90 per host, so you are PROBABLY ok; under normal operation you'll be using all 7 nodes. The problem is that you never end up with a totally even distribution, so a failover can make for lopsided load. You would need to test, but I would counsel adding a node.
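To make the failover arithmetic explicit, here is a small sketch; the 540 figure and the ~80-per-host ceiling are taken from the discussion above and are guidelines, not hard limits.

# VM density per host as nodes fail (sketch; inputs are the numbers discussed above).
TOTAL_VMS = 540
NODES     = 7
GUIDELINE = 80   # soft per-host ceiling from operational experience

for failed in range(3):
    surviving = NODES - failed
    per_host = TOTAL_VMS / surviving
    verdict = "ok" if per_host <= GUIDELINE else "over guideline"
    print(f"{failed} node(s) down: {per_host:.0f} per host ({verdict})")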

Does it make sense to separate the VM and storage roles into separate systems?
Depends on what you're after. In a perfect world, yes: you reduce the potential for faults per node and will likely end up with faster and more dependable VM performance. But hyperconverged deployments can and do work.
 
Our thinking was that we would have 2 levels of expansion.

The servers I am looking at will have 24 NVMe slots, so I can expand by adding SSDs. If I go with 3.84 TB NVMe drives, I will end up limiting the total capacity of each node to roughly 78 TB (I am assuming 85% of 24x 3.84 TB drives, with 3 replicas, at maximum build-out). I would get more from bigger drives, but then a node outage means the cluster has to work that much harder to rebuild after a node failure.

It's a balancing act, but I like that idea. I am looking to go with N+3 failures to tolerate, so smaller drives across more slots means less exposure to capacity loss.

A question that I thought of: can Ceph be expanded asymmetrically? Do I always have to add storage when adding nodes?

I am also testing a spinning-rust (HDD) NFS shared storage solution for capacity drives, using TrueNAS Scale. The idea is that for clients who need more capacity and less I/O, we have a non-flash option. I have run TrueNAS Core up to 2-3 PB, so I know that solution is stable. Our clients have a lot of VDI and thin clients running against terminal servers now. The key is flexibility in what we can offer them.

7 nodes gives me N+2 today. As we grow I will add nodes, but the goal is to get the networking and architecture right. The switches I am looking at are 32x 100 Gb, so conceivably this stack will top out at 15 nodes. If I reserve 125 GB of RAM for Ceph and overhead, I would have 1800 GB for my VMs today with 7 nodes (N+2).

Do you know how Proxmox compares to Hyper-V and VMware in CPU overcommit calculations? I am assuming that, because of Linux/KVM, I am closer to bare-metal CPU performance. If so, I should be OK there as well; 128 cores should be more than sufficient. On my current Hyper-V setup I am getting 5.8 vCPU to a core, and CPU has never been a constraint on us; memory has always been the bottleneck. VMware said I could get 4:1 easily, so if Proxmox gets near that efficiency I should be able to support 350 vCPU per host comfortably. I would target 66% of that as the point where I add a node.
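A minimal sketch of that overcommit math, treating the figures quoted above (128 physical cores, the 4:1 ratio, the 350 vCPU comfort target, and the 66% expansion trigger) as planning assumptions rather than measured values:

# CPU overcommit planning sketch; all inputs are the figures quoted in this thread.
CORES_PER_HOST    = 128
OVERCOMMIT_RATIO  = 4.0     # VMware's suggested 4:1 vCPU:pCPU
COMFORT_VCPU      = 350     # per-host target mentioned above
EXPANSION_TRIGGER = 0.66    # add a node at 66% of the comfort target

theoretical_vcpu = CORES_PER_HOST * OVERCOMMIT_RATIO   # 512 vCPU/host at 4:1
trigger_vcpu     = COMFORT_VCPU * EXPANSION_TRIGGER    # ~231 vCPU/host

print(f"theoretical vCPU per host at 4:1: {theoretical_vcpu:.0f}")
print(f"comfortable target per host:      {COMFORT_VCPU}")
print(f"add-a-node trigger per host:      {trigger_vcpu:.0f}")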
 
Your thoughts sound good.
Our thinking was that we would have 2 levels of expansion.

The servers I am looking at will have 24 NVMe slots, so I can expand by adding SSDs. If I go with 3.84 TB NVMe drives, I will end up limiting the total capacity of each node to roughly 78 TB (I am assuming 85% of 24x 3.84 TB drives, with 3 replicas, at maximum build-out). I would get more from bigger drives, but then a node outage means the cluster has to work that much harder to rebuild after a node failure.

It's a balancing act, but I like that idea. I am looking to go with N+3 failures to tolerate, so smaller drives across more slots means less exposure to capacity loss.

A question that I thought of: can Ceph be expanded asymmetrically? Do I always have to add storage when adding nodes?
You can also add nodes to the Proxmox cluster that have Ceph installed but do not provide OSDs. You just have to be careful not to overload the nodes that do carry OSDs, because they have to serve the requests of all the other nodes.

I am also testing a spinning-rust (HDD) NFS shared storage solution for capacity drives, using TrueNAS Scale. The idea is that for clients who need more capacity and less I/O, we have a non-flash option. I have run TrueNAS Core up to 2-3 PB, so I know that solution is stable. Our clients have a lot of VDI and thin clients running against terminal servers now. The key is flexibility in what we can offer them.

7 nodes gives me N+2 today. As we grow I will add nodes, but the goal is to get the networking and architecture right. The switches I am looking at are 32x 100 Gb, so conceivably this stack will top out at 15 nodes. If I reserve 125 GB of RAM for Ceph and overhead, I would have 1800 GB for my VMs today with 7 nodes (N+2).
If you have two switches with 32x 100 GBit and distribute the servers redundantly across both switches, you can connect 28-30 servers, depending on how much uplink you allocate on the switches.
Do you know how Proxmox compares to Hyper-V and VMware in CPU overcommit calculations? I am assuming that, because of Linux/KVM, I am closer to bare-metal CPU performance. If so, I should be OK there as well; 128 cores should be more than sufficient. On my current Hyper-V setup I am getting 5.8 vCPU to a core, and CPU has never been a constraint on us; memory has always been the bottleneck. VMware said I could get 4:1 easily, so if Proxmox gets near that efficiency I should be able to support 350 vCPU per host comfortably. I would target 66% of that as the point where I add a node.
I did benchmarks with Hyper-V and vSphere; there were no big differences.

When benchmarking vSphere vs. Proxmox, I had about 20% more performance with Linux VMs, about 15% more performance with Windows VMs and about 110% more performance with Windows and SQL databases on Proxmox.

But the benchmark was with PVE 7.1 vs. vSphere 7 U3.
 
Hi @tholderbaumRTS,

It might be helpful for you to consider the latency impact of hyper-converged vs. dedicated CEPH. You will likely be better off having CEPH on dedicated hardware. Yes, this does require much more hardware, rack space, ports, and power. But, the key latency killer for KVM is context switching. If you service storage requests on your compute nodes, you force a ton of additional context switches. These context switches inject latency to both your I/Os and compute. You pointed out that your workload is VDI, which is typically latency sensitive. CEPH, in general, has relatively high latency due to the way that it processes I/Os. Your best bet is likely to refrain from having it compete with your compute, especially if you plan to oversubscribe your CPUs.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
It's a balancing act, but I like that idea. I am looking to go with N+3 failures to tolerate, so smaller drives across more slots means less exposure to capacity loss.
N+3 failures for which domain? If compute, you need enough spare CPU and RAM capacity to absorb three failed nodes; if for storage (OSD hosts), you need to set your replication policy to 4-way. If for drives, you can already lose an entire node's worth of drives and not skip a beat. The question is, why? Are you operating in a risky environment where multiple failures are common?
7 nodes gives me N+2 today.
Again, which domain(s)?
If I go with 3.84 TB NVMe drives, I will end up limiting the total capacity of each node to roughly 78 TB (I am assuming 85% of 24x 3.84 TB drives, with 3 replicas, at maximum build-out). I would get more from bigger drives, but then a node outage means the cluster has to work that much harder to rebuild after a node failure.
While per-node capacity is of some importance (for cluster balance), typically the value you want to measure is usable cluster capacity. Generally this is calculated as follows:

total capacity of OSDs / policy (typically triple replication, so 3) * 0.8.

So in the above example, you'd have 70x 3.84 TB disks, or 268.8 TB raw;
268.8 / 3 = 89.6 TB;
at 80% utilization, 89.6 * 0.8 = 71.68 TB usable.

This is identical to a configuration with 35x 7.68 TB disks.
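For reference, the same formula as a tiny helper; the 0.8 factor mirrors the 80% utilization target above.

# Usable Ceph capacity: raw OSD capacity / replica count * fill target (sketch).
def usable_capacity_tb(osd_count, osd_size_tb, replicas=3, fill_target=0.8):
    return osd_count * osd_size_tb / replicas * fill_target

print(f"{usable_capacity_tb(70, 3.84):.2f} TB usable")   # 7 nodes x 10 drives of 3.84 TB
print(f"{usable_capacity_tb(35, 7.68):.2f} TB usable")   # same raw capacity, fewer and larger OSDs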
It might be helpful for you to consider the latency impact of hyper-converged vs. dedicated CEPH. You will likely be better off having CEPH on dedicated hardware. Yes, this does require much more hardware, rack space, ports, and power. But, the key latency killer for KVM is context switching.
Said more succinctly than me :) Bear in mind this is not an all-or-nothing proposition: you can have nodes with OSDs, without OSDs, with monitors, managers, etc., in any combination.
 
A question that I thought of: can Ceph be expanded asymmetrically? Do I always have to add storage when adding nodes?
You can also add nodes to the Proxmox cluster that have Ceph installed but do not provide OSDs. You just have to be careful not to overload the nodes that do carry OSDs, because they have to serve the requests of all the other nodes.
One additional note: when adding OSDs, you REALLY want to ensure that the total capacity presented per node is more or less the same, regardless of individual OSD size. If the reason isn't obvious: losing a node that contains an oversized share of OSD space relative to the others can result in insufficient room for placement. So if you're adding a new node with OSDs, it's good practice to add enough disks to equal the other nodes' capacity.
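A quick way to sanity-check that is to compare each node's total OSD capacity against the cluster average. The sketch below uses made-up node names and drive counts, and flags nodes that deviate by more than about 10%.

# Flag nodes whose total OSD capacity is out of line with the rest (sketch, hypothetical inventory).
node_osds_tb = {
    "pve1": [3.84] * 10,   # 38.4 TB
    "pve2": [3.84] * 10,   # 38.4 TB
    "pve3": [7.68] * 6,    # ~46.1 TB from fewer, larger drives
}

totals  = {node: sum(drives) for node, drives in node_osds_tb.items()}
average = sum(totals.values()) / len(totals)

for node, total in totals.items():
    deviation = (total - average) / average
    flag = "  <-- capacity imbalance" if abs(deviation) > 0.10 else ""
    print(f"{node}: {total:.1f} TB ({deviation:+.1%} vs average){flag}")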
 
Very interesting thread. Thanks, guys, for your input and the interesting read.
I am myself pretty new to Proxmox. Proxmox is based on KVM, and Microsoft's Azure cloud is based on KVM as well (Microsoft knows why they don't use their own Hyper-V solution). ESXi is a brilliant product too, but overloaded. I requested a price for a 3-node cluster with vSAN and NSX (which, in a way, Proxmox can also do): 128,000 dollars. VMware's prices are pretty sick anyway. KVM is also, from a performance perspective, the closest to hardware speed.
My struggles were 2 things:
1. Getting the network to run properly - I now have 4 interfaces per node at 10 Gb (1x Ceph private, 1x Ceph public, 1x internal traffic (SDN / communication between servers on the same subnet across different nodes), 1x external traffic).
2. One update set my whole cluster offline (no network connection was working) - I highly recommend testing updates before installing them (especially in a sensitive environment like yours).

The only big out-of-the-box issue, which is fixable, is the internal traffic from node to node between servers on the same subnet - but use EVPN (use it from the beginning) and everything works flawlessly.
An HA event works perfectly: in less than 2 minutes the copy of the VM starts on another node (I think it can be tuned to less than 2 minutes). No data loss.

Your thread is interesting to me because I am expanding, and my next cluster will be like the one you describe.

Currently I am working on an autoscaler script based on CPU, RAM and storage use. That is what is missing (for me): if you deploy 10 VMs on node 1, the script should move VMs automatically to nodes 2, 3, etc., so that all workload is balanced. When it is finished I will publish it here and send it to Proxmox to add as a feature.

Please allow me one slight correction: "storage" usually means a NetApp or another storage appliance; the drives IN THE server are referred to as local storage (it confused me a bit, but from your whole thread I could understand what you meant).
Thanks for the interesting subject; I'll keep following your thread.
 
Hello @pille99,
If you plan to build a larger cluster, I would never plan Ceph below 25 GBit; better 100 GBit.
Nowadays I plan Ceph clusters with 2x 100 GBit for storage, not separated into private/public but as an LACP bond carrying all Ceph traffic.
For the LAN I also try to go directly to 25 GBit for new servers.

On your point 2, I suspect a misplanned Corosync network.
The Corosync service is very latency-sensitive and does not like dropouts.
Either use dedicated network cards (min. 1 GBit) for Corosync, or, if it shares an interface with other traffic, make its primary link an interface that does not carry high load.
The Ceph network and the migration network are rather unsuitable for Corosync.
 
So since you guys helped me out, I present to you my phase 1 proof of concept reference design.

I have this already built in a virtual lab (obviously without the hardware)

I am further exploring the networking setup; I am planning on each client having an isolated VM.

Notice how I have dedicated NICs for cluster traffic? That is a trick I picked up from VMware land.

Phase 1 will be three nodes in N+1 for about 1/3 of my load. Then we will expand. Full deployment will be 7 nodes as N+2. Max deployment is 15 nodes. Then I build another rack.

The VMware bill for this would be $350,000. That is money that is going straight into my pocket. We are going with the top-end support license to support the project.

With Broadcom taking over VMware, and with Microsoft just flat-out ripping people off with the per-core Datacenter license, I see a bright future for Proxmox.
 

Attachments

  • Colo 3.0 Concept 0.1 (1).pdf (184.4 KB)
Hello @pille99,
If you plan to build a larger cluster, I would never plan Ceph below 25 GBit; better 100 GBit.
Nowadays I plan Ceph clusters with 2x 100 GBit for storage, not separated into private/public but as an LACP bond carrying all Ceph traffic.
For the LAN I also try to go directly to 25 GBit for new servers.

On your point 2, I suspect a misplanned Corosync network.
The Corosync service is very latency-sensitive and does not like dropouts.
Either use dedicated network cards (min. 1 GBit) for Corosync, or, if it shares an interface with other traffic, make its primary link an interface that does not carry high load.
The Ceph network and the migration network are rather unsuitable for Corosync.
Dedicated 10 Gb Corosync NICs on dedicated redundant switches.

For Ceph, the initial plan is a dedicated redundant 100 Gb network. Eventually I would upgrade to 400 Gb on the storage side, move the 100 Gb to the public network, and redeploy the 10 Gb switches to other roles.
 
One additional note: when adding OSDs, you REALLY want to ensure that the total capacity presented per node is more or less the same, regardless of individual OSD size. If the reason isn't obvious: losing a node that contains an oversized share of OSD space relative to the others can result in insufficient room for placement. So if you're adding a new node with OSDs, it's good practice to add enough disks to equal the other nodes' capacity.
Thanks. We came to that conclusion as well. I have to say that even though we have experience in Hyper-V and VMware land, I have been shocked at how easily we got all of this up and running. From having first heard about Proxmox a week ago, I have a working lab model in less than a week.

If you are reading this thinking of trying it, download it and set it up as a series of VMs.
 
Very interesting thread. Thanks, guys, for your input and the interesting read.
I am myself pretty new to Proxmox. Proxmox is based on KVM, and Microsoft's Azure cloud is based on KVM as well (Microsoft knows why they don't use their own Hyper-V solution). ESXi is a brilliant product too, but overloaded. I requested a price for a 3-node cluster with vSAN and NSX (which, in a way, Proxmox can also do): 128,000 dollars. VMware's prices are pretty sick anyway. KVM is also, from a performance perspective, the closest to hardware speed.
My struggles were 2 things:
1. Getting the network to run properly - I now have 4 interfaces per node at 10 Gb (1x Ceph private, 1x Ceph public, 1x internal traffic (SDN / communication between servers on the same subnet across different nodes), 1x external traffic).
2. One update set my whole cluster offline (no network connection was working) - I highly recommend testing updates before installing them (especially in a sensitive environment like yours).

The only big out-of-the-box issue, which is fixable, is the internal traffic from node to node between servers on the same subnet - but use EVPN (use it from the beginning) and everything works flawlessly.
An HA event works perfectly: in less than 2 minutes the copy of the VM starts on another node (I think it can be tuned to less than 2 minutes). No data loss.

Your thread is interesting to me because I am expanding, and my next cluster will be like the one you describe.

Currently I am working on an autoscaler script based on CPU, RAM and storage use. That is what is missing (for me): if you deploy 10 VMs on node 1, the script should move VMs automatically to nodes 2, 3, etc., so that all workload is balanced. When it is finished I will publish it here and send it to Proxmox to add as a feature.

Please allow me one slight correction: "storage" usually means a NetApp or another storage appliance; the drives IN THE server are referred to as local storage (it confused me a bit, but from your whole thread I could understand what you meant).
Thanks for the interesting subject; I'll keep following your thread.
Apparently 7.4 has some auto-scaling/tuning built in. For us, watching my Hyper-V servers constantly move things around is annoying; I don't trust the VM placement.

Secondly, see the diagram I posted. The servers themselves will have PCIe M.2 BOSS cards for booting; all the drives in the front are for Ceph. I will have an archive server for ISOs, templates, and the like.
 
N+3 failures for which domain? If compute, you need enough spare CPU and RAM capacity to absorb three failed nodes; if for storage (OSD hosts), you need to set your replication policy to 4-way. If for drives, you can already lose an entire node's worth of drives and not skip a beat. The question is, why? Are you operating in a risky environment where multiple failures are common?

Again, which domain(s)?

While per-node capacity is of some importance (for cluster balance), typically the value you want to measure is usable cluster capacity. Generally this is calculated as follows:

total capacity of OSDs / policy (typically triple replication, so 3) * 0.8.

So in the above example, you'd have 70x 3.84 TB disks, or 268.8 TB raw;
268.8 / 3 = 89.6 TB;
at 80% utilization, 89.6 * 0.8 = 71.68 TB usable.

This is identical to a configuration with 35x 7.68 TB disks.

Said more succinctly than me :) Bear in mind this is not an all-or-nothing proposition: you can have nodes with OSDs, without OSDs, with monitors, managers, etc., in any combination.
Sorry I misspoke. It will be N+2 for the hosts and 3 replicas for Ceph.

The primary reason for all of this redundancy is uptime and reliability. I also like to go on vacation from time to time, and when I do, my coworkers cover for me; they are excellent, but they are not solution architects.

So I will leave them detailed and specific instructions on how to recover from any failure. N+2 or even N+3 allows for hosts to be down for maintenance without causing unacceptable exposure to risk.

In my view ALL systems are inherently risky environments. Anything could happen. Probably won’t. And I probably have not thought of everything. But I am prepared.
 
Hi @tholderbaumRTS,

It might be helpful for you to consider the latency impact of hyper-converged vs. dedicated CEPH. You will likely be better off having CEPH on dedicated hardware. Yes, this does require much more hardware, rack space, ports, and power. But, the key latency killer for KVM is context switching. If you service storage requests on your compute nodes, you force a ton of additional context switches. These context switches inject latency to both your I/Os and compute. You pointed out that your workload is VDI, which is typically latency sensitive. CEPH, in general, has relatively high latency due to the way that it processes I/Os. Your best bet is likely to refrain from having it compete with your compute, especially if you plan to oversubscribe your CPUs.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
I am considering this as well. This might be a phase 3 upgrade. But it is on my radar and I will let you know.
 
Your thoughts sound good.

You can also add nodes to the Proxmox cluster that have Ceph installed but do not provide OSDs. You just have to be careful not to overload the nodes that do carry OSDs, because they have to serve the requests of all the other nodes.


If you have two switches with 32x 100 GBit and distribute the servers redundantly across both switches, you can connect 28-30 servers, depending on how much uplink you allocate on the switches.

I did benchmarks with Hyper-V and vSphere; there were no big differences.

When benchmarking vSphere vs. Proxmox, I had about 20% more performance with Linux VMs, about 15% more performance with Windows VMs and about 110% more performance with Windows and SQL databases on Proxmox.

But the benchmark was with PVE 7.1 vs. vSphere 7 U3.
See my map. The intention is redundant links to redundant switches. In our initial setup the plan is to try for bonded 200 Gb links to each switch, but that is probably wishful thinking.
 
See my map. The intention is redundant links to redundant switches. In our initial setup the plan is to try for bonded 200 Gb links to each switch, but that is probably wishful thinking.
You have 4x 100GBit distributed on two switches on your plan. Do you want to split 200 GBit public and 200 GBit cluster?
Depending on which switches you use, I would form an MLAG and then make an LACP over 400 GBit.
This would also give you redundancy for the Ceph cluster network.
 
You have 4x 100GBit distributed on two switches on your plan. Do you want to split 200 GBit public and 200 GBit cluster?
Depending on which switches you use, I would form an MLAG and then make an LACP over 400 GBit.
This would also give you redundancy for the Ceph cluster network.
The intention is to completely separate the networks for cluster communications, VM communications, and storage. On the storage side, we will MLAG everything we can as long as we get redundancy. That will be fully tested before we are in production.

The VM network is less well defined, but it seems desirable to separate host networking from VM networking, especially considering that we have tons of VLANs and try to enforce VM isolation.

Admittedly that is a Hyper-V network design, but it holds up in our lab.
 
The intention is to completely separate the networks for cluster communications, VM communications, and storage. On the storage side, we will MLAG everything we can as long as we get redundancy. That will be fully tested before we are in production.

The VM network is less well defined, but it seems desirable to separate host networking from VM networking, especially considering that we have tons of VLANs and try to enforce VM isolation.

Admittedly that is a Hyper-V network design, but it holds up in our lab.
I am only talking about Ceph Public and Ceph Cluster here.
I will leave Proxmox Cluster (Corosync) and VM traffic out of consideration for now.
 
