Deploying a colo

dmca7919

New Member
Sep 19, 2025
Hello everyone,

I am hoping some of you can help me out with some info I am putting together.

I work for a managed service provider and we are looking to deploy our own colo for our clients. I do have a Proxmox server for my own personal use, but it's only one host, and my company is looking at deploying 3 servers to create a cluster. While I am somewhat familiar with Proxmox in my home lab (been running it for about 6-8 months), I am not as familiar with the cluster side. Here is a brief rundown of the equipment we have and what we are trying to achieve....

We currently have 3 somewhat older Dell PowerEdge servers. Two of them will have very similar specs (R420's) and the third is a T630. We will be upgrading the servers over time to be the same models/specs, this is just to get us started. So we have the 3 servers (since you need 3 for the cluster and quorum vote) and we want to be able to have high availability in case one host goes down. I have set these up in a lab for testing, along with a NAS drive for shared storage of the VM files. I have seen Ceph mentioned in the forums but I am unfamiliar with the differences between HA and Ceph, so hopefully someone can clear that up for me (TIA).

So here are my thoughts about having a system with stability and redundancy....
- For HA, the VM files would be stored on the NAS (large shared storage)
- For redundancy we were thinking of scheduled replication 3x/day (or more) to the local storage of the host(s) in case the shared storage goes down
- We were also thinking of scheduled backups 3x/day or more to another NAS device in case the replicas aren't able to be accessed

We already have a basic layout of the network topology, so the inquiries are more about the cluster itself to make sure we have enough redundancies/backups and high availability.

Any insight into these would be greatly appreciated. I'm sure there will be questions, so please let me know if there is anything else I need to provide so you can give the best info possible. Thanks a lot guys, I am looking forward to hearing everyone's views and thoughts.
 
Hi @dmca7919,
first thing I’d ask is: how flexible is your budget for the servers? If it’s not extremely tight, I’d strongly recommend looking at Ceph as the storage layer instead of relying on a single NAS.

The big advantage with Ceph is that it’s distributed and redundant by design: data is replicated across all nodes in the cluster. With a replication factor of 3, every VM disk exists simultaneously on Node-1, Node-2, and Node-3. That way you don’t just have high availability, but also real data redundancy. If one node or disk goes down, the cluster automatically heals and keeps running.

The main trade-off is capacity:
Node      Disks (OSD)   Disk size                     Raw capacity
Node-1    4             3.64 TB (NVMe or SAS/SATA)    14.56 TB
Node-2    4             3.64 TB                       14.56 TB
Node-3    4             3.64 TB                       14.56 TB
Total     12                                          43.68 TB
With 3× replication, only one third of the raw capacity is usable.
  • Usable theoretical capacity: 14.56 TB
  • Usable practical capacity (~75% rule): ~10.9 TB
That “75%” rule is important: Ceph needs free headroom so it can rebalance when an OSD (disk) fails. If you fill the cluster too far, you risk hitting 95% usage, which effectively blocks new writes.
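If it helps with budgeting, the capacity math above is easy to play with in a quick Python sketch. The numbers are just the example figures from the table; swap in your own node count, disk count, and disk size:

```python
# Rough Ceph capacity estimate for a small cluster with 3x replication.
# Figures mirror the example table above; adjust for your actual hardware.
nodes = 3
osds_per_node = 4
disk_tb = 3.64            # per-OSD disk size in TB
replication_factor = 3    # size=3 pool: each object lives on 3 nodes
fill_target = 0.75        # headroom so Ceph can rebalance after an OSD failure

raw_tb = nodes * osds_per_node * disk_tb
usable_theoretical_tb = raw_tb / replication_factor
usable_practical_tb = usable_theoretical_tb * fill_target

print(f"Raw capacity:        {raw_tb:.2f} TB")                 # 43.68 TB
print(f"Usable (size=3):     {usable_theoretical_tb:.2f} TB")  # 14.56 TB
print(f"Usable (~75% fill):  {usable_practical_tb:.2f} TB")    # ~10.9 TB
```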
So yes, you give up capacity compared to a NAS, but you gain:
  • Fault-tolerance at both node and disk level
  • True HA without a single storage bottleneck
  • Automatic healing and rebalancing when failures happen
Important note on networking:
Ceph replicates data over the network, which means your interconnect is critical. Each node should ideally have at least 2× 10 Gb NICs, dedicated to Ceph traffic, and cabled in a full mesh or through redundant switches. Without a fast, low-latency network, Ceph performance will suffer significantly.

Quick Comparison: NAS + Replication vs. Ceph

NAS + Replication
  • ✅ Simple to set up if you already have a NAS
  • ✅ Capacity efficiency (no 3× overhead)
  • ❌ Single point of failure (NAS goes down = whole cluster halts)
  • ❌ Replication is scheduled, not instant (risk of data loss between replication jobs)
  • ❌ Limited scalability
Ceph
  • ✅ Distributed and fully integrated with Proxmox
  • ✅ No single storage bottleneck
  • ✅ Real-time replication (no RPO gap like scheduled jobs)
  • ✅ Scales easily when you add more nodes/disks
  • ❌ Requires more disks and some learning curve
  • ❌ Higher overhead: only ~33% usable space with 3× replication
  • ❌ Needs solid networking: minimum 2× 10 Gb NICs per node for stable performance
If your main goal is stability and redundancy, Ceph is usually the more future-proof approach. If budget is really tight at the beginning, starting with the NAS is fine — but plan a roadmap towards Ceph once the hardware refresh happens.
 
@SteveITS thanks for that info.

@smueller@TK thank you for that breakdown, this is the exact info I need to plan the deployment. ATM I am only using 2 of the 3 servers for testing, as we didn't decide which server to use for the third until just recently. With the info you provided, I will set up the third server and test Ceph.

I do have another question... Typically we like to separate the management port(s) and VM port(s) (we do this for existing ESXi setups). Depending on the server hardware (i.e., if the server mobo has 2 on-board NICs), we use one NIC port for ESXi/Proxmox management and configure the other as the VM network. If the server has an additional NIC card installed (i.e., a 2- or 4-port NIC card), we'll configure a failover for both the management and VM networks. My question is: if Ceph is doing replication (or if using the scheduled replication), is that done through the management port?

We have Veeam replication (moving away from it) set up from some of our existing servers (ESXi), and they told me that it uses the management port to transfer the data due to how it's configured. Curious if that is the case with Proxmox replication/Ceph. If so, we'll obviously use higher speed NICs for the management port.
 
Ceph lets you specify networks. It has a so-called "public" network where VMs talk to it, and a "private" or "cluster" network where the Ceph services talk to each other for replication and rebalancing. They can be the same. Ideally both are at least 10 Gbit. See the diagram at https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/.
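For reference, those two networks map to the public_network and cluster_network settings in ceph.conf (on a Proxmox node that file lives at /etc/pve/ceph.conf). A minimal sketch, with made-up example subnets - use whatever ranges you dedicate to Ceph:

```
[global]
    # network clients (VMs, monitors) use to reach Ceph
    public_network = 10.10.10.0/24
    # optional separate network for OSD replication and heartbeats
    cluster_network = 10.10.20.0/24
```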

 

This is great! Thank you again for all this info. I will read thru this today.

In regards to speed, we do not currently have anything 10Gbps. Referencing what you said earlier about our budget, part of all the info I am trying to gather will determine what we can spend. Currently we only have 1Gbps devices (firewalls/switches). If we are unable to do 10Gbps out of the gate (we would upgrade to 10Gbps later), do you think using LAGG/LACP would suffice as an interim solution (combining two or even three 1Gb ports for faster local speeds, and probably doing the same for internet speeds as well)? I know 10Gb would be preferred, but I'm hoping 2Gb would at least help with Ceph.
 
@smueller@TK, you've done a great job organizing and presenting the data.

I do wonder regarding this point:
  • ❌ Single point of failure (NAS goes down = whole cluster halts)
I am always surprised when a discussion about business critical infrastructure design does not presume that NAS (or SAN) is generally deployed in an HA configuration.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
@bbgeek17 While @smueller@TK has provided a lot of great info about using Ceph, we will still have a NAS/shared storage device for other redundancies like scheduled replication and backup jobs. However, if you have any other suggestions for NAS devices, please feel free to list them. The more info I have, the better I can plan our deployment.
 
@bbgeek17 so not even suggested for other scheduled replication/backup jobs? I would think the more redundancies we have in place, the better the chances of recovery should Ceph fail for some reason.

Again, this is why I am asking, so we can budget for anything we may need.
 
The budget question was in the other reply. You can find discussions of Ceph speed and using 1 Gbps on the forum. Proxmox recommends 10 Gbps. The issue is that most drive interfaces are faster than that, so as one scales out with nodes and drives, the network remains the bottleneck. I haven't used LAGG for it, but it seems like it would help.
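To put rough numbers on why the network ends up as the bottleneck, here is a back-of-the-envelope sketch. It is pure line-rate arithmetic, ignores protocol overhead, and keep in mind LACP only aggregates across multiple flows, so a single stream still tops out at one member link's speed; the 3.64 TB figure just reuses the disk size from the earlier example:

```python
# Back-of-the-envelope: time to move one failed OSD's worth of data
# over various link speeds (line rate only, no protocol overhead).
def transfer_hours(data_tb: float, link_gbps: float) -> float:
    bits = data_tb * 1e12 * 8              # decimal TB -> bits
    return bits / (link_gbps * 1e9) / 3600  # seconds -> hours

data_tb = 3.64  # e.g. one failed 3.64 TB OSD worth of data
for gbps in (1, 2, 10, 25):
    print(f"{gbps:>2} Gbps: ~{transfer_hours(data_tb, gbps):.1f} h to move {data_tb} TB")
# roughly: 1 Gbps ~8.1 h, 2 Gbps ~4.0 h, 10 Gbps ~0.8 h, 25 Gbps ~0.3 h
```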
 
@SteveITS Thanks for the reply. I know we will go 10Gb shortly after we deploy, but if we can get away with LAGG/LACP until then, that allows us to move forward without spending too much out of the gate. I am going to set up a test network in my home lab to see if it will help. I appreciate the response and the info. I will probably have more questions, but this is a great start.

Thanks to both of you for all the info!
 
I am always surprised when a discussion about business critical infrastructure design does not presume that NAS (or SAN) is generally deployed in an HA configuration.
You shouldn't be. There are a lot more people (especially on these forums) that equate NAS with Synology and not Netapp, and fewer know why Netapp costs so much more.

@dmca7919 if serious, NAS is an excellent option for HA virtualization storage. While I appreciate @bbgeek17 not wanting to endorse anyone... Netapp. But having said that, I think you really need to think through WHY doing the self-managed colo is a good option for you.

Doing a full cluster EFFECTIVELY requires a good networking stack and the knowledge to set it up - ESPECIALLY when using SDS such as Ceph, since it uses standard networking for both disk and host traffic. Colos don't usually provide that - they leave it up to you. It is POSSIBLE to run a cluster with only three nodes and no networking infrastructure (full mesh), but this is fraught for production use since 3 nodes don't afford any self-healing beyond the same-node OSD level. If you have the in-house expertise (or are intending to hire it), then I'd suggest planning for a larger footprint from the start, which means the upfront cost of redundant high-speed switches (25Gb or better; there's no excuse for 10Gb in production in 2025), routers, etc. - which also means a larger U commitment and commensurate A+B power. Moving from a quarter cab to a full cab is usually not a big jump in cost.

If you don't have the in-house expertise, just resell AWS/Azure/GC: zero upfront cost, available engineering support, prebaked services such as a managed DB service, etc.
 
It seems to me that the OP used the wrong terminology when describing their goal.

Colocation means providing space in a data center for someone else’s compute, storage, and networking. Colos also provide shared network ingress/egress and, of course, power. In some cases shared networking or storage are provided as an additional service.

Hosting, on the other hand, is when a provider builds and manages their own compute, storage, and networking infrastructure, and customers then rent/run their VMs or applications on top of that. There is also, generally, an expectation of expertise in the provided services.

Whether there’s market share for the OP to capture with the type of infrastructure they currently have is something only they can determine.

