Proxmox HA cluster with Ceph

cgeerinckx

New Member
Nov 12, 2025
Hello,

I don't know if this question has been asked already. If so I have not found it.

Question:

Would it be good practice to set up a Proxmox/Ceph HA cluster using 5 nodes:

node 1: Proxmox/Ceph, running VMs
node 2: Proxmox/Ceph, running VMs
node 3: Proxmox/Ceph, running VMs
node 4: Proxmox/Ceph, no VMs
node 5: Proxmox/Ceph, no VMs

In this scenario, nodes 4 and 5 would use servers with fewer CPU cores and less RAM.

5 nodes for Ceph because of best practice
3 nodes for VMs because that will be sufficient
2 nodes with fewer CPU cores/less RAM to save costs.
 
It's true that Ceph recommends a minimum of 5 nodes, but there are many 3-node configurations that have been running for a long time without any major problems. With the right hardware, of course.
Furthermore, I don't think your idea of having 2 nodes with low CPU/RAM is a good one. You would risk lower performance, since Ceph (like all distributed file systems) is only as fast as the slowest node.
 
With a 3-node Ceph cluster you need to be careful when planning how many disks you add as OSDs. More but smaller disks are preferred, because if just a single disk fails, Ceph can only recover the data onto the remaining OSDs of the same node in such a small cluster.
For example, if you use only 2 large OSDs per node and one of them fails, the remaining one can quickly run full while Ceph is recovering the lost data onto it.
If you have 4 or more smaller OSDs, the chance of this happening is a lot smaller.
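To put rough numbers on that, here is a hypothetical back-of-the-envelope sketch (not real cluster data; it assumes replicated pools with size=3, so each host keeps one copy and a failed OSD can only be rebuilt onto the other OSDs of the same host):

```python
# Rough sketch: how full the surviving OSDs on one host get after a single
# OSD failure, when Ceph rebuilds the lost replicas onto that same host.
# Hypothetical numbers; real utilisation depends on pool usage and balancing.

def utilisation_after_failure(osds_per_node: int, osd_size_tb: float, used_fraction: float) -> float:
    """Return the fill level of the surviving OSDs on the affected node."""
    node_capacity = osds_per_node * osd_size_tb           # raw TB per node
    used = node_capacity * used_fraction                   # data that must stay on this node (1 replica per host)
    surviving_capacity = (osds_per_node - 1) * osd_size_tb
    return used / surviving_capacity

# Same raw capacity per node (~10 TB), 60% used:
print(f"2 x 5.1 TB OSDs:  {utilisation_after_failure(2, 5.10, 0.6):.0%}")  # -> 120%: the surviving OSD runs full
print(f"4 x 2.55 TB OSDs: {utilisation_after_failure(4, 2.55, 0.6):.0%}")  # -> 80%: tight, but survivable
```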

If you want to run a 2-node cluster, you could consider using local ZFS (same pool name on both nodes) and the Replication feature. That is asynchronous though, with the shortest interval at 1 minute. So if you combine that with HA, you could have some potential data loss, depending on when the last replication ran.
Whether that caveat is acceptable depends on your situation.
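A minimal sketch of that trade-off (hypothetical timestamps; the point is only that the worst case equals everything written since the last successful replication run):

```python
from datetime import datetime, timedelta

# Worst-case data loss with async ZFS replication + HA:
# writes since the last successful replication are gone when the node dies
# and the VM is restarted on the other node from the replica.
replication_interval = timedelta(minutes=1)               # shortest schedule available
last_replication = datetime(2025, 11, 12, 10, 0, 0)       # hypothetical
node_failure     = datetime(2025, 11, 12, 10, 0, 47)      # hypothetical

loss_window = node_failure - last_replication
print(f"Writes from the last {loss_window.seconds}s are lost "
      f"(worst case: {replication_interval.seconds}s).")
```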

In any case, for a 2-node cluster you need a 3rd vote! https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support
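A small sketch of why (the corosync majority rule: with only 2 votes, a single surviving node can never hold a majority, so an external QDevice supplies the tie-breaking 3rd vote):

```python
# Corosync needs a strict majority of votes for quorum.
def votes_needed(total_votes: int) -> int:
    return total_votes // 2 + 1

def tolerable_failures(total_votes: int) -> int:
    return total_votes - votes_needed(total_votes)

for nodes, qdevice_votes in [(2, 0), (2, 1), (3, 0), (5, 0)]:
    total = nodes + qdevice_votes
    print(f"{nodes} nodes + {qdevice_votes} QDevice vote(s): "
          f"quorum={votes_needed(total)}, survives {tolerable_failures(total)} lost vote(s)")

# 2 nodes alone survive 0 lost votes -> any single node down stops the cluster.
# 2 nodes + QDevice survive 1       -> one node can fail or be rebooted.
```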

If you have a dedicated physical PBS planned, then have a look at this: https://forum.proxmox.com/threads/planning-advice.169434/#post-791732
 
Three nodes will work, but you need to be aware of the limitations: with three nodes the cluster can tolerate the loss of one node (due to a failure or a maintenance reboot, etc.) but cannot auto-heal (this would need more nodes). If you can live with that (because you consider the probability that two of three nodes might be down at the same time low enough that you are willing to accept the risk), that's OK. I remember @Falk R. and @LnxBil mentioning several times that they set up such three-node clusters for customers who are happy with that environment.
Some other caveats concerning small clusters by @UdoB: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/
Please note that Udo's example is (deliberately) an extreme case, with just one OSD per node and minimal network hardware (no redundancy, no dedicated network for corosync or Ceph data, under 10 Gbit/s bandwidth, etc.), so parts of it don't apply if you do your network architecture and hardware sizing correctly. Nonetheless it's a good read to get an idea of the caveats with small (<5 nodes) Ceph clusters.
 
In this scenario, nodes 4 and 5 would use servers with fewer CPU cores and less RAM.

5 nodes for Ceph because of best practice
3 nodes for VMs because that will be sufficient
2 nodes with fewer CPU cores/less RAM to save costs.

Ceph really, really wants homogeneous hardware. Meaning same CPU, memory, networking, storage, storage controller, firmware, etc.

While it's true you can run a 3-node cluster, you can only have a 1-node outage. With 5 nodes, you can have a 2-node outage. So, for production, 5 nodes minimum.

With that being said, at least get the memory & storage even across all 5 nodes. As for the CPUs, as long as it's the same CPU family, you'll be OK. Your future self will thank you.

Best option is having same hardware on all nodes.
 
With 5 nodes, you can have a 2-node outage. So, for production, 5 nodes minimum.
This statement is a bit too general or dogmatic for my taste. I know of several reports in this forum where people use three-node clusters in production in SMBs.
Usually they didn't have the budget for five nodes, but could tolerate the risk of a two-node outage, while not being fine with the potential data loss involved with storage replication. The main goal of their cluster is the ability to do maintenance tasks like Proxmox VE OS patches or hardware upgrades on one node without downtime for the hosted services. The potential risk of a failure on one of the remaining two nodes was considered acceptable for them. Obviously this consideration isn't feasible for everyone, but I can imagine a lot of production environments where that trade-off would be legitimate. YMMV.
 
Ceph really, really wants homogeneous hardware. Meaning same CPU, memory, networking, storage, storage controller, firmware, etc.

While it's true you can run a 3-node cluster, you can only have a 1-node outage. With 5 nodes, you can have a 2-node outage. So, for production, 5 nodes minimum.

With that being said, at least get the memory & storage even across all 5 nodes. As for the CPUs, as long as it's the same CPU family, you'll be OK. Your future self will thank you.

Best option is having same hardware on all nodes.
I did intend to use the same CPU family.

Let's say (CPUs chosen to make the question clearer):

Nodes 1, 2 and 3: AMD EPYC™ Turin 9475F - 48C/96T - 3.65GHz - 4.8GHz boost - 256MB - 400W - SP5
Nodes 4 and 5: AMD EPYC™ Turin 9135 - 16C/32T - 3.65GHz - 4.3GHz boost - 64MB - 200W - SP5

For RAM (there would be 10.2 TB raw storage on each node, using 3 x 3.4 TB NVMe):

Nodes 1, 2 and 3: 512 GB RAM
Nodes 4 and 5: 64 GB RAM

The extra cores and RAM on nodes 1, 2 and 3 would be used by VMs.
 
With a 3-node Ceph cluster you need to be careful when planning how many disks you add as OSDs. More but smaller disks are preferred, because if just a single disk fails, Ceph can only recover the data onto the remaining OSDs of the same node in such a small cluster.
For example, if you use only 2 large OSDs per node and one of them fails, the remaining one can quickly run full while Ceph is recovering the lost data onto it.
If you have 4 or more smaller OSDs, the chance of this happening is a lot smaller.

If you want to run a 2-node cluster, you could consider using local ZFS (same pool name on both nodes) and the Replication feature. That is asynchronous though, with the shortest interval at 1 minute. So if you combine that with HA, you could have some potential data loss, depending on when the last replication ran.
Whether that caveat is acceptable depends on your situation.

In any case, for a 2-node cluster you need a 3rd vote! https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support

If you have a dedicated physical PBS planned, then have a look at this: https://forum.proxmox.com/threads/planning-advice.169434/#post-791732
The Ceph cluster would be 5 nodes.
Of those 5 nodes, only the first 3 would also be used to host VMs.
 
Ceph really, really wants homogeneous hardware. Meaning same CPU, memory, networking, storage, storage controller, firmware, etc.
Where did you get that idea?! Ceph doesn't care about ANY of those things. Just be aware that the overall performance of the cluster will be as slow as the slowest monitor.

Yes, for live migration, manual/HA. Hence, you should really have the same amount of memory across nodes.
Ceph doesn't need or perform any live migration. As for memory, nodes should contain as much memory as their intended load requires; in other words, nodes that have 20 OSDs need ~80 GB of RAM for the OSDs regardless of what's on other nodes, etc. If you intend to fail over 40 GB of RAM onto surviving nodes, those nodes should contain 40 GB of FREE MEMORY BETWEEN THEM, subject to individual VM granularity. Incidentally, you don't need to (and there is reason not to) include all your nodes for workload use, and then those wouldn't even count toward HA. An example would be dedicated OSD nodes.
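For illustration, a rough sketch of that sizing logic (it assumes the common ~4 GiB osd_memory_target default per OSD; the exact value is tunable, and the free-memory figures are hypothetical):

```python
# Rough memory sizing sketch for a combined Ceph/VM node.
OSD_MEMORY_TARGET_GIB = 4      # common default per OSD, tunable

def osd_memory_gib(num_osds: int) -> int:
    """Approximate RAM consumed by the OSD daemons alone."""
    return num_osds * OSD_MEMORY_TARGET_GIB

print(f"20 OSDs -> ~{osd_memory_gib(20)} GiB just for the OSD daemons")  # ~80 GiB, the figure above

# HA failover: the VMs of a dead node must fit into the *free* memory of the
# surviving nodes, and each individual VM must fit on a single node.
failover_ram_gib = 40                 # RAM of the VMs to be restarted (hypothetical)
free_on_survivors = [24, 20]          # free GiB per surviving node (hypothetical)
print("Enough total headroom:", sum(free_on_survivors) >= failover_ram_gib)
```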