Cluster setup and HA configuration suggestions

kryrena

New Member
Apr 5, 2026
Hi all,

I’m putting together a Proxmox cluster with Ceph for HA and wanted to get some feedback before I go ahead and deploy everything.

What I’m aiming for is fairly simple: I want proper HA with no data loss and automatic failover, but at the same time I’d still like one node (an R640) to act as the “preferred” node where all VMs normally run. If that node goes down, the VMs should automatically restart on a secondary node.

Hardware-wise, I’ll have three main nodes. The first is a Dell R640 with 8x NVMe and 25Gb, which will be the main compute node. The second is an Intel server with 6x NVMe and 25Gb, which I’m planning to use as the secondary. The third is an HP CL3100 with 4x 16TB HDDs and a couple of SSDs, mainly intended for capacity storage. There’s also an older R630 that I might either exclude or use for non-critical workloads. Currently the Intel server is the only node and has all my VMs and data on it.

Networking will be 25Gb, and I’m planning to keep Ceph traffic on a dedicated interface.

The idea is to run a 3-node Proxmox cluster with Ceph, using an NVMe pool for VM disks and a separate HDD pool for backups or colder data. Replication would be size=3 (I think; I don't fully understand this yet). On top of that, I'd use Proxmox HA to keep VMs pinned to the R640 as the preferred node, with failover to the Intel box if needed.

I just want to sanity check a few things before I proceed. Does this overall design make sense for a production HA setup with Ceph? For separating NVMe and HDD storage, is it better to rely on device classes or define custom CRUSH rules? Also, is it a good idea to include the HP node in the Ceph cluster, or would it be better to keep it separate and use it only for backups?

Main priorities are reliability, predictable failover, and keeping the setup manageable.
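
For the pinning part, my understanding is that an HA group with node priorities would do it, something along these lines (group and node names here are just placeholders for my hosts, and vm:100 is an example VM ID):

Code:
    # Higher priority number = preferred node; nofailback 0 lets VMs move back once the R640 returns
    ha-manager groupadd prefer-r640 --nodes "r640:2,intel:1" --nofailback 0
    ha-manager add vm:100 --group prefer-r640

Happy to be corrected if there's a better way to express the preference.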

Hardware:

Current / Running:
  • Intel Server – 2x Xeon Gold 6138, 128GB RAM, 6x NVMe Samsung PM983, 25GbE (to be upgraded to 256GB RAM)
  • Dell R630 – older server, can be repurposed
Planned / New:
  • Dell R640 – 2x Xeon Platinum 8164, 8x NVMe Kioxia CD8, 25GbE
  • HP CL3100 – 2x Xeon E5-2683v4, 4x 16TB HDD, 2x SATA SSD, 25GbE
Networking:
  • Dell S5148F-ON switch
 
Read through https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/. Three nodes will technically work, but as soon as you reboot one you're at the minimum of 2 replicas.

You don’t give disk sizes for the first two, but unbalanced storage can cause issues when the “large” server goes down, as Ceph tries to “move” those PGs to other servers. Though with only 3 nodes it can’t.

HDDs in the pool will affect performance, since Ceph stores one replica on each of the three servers.
 
Reactions: UdoB
For separating NVMe and HDD storage, is it better to rely on device classes or define custom CRUSH rules?
You need to create 2 custom CRUSH rules using the device classes:

Code:
    ceph osd crush rule create-replicated <crush_rule_name> default host <device_class>

ex:

Code:
    ceph osd crush rule create-replicated myhddreplication default host hdd
    ceph osd crush rule create-replicated myssdreplication default host ssd

then
- create 1 pool for hdd with your myhddreplication crush rule
- create 1 pool for ssd with your myssdreplication crush rule
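
Creating the pools could then look something like this (pool names and PG counts are just examples; note that NVMe drives report device class "ssd" by default):

Code:
    ceph osd pool create nvmepool 128 128 replicated myssdreplication
    ceph osd pool create hddpool 128 128 replicated myhddreplication
    ceph osd pool set nvmepool size 3
    ceph osd pool set hddpool size 3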
 
Isn't that true for any ceph cluster with 3 replicas?
With only three nodes (and one of them down) there is no chance to rebalance. It will stay "degraded" forever, as the "failure domain" usually is "host".
 
Isn't that true for any ceph cluster with 3 replicas?
I answered quickly, but my point was supposed to be about "a server being down", which is either a "reboot" (temporary) or something more permanent, in which case Ceph cannot recover to 3/2 replicas.

Similar issue if a drive dies and an OSD is down. Ceph will recreate the missing PGs on other drives on the same host (the only one it can, using the default 3/2), so the other drives need enough free space to hold what was on the hypothetical missing drive. (Plus the nearfull ratio defaults to 85% per OSD.)
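
You can keep an eye on per-OSD fullness and the configured ratios with the standard commands:

Code:
    ceph osd df tree             # per-OSD utilization and PG counts
    ceph osd dump | grep ratio   # full/backfillfull/nearfull ratios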

Yes there are the flags to prevent rebalancing for a simple reboot.

If 1 of 3 servers is down, Ceph has only 2/2 copies of PGs, but the whole cluster is also at increased risk, since 2/3 is one failure away from losing quorum.

Yes it can all work with 3, I’m just trying to point out the potential issues.

4 servers plus a Qdevice would allow for two servers to be offline at any one time.
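
For reference, the QDevice setup is roughly the following (the IP is a placeholder for whatever small machine outside the cluster acts as the external voter):

Code:
    # On the external voter:
    apt install corosync-qnetd
    # On one cluster node:
    apt install corosync-qdevice
    pvecm qdevice setup <qdevice-ip>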
 
I've been migrating from VMware vSphere to Proxmox Ceph at work on 13th-gen Dells.

Ceph really, really, really wants homogeneous hardware, i.e., same CPU, RAM, NIC, storage, storage controller, firmware, etc. I swapped out PERCs for Dell HBA330s since I didn't want to deal with PERC HBA-mode drama.

IMO, you really want 5 nodes for production; that way you can lose 2 nodes and still have quorum. Plus, Ceph is a scale-out solution: more nodes/OSDs = more IOPS.

So, I've been creating 5-, 7-, 9-, and 11-node Proxmox Ceph clusters at work. I really see a big difference in performance between 5- and 11-node clusters. Not hurting for IOPS. Workloads range from DHCP to DB servers. All workloads are backed up to a bare-metal Proxmox Backup Server.

It's considered best practice to use isolated switches for Ceph, Corosync, and VM traffic.

I use the following optimizations learned through trial-and-error. YMMV.

Code:
    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set SAS HDD 'bluestore_prefer_deferred_size_hdd = 0' in osd stanza in /etc/pve/ceph.conf
    Set VM Disk Cache to None if clustered, Writeback if standalone
    Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
    Set VM CPU Type for Linux to 'Host'
    Set VM CPU Type for Windows to 'x86-64-v2-AES' on older CPUs/'x86-64-v3' on newer CPUs/'nested-virt' on Proxmox 9.1
    Set VM CPU NUMA
    Set VM Networking VirtIO Multiqueue to 1
    Set VM Qemu-Guest-Agent software installed and VirtIO drivers on Windows
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pools to use 'krbd' option
    Set Erasure Coding profiles to 'plugin=ISA' & 'technique=reed_sol_van'
    Set Erasure Coding profiles to 'stripe_unit=65536' for SAS HDD
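
For the bluestore tweak above, the fragment sits in the osd stanza of /etc/pve/ceph.conf, i.e.:

Code:
    [osd]
    bluestore_prefer_deferred_size_hdd = 0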
 