Cluster setup and HA configuration suggestions

kryrena · Apr 5, 2026

Hi all,

I’m putting together a Proxmox cluster with Ceph for HA and wanted to get some feedback before I go ahead and deploy everything.

What I’m aiming for is fairly simple: I want proper HA with no data loss and automatic failover, but at the same time I’d still like one node (an R640) to act as the “preferred” node where all VMs normally run. If that node goes down, the VMs should automatically restart on a secondary node.

Hardware-wise, I’ll have three main nodes. The first is a Dell R640 with 8x NVMe and 25Gb, which will be the main compute node. The second is an Intel server with 6x NVMe and 25Gb, which I’m planning to use as the secondary. The third is an HP CL3100 with 4x 16TB HDDs and a couple of SSDs, mainly intended for capacity storage. There’s also an older R630 that I might either exclude or use for non-critical workloads. Currently the Intel server is the only node and has all my VMs and data on there.

Networking will be 25Gb, and I’m planning to keep Ceph traffic on a dedicated interface.

The idea is to run a 3-node Proxmox cluster with Ceph, using an NVMe pool for VM disks and a separate HDD pool for backups or colder data. Replication would be size=3 (I think, dont really understand this) . On top of that, Id use Proxmox HA to keep VMs pinned to the R640 as the preferred node, with failover to the Intel box if needed.

I just want to sanity check a few things before I proceed. Does this overall design make sense for a production HA setup with Ceph? For separating NVMe and HDD storage, is it better to rely on device classes or define custom CRUSH rules? Also, is it a good idea to include the HP node in the Ceph cluster, or would it be better to keep it separate and use it only for backups?

Main priorities are reliability, predictable failover, and keeping the setup manageable.

Hardware:

Current / Running:

Intel Server – 2x Xeon Gold 6138, 128GB RAM, 6x NVMe Samsung PM983, 25GbE (to be upgraded to 256GB RAM)
Dell R630 – older server, can be repurposed

Planned / New:

Dell R640 – 2x Xeon Platinum 8164, 8x NVMe Kioxia CD8, 25GbE
HP CL3100 – 2x Xeon E5-2683v4, 4x 16TB HDD, 2x SATA SSD, 25GbE

Networking:

Dell S5148F-ON switch

SteveITS · Apr 5, 2026

Read through https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/. 3 nodes will technically work but as soon as you reboot one you’re at the minimum 2 replicas.

You don’t give disk sizes on the first two but if storage is unbalanced that can cause issues if the “large” server goes down as Ceph tries to “move” those PGs to other servers. Though with only 3 it can’t.

HDD in the pool will affect performance as it stores one replica on each (3) server.

LnxBil · Apr 5, 2026

SteveITS said:
Read through https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/. 3 nodes will technically work but as soon as you reboot one you’re at the minimum 2 replicas.

Isn't that true for any ceph cluster with 3 replicas? Some data will always be not available. The data is not instantly available on other nodes and needs rebalancing if you're not using noout on reboot.

spirit · Apr 5, 2026

kryrena said:
For separating NVMe and HDD storage, is it better to rely on device classes or define custom CRUSH rules?

you need to create 2 custom crush rules using the device_classes

ceph osd crush rule create-replicated <crush_rules_name> default host <device_class>

ex:

ceph osd crush rule create-replicated myhddreplication default host hdd
ceph osd crush rule create-replicated myssdreplication default host ssd

then
- create 1 pool for hdd with your myhddreplication crushrule
- create 1 pool for ssd with your mysddreplication crushrule

UdoB · Apr 5, 2026

LnxBil said:
Isn't that true for any ceph cluster with 3 replicas?

With only three nodes (and one of them down) there is not chance to rebalance. It will stay "degraded" forever as the "failure domain" usually is "host".

SteveITS · Apr 5, 2026

LnxBil said:
Isn't that true for any ceph cluster with 3 replicas?

I answered quickly but my point was supposed to be “if a server is down” which is “reboot” (temporary) or something more permanent in which case Ceph cannot recover to 3/2 replicas.

Similar issue if a drive dies and an OSD is down. Ceph will create the missing PGs on other drives on the same host (the only one it can, using the default 3/2) so other drives need enough free space to hold what was on the hypothetical missing drive. (Plus nearfull ratio defaults to 85% per OSD)

Yes there are the flags to prevent rebalancing for a simple reboot.

If 1/3 servers are down Ceph has only 2/2 copies of PGs but also the whole cluster is at increased risk since 2/3 is 1 away from no quorum.

Yes it can all work with 3, I’m just trying to point out the potential issues.

4 servers plus a Qdevice would allow for two servers to be offline at any one time.

jdancer · Apr 5, 2026

I've been migrating from VMware vSphere to Proxmox Ceph at work on 13th-gen Dells.

Ceph really, really, really wants homogeneous hardware, ie, same CPU, RAM, NIC, storage, storage controller, firmware, etc, etc. I swapped out PERCs for Dell HBA330s since I won't want to deal with PERC HBA-mode drama.

IMO, you really want 5-nodes for production, that way can lose 2 nodes and still have quorum. Plus, Ceph is a scale-out solution. More nodes/OSDs = more IOPS.

So, been creating, 5-, 7-, 9-, 11-node Proxmox Ceph clusters at work. I really see a big difference in performance between 5- vs. 11-node clusters. Not hurting for IOPS. Workloads range from DHCP to DB servers. All workloads backed up to a bare-metal Proxmox Back Server.

It's considered best practice to use isolated switches for Ceph, Corosync, and VM traffic.

I use the following optimizations learned through trial-and-error. YMMV.

Code:

    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set SAS HDD 'bluestore_prefer_deferred_size_hdd = 0' in osd stanza in /etc/pve/ceph.conf
    Set VM Disk Cache to None if clustered, Writeback if standalone
    Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
    Set VM CPU Type for Linux to 'Host'
    Set VM CPU Type for Windows to 'x86-64-v2-AES' on older CPUs/'x86-64-v3' on newer CPUs/'nested-virt' on Proxmox 9.1
    Set VM CPU NUMA
    Set VM Networking VirtIO Multiqueue to 1
    Set VM Qemu-Guest-Agent software installed and VirtIO drivers on Windows
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pools to use 'krbd' option
    Set Erasure Coding profiles to 'plugin=ISA' & 'technique=reed_sol_van'
    Set Erasure Coding profiles to 'stripe_unit=65536' for SAS HDD

SteveITS · Apr 5, 2026

Here’s a discussion on Ceph recovery I just ran across: https://forum.proxmox.com/threads/u...-node-cluster-with-12-osds.161734/post-745316

> Ceph really, really, really wants homogeneous hardware

Hmm, not my experience, explain?

kryrena · Apr 7, 2026

SteveITS said:
Read through https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/. 3 nodes will technically work but as soon as you reboot one you’re at the minimum 2 replicas.

You don’t give disk sizes on the first two but if storage is unbalanced that can cause issues if the “large” server goes down as Ceph tries to “move” those PGs to other servers. Though with only 3 it can’t.

HDD in the pool will affect performance as it stores one replica on each (3) server.

R630 wont be part of the storage CEPH, it will run independent, non critical machines. The intel server has 6x 1.92TB NVMes.
The shared storage shall be between the R640 (8x 1.6TB NVMe) and the Intel (6x 1.92TB NVMe), but if its better to work with the same storage I can switch the planned R640 to have the exact same storage capacity as the Intel server.

HDDs wont be in CEPH, I will use them to store the whole shared storage on there, as a backup

kryrena · Apr 7, 2026

spirit said:
you need to create 2 custom crush rules using the device_classes

ceph osd crush rule create-replicated <crush_rules_name> default host <device_class>

ex:

ceph osd crush rule create-replicated myhddreplication default host hdd
ceph osd crush rule create-replicated myssdreplication default host ssd

then
- create 1 pool for hdd with your myhddreplication crushrule
- create 1 pool for ssd with your mysddreplication crushrule

Thank you. This way the NVMe and HDD storage remain independent, so the HDD can act as a backup correct?

kryrena · Apr 7, 2026

jdancer said:
Ceph really, really, really wants homogeneous hardware, ie, same CPU, RAM, NIC, storage, storage controller, firmware, etc, etc. I swapped out PERCs for Dell HBA330s since I won't want to deal with PERC HBA-mode drama.

IMO, you really want 5-nodes for production, that way can lose 2 nodes and still have quorum. Plus, Ceph is a scale-out solution. More nodes/OSDs = more IOPS.

Unfortunately I do not have the budget for more machines or homogenization of the servers. I will have a total of 4 servers and a QDevice in the cluster, but that's all I can afford right now. My biggest fear is that the Intel server is already setup and working, without redundancy or backups. I need to make sure that whatever I plan to do won't lead to data loss

SteveITS · Apr 7, 2026

kryrena said:
HDDs wont be in CEPH, I will use them to store the whole shared storage on there, as a backup

I’m not sure what you mean, still. Ceph is shared storage. Perhaps more coffee…

You plan 2 nodes with Ceph, or 3? With 2 you can’t really use the default 3/2 replica setting. (3 copies of each PG, 2 minimum, on 3 different hosts)

Leonid Kanareykin · Apr 15, 2026

It's considered best practice to use isolated switches for Ceph, Corosync, and VM traffic.

Can I use VLANs dedicated for Ceph only trafic with my 25 Gb NICs? Or i must use separate switches?

SteveITS · Apr 15, 2026

Can you prioritize the corosync VLAN in your switch? We did that for our backup corosync links. 1 Gbps is just fine, the actual bandwidth used is minimal. The need for corosync is for low latency.

Ceph can have a second, optional "private" network to handle replication/rebalancing traffic. If the switch can keep up with multiple 25 Gbps ports then I don't know that another switch is required.

Note if all your corosync links are going through one switch that switch is one point of failure for all of them.

Cluster setup and HA configuration suggestions

kryrena

New Member

SteveITS

Renowned Member

LnxBil

Distinguished Member

spirit

Distinguished Member

UdoB

Distinguished Member

SteveITS

Renowned Member

jdancer

Renowned Member

SteveITS

Renowned Member

kryrena

New Member

kryrena

New Member

kryrena

New Member

SteveITS

Renowned Member

Leonid Kanareykin

New Member

SteveITS

Renowned Member

We value your privacy