Hi all,
I’m putting together a Proxmox cluster with Ceph for HA and wanted to get some feedback before I go ahead and deploy everything.
What I’m aiming for is fairly simple: I want proper HA with no data loss and automatic failover, but at the same time I’d still like one node (an R640) to act as the “preferred” node where all VMs normally run. If that node goes down, the VMs should automatically restart on a secondary node.
Hardware-wise, I’ll have three main nodes. The first is a Dell R640 with 8x NVMe and 25Gb, which will be the main compute node. The second is an Intel server with 6x NVMe and 25Gb, which I’m planning to use as the secondary. The third is an HP CL3100 with 4x 16TB HDDs and a couple of SSDs, mainly intended for capacity storage. There’s also an older R630 that I might either exclude or use for non-critical workloads. Currently the Intel server is the only node and holds all my VMs and data.
Networking will be 25Gb, and I’m planning to keep Ceph traffic on a dedicated interface.
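For the dedicated Ceph interface, my understanding is that I can just set the networks at init time — something like this (subnets are placeholders, not my real ones):

```shell
# Separate Ceph public and cluster (replication) traffic onto their own
# subnets; placeholder subnets, substitute the real 25Gb interface networks.
pveceph init --network 10.10.10.0/24 --cluster-network 10.10.20.0/24
```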
The idea is to run a 3-node Proxmox cluster with Ceph, using an NVMe pool for VM disks and a separate HDD pool for backups or colder data. Replication would be size=3 (I think — I don’t fully understand this part yet). On top of that, I’d use Proxmox HA to keep VMs pinned to the R640 as the preferred node, with failover to the Intel box if needed.
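For the “preferred node” part, what I have in mind is an HA group with node priorities, roughly like this (node names and VMID are placeholders, untested):

```shell
# HA group where the R640 has the higher priority, so VMs prefer it;
# nofailback=0 means VMs migrate back once the R640 recovers.
ha-manager groupadd prefer-r640 --nodes "r640:2,intel:1" --nofailback 0
# Pin a VM (e.g. VMID 100) to that group and keep it started
ha-manager add vm:100 --group prefer-r640 --state started
```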
I just want to sanity check a few things before I proceed. Does this overall design make sense for a production HA setup with Ceph? For separating NVMe and HDD storage, is it better to rely on device classes or define custom CRUSH rules? Also, is it a good idea to include the HP node in the Ceph cluster, or would it be better to keep it separate and use it only for backups?
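For the device-class question, this is roughly what I understand device-class-based rules to look like from the docs (rule and pool names are mine, nothing here is deployed yet):

```shell
# CRUSH rules restricted to a device class; classes (nvme/hdd) are
# normally auto-detected when OSDs are created.
ceph osd crush rule create-replicated nvme-only default host nvme
ceph osd crush rule create-replicated hdd-only default host hdd

# Pool for VM disks bound to the NVMe rule; size=3 keeps three copies,
# min_size=2 keeps I/O running with one copy temporarily missing.
ceph osd pool create vm-nvme 128 128 replicated nvme-only
ceph osd pool set vm-nvme size 3
ceph osd pool set vm-nvme min_size 2
```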
Main priorities are reliability, predictable failover, and keeping the setup manageable.
Hardware:
Current / Running:
- Intel Server – 2x Xeon Gold 6138, 128GB RAM, 6x NVMe Samsung PM983, 25GbE (to be upgraded to 256GB RAM)
- Dell R630 – older server, can be repurposed
- Dell R640 – 2x Xeon Platinum 8164, 8x NVMe Kioxia CD8, 25GbE
- HP CL3100 – 2x Xeon E5-2683v4, 4x 16TB HDD, 2x SATA SSD, 25GbE
- Dell S5148F-ON switch