I am looking to design a production HA Proxmox VE cluster for a small company, so cost is the major constraint. The cluster will run production VMs (including DNS/DHCP) and development VMs, and will use manually assigned IPs.
These are the setups I am considering so far. If I missed something, please suggest it, but keep the cost limitation in mind.
Option A: 2-node + QDevice with Ceph
- The 2 main nodes will be new servers with lots of RAM and threads, plus a few hot-plug disks, and a separate high-speed (10GbE) link between the nodes. The QDevice could sit on a slower network.
- Provides automatic failover of compute and storage
- Near-instant replication of virtual disks, as writes are passed through to the replica on the other node
- Unclear how to set up a QDevice for both Proxmox itself and Ceph just by installing software on a less powerful Debian node (see the sketch after this list)
- Rumors that Ceph will insist on keeping 3 copies of everything instead of the 2 in RAID 1
- Rumors that Ceph will do massive amounts of unneeded data copying on the other node when one of the nodes is taken offline
- Unclear how this deals with a full power outage taking out both nodes at nearly the same time as the UPS runs empty.
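From what I have read so far (please correct me), the QDevice only supplies a corosync vote and does not act as a Ceph monitor, so Ceph would still need a third monitor somewhere for its own quorum. A minimal sketch of the QDevice part, assuming the small Debian box sits at 192.0.2.10 (placeholder address):

```
# On the small Debian node (placeholder IP 192.0.2.10):
apt install corosync-qnetd

# On every Proxmox node:
apt install corosync-qdevice

# On one Proxmox node, register the QDevice with the cluster:
pvecm qdevice setup 192.0.2.10
```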
Option B: 2-node + QDevice with ZFS replication
- Same hardware as option A
- Provides automatic failover of compute, maybe storage too
- Delayed replication of virtual disks, causing failed-over VMs to revert to older data (see the replication-job sketch after this list).
- QDevice apparently only needs to deal with Proxmox corosync, no extra work for ZFS replication.
- Unclear if ZFS replication keeps 2 or 4 copies of data (1 or 2 per node).
- Hopefully this will have a less complicated reaction to a full power outage taking out both nodes at nearly the same time as the UPS runs empty.
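To make the "older data" point concrete: as far as I can tell, storage replication runs on a schedule, so the worst-case rollback on failover is roughly one replication interval. A sketch of what I think the job would look like (VM ID 100, node name pve2 and the 15-minute schedule are placeholders):

```
# Replicate VM 100 to pve2 every 15 minutes
# (worst case, a failover rolls back ~15 minutes of writes)
pvesr create-local-job 100-0 pve2 --schedule "*/15"

# Check when each job last ran
pvesr status
```

If each node's local pool is a ZFS mirror, I believe that works out to 2 physical copies per node, 4 in total.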
Option C: 3-node with Ceph
- Similar hardware to the 2-node options, but with less memory per node and a higher total cost.
- Provides automatic failover of compute and storage
- Near-instant replication of virtual disks, as writes are passed through to the replicas on the other nodes
- Rumors that Ceph will insist on keeping 3 copies of everything instead of the 2 in RAID 1 (see the pool-size sketch after this list)
- Rumors that Ceph will do massive amounts of unneeded data copying on the other nodes when one of the nodes is taken offline
- Unclear how this deals with a full power outage taking out all/most nodes at nearly the same time as the UPS runs empty.
- More expensive due to the extra node and the need for a 10GbE switch to connect the 3 nodes on each backend network.
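On the "3 copies" rumor: my understanding is that Ceph's replicated pools default to size=3 (with min_size=2), and that the factor can be lowered, although that is widely said to be risky. A sketch of the knobs I believe are involved (the pool name is a placeholder):

```
# Show the replication settings for a pool named "vm-pool" (placeholder)
ceph osd pool get vm-pool size       # default 3
ceph osd pool get vm-pool min_size   # default 2

# Rough capacity math at size=3:
#   3 nodes x 4 x 2 TB raw = 24 TB  ->  ~8 TB usable, before overhead
# Lowering the factor (reportedly discouraged) would look like:
ceph osd pool set vm-pool size 2
```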
Option D: 3-node with ZFS replication
- Same hardware as option C
- Provides automatic failover of compute, maybe storage too
- Delayed replication of virtual disks, causing failed-over VMs to revert to older data (see the two-target sketch after this list).
- Unclear if ZFS replication keeps 2 or 4 copies of data (1 or 2 per node).
- Hopefully this will have a less complicated reaction to a full power outage taking out all/most nodes at nearly the same time as the UPS runs empty.
- More expensive due to the extra node and the need for a 10GbE switch to connect the 3 nodes on each backend network.
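With 3 nodes I assume each VM could be replicated to both of the other nodes, which is part of why I am unsure how many copies end up existing. Sketch (VM ID, node names and schedule are placeholders):

```
# Replicate VM 100 to both other nodes on a 15-minute schedule
pvesr create-local-job 100-0 pve2 --schedule "*/15"
pvesr create-local-job 100-1 pve3 --schedule "*/15"

# List every replication job defined on the cluster
pvesr list
```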
Option E: 2-node + QDevice with shared SAN storage
- Same hardware as option A/B, but without the hot-plug disks, plus a 3rd-party HA SAN storage solution and its hardware.
- Provides automatic failover of compute, with shared access to HA storage via SCSI locking on the SAN or Proxmox coordinating access (see the storage.cfg sketch after this list).
- Maybe the QDevice can run on the SAN hardware, maybe on some other Debian server.
- QDevice apparently only needs to deal with Proxmox corosync, no extra work for the SAN/NAS HA.
- Hopefully this will have a less complicated reaction to a full power outage taking out both nodes at nearly the same time as the UPS runs empty.
- More expensive due to the extra SAN solution and the potential need for a 10GbE switch to connect the 2 nodes to the SAN.
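My working assumption for this option is shared LVM on top of an iSCSI LUN exported by the SAN, which Proxmox can coordinate cluster-wide without a cluster filesystem. A sketch of what I think /etc/pve/storage.cfg would contain (all names, the portal address and the target IQN are placeholders):

```
# /etc/pve/storage.cfg sketch
iscsi: san
        portal 192.0.2.20
        target iqn.2004-04.com.example:san-target
        content none

lvm: san-lvm
        vgname vg_san          # volume group created on the iSCSI LUN
        shared 1
        content images,rootdir
```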