I am looking to design a production HA Proxmox VE cluster for a small company, so cost is the major constraint. The cluster will run production VMs (including DNS/DHCP) and development VMs, and will use manually assigned IPs.
These are the setups I am considering so far; if I missed something, please suggest it, but keep the cost limitation in mind.
Option A: 2-node + QDevice with Ceph
- The 2 main nodes will be new servers with lots of RAM and CPU threads, plus a few hot-plug disks and a separate high-speed (10GbE) network between the nodes. The QDevice could sit on a slower net.
 - Provides automatic failover of compute and storage
 - Near instant replication of virtual disks as writes are passed through to the replica on the other node
 - Unclear how to set up a QDevice for both Proxmox itself and Ceph by just installing software on a less powerful Debian node (see the sketch after this list).
 - Rumors that Ceph will insist on keeping 3 copies of everything instead of the 2 in RAID 1
 - Rumors that Ceph will do massive amounts of unneeded data copying on the other node when one of the nodes is taken offline
 - Unclear how this handles a full power outage that takes out both nodes at nearly the same time as the UPS runs empty.
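
For the QDevice question, a minimal sketch of what I understand the corosync side to look like (the Debian host's IP is a placeholder):

```
# On the small Debian box (it only provides a quorum vote, not storage):
apt install corosync-qnetd

# On both PVE nodes:
apt install corosync-qdevice

# From one PVE node, register the QDevice (10.0.0.5 is a placeholder IP):
pvecm qdevice setup 10.0.0.5

# The cluster should now report 3 expected votes:
pvecm status
```

My assumption is that this only covers Proxmox quorum; Ceph keeps its own quorum, so the Debian box would additionally have to run a Ceph monitor (or Ceph stays at 2 MONs and cannot lose either node), which the QDevice does not handle.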
 
Option B: 2-node + QDevice with ZFS replication
- Same hardware as option A
 - Provides automatic failover of compute, maybe storage too
 - Delayed replication of virtual disks, causing failed-over VMs to revert to older data (see the replication sketch after this list).
 - The QDevice apparently only needs to deal with Proxmox corosync; no extra work for ZFS replication.
 - Unclear if ZFS replication keeps 2 or 4 copies of data (1 or 2 per node).
 - Hopefully will have a less complicated reaction to a full power outage that takes out both nodes at nearly the same time as the UPS runs empty.
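
For the replication-delay point, a sketch of how I believe the storage replication schedule is set per guest (VMID 100 and node name pve2 are placeholders); as far as I can tell, a failed-over VM restarts from the last completed sync, so the schedule is effectively the maximum data loss:

```
# Replicate guest 100's disks to the other node every 5 minutes
# (job id 100-0; "pve2" is a placeholder node name):
pvesr create-local-job 100-0 pve2 --schedule "*/5"

# Show job state and the time of the last successful sync:
pvesr status

# Let the HA manager restart the guest on the surviving node:
ha-manager add vm:100 --state started
```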
 
Option C: 3-node with Ceph
- Similar hardware to the 2-node options but with less memory per node and a higher total cost.
 - Provides automatic failover of compute and storage
 - Near instant replication of virtual disks as writes are passed through to the replica on other nodes
 - Rumors that Ceph will insist on keeping 3 copies of everything instead of the 2 in RAID 1
 - Rumors that Ceph will do massive amounts of unneeded data copying on the other nodes when one of the nodes is taken offline (see the Ceph commands after this list).
 - Unclear how this handles a full power outage that takes out all or most nodes at nearly the same time as the UPS runs empty.
 - More expensive due to the extra node and need for a 10GbE switch to connect 3 nodes on each backend net.
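
On the replica-count and rebalancing rumors, a sketch of the knobs I think are involved (the pool name is a placeholder; 3/2 is the Ceph default, and the size-2 lines are shown only to illustrate that it is tunable, not as a recommendation):

```
# Inspect the replica count on a pool ("vm-pool" is a placeholder):
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size

# Default is size=3/min_size=2; a 2-copy pool is possible but risky:
ceph osd pool set vm-pool size 2
ceph osd pool set vm-pool min_size 2

# For planned maintenance, stop Ceph from marking OSDs out and
# rebalancing while a node is briefly offline:
ceph osd set noout
# ...reboot/upgrade the node, then:
ceph osd unset noout
```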
 
Option D: 3-node with ZFS replication
- Same hardware as option C
 - Provides automatic failover of compute, maybe storage too
 - Delayed replication of virtual disks, causing failed-over VMs to revert to older data.
 - Unclear if ZFS replication keeps 2 or 4 copies of data, i.e. 1 or 2 per node (see the check after this list).
 - Hopefully will have a less complicated reaction to a full power outage that takes out all or most nodes at nearly the same time as the UPS runs empty.
 - More expensive due to the extra node and need for a 10GbE switch to connect 3 nodes on each backend net.
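
For the copy-count question, a sketch of how I would check it (dataset, group and node names are placeholders). My understanding is that replication puts one copy of each disk on each target node, and the ZFS `copies` property (default 1) decides whether a node stores extra copies on top of its own mirror/RAIDZ redundancy:

```
# Check the per-node copy count (default is 1, i.e. one copy per node
# plus whatever redundancy the vdev layout itself provides):
zfs get copies rpool/data

# Keep production guests on specific nodes with an HA group
# ("prod" and the node names are placeholders):
ha-manager groupadd prod --nodes "pve1,pve2,pve3"
ha-manager add vm:100 --state started --group prod
```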
 
Option E: 2-node + QDevice with shared SAN storage
- Same hardware as option A/B, but without the hot-plug disks, plus a 3rd-party HA SAN storage solution and hardware.
 - Provides automatic failover of compute, with shared access to HA storage via SCSI locking on the SAN or Proxmox's own coordination of access (see the storage sketch after this list).
 - Maybe the QDevice can run on the SAN hardware, maybe on some other Debian server.
 - The QDevice apparently only needs to deal with Proxmox corosync; no extra work for the SAN's HA.
 - Hopefully will have a less complicated reaction to a full power outage that takes out both nodes at nearly the same time as the UPS runs empty.
 - More expensive due to the extra SAN solution and potential need for a 10GbE switch to connect 2 nodes to SAN.
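
For the shared-access point, a sketch of how I expect the SAN would be attached (portal IP, IQN, VG and storage names are placeholders). My understanding is that marking the LVM storage as shared lets Proxmox coordinate access itself, without a cluster filesystem, at the cost of no snapshots on plain LVM:

```
# Attach the SAN's iSCSI target to the cluster:
pvesm add iscsi san-iscsi --portal 10.0.1.10 \
    --target iqn.2005-10.org.example:vmstore --content none

# On one node, create a volume group on the exported LUN
# (replace /dev/sdX with the actual LUN device):
vgcreate vg_san /dev/sdX

# Register the VG as shared LVM storage usable by both nodes:
pvesm add lvm vmstore --vgname vg_san --shared 1 --content images
```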