Query: 2 node setup with HA (3rd box for quorum) > ZFS datastore / better-other?-options?


Jun 4, 2008
Hi, I am hoping to just make a sanity check for guidance to see if I'm overlooking something. Feedback is greatly appreciated!

I'm hoping to set up a modest-size cluster for a client. The project is ramping up with 'fairly modest' requirements, and they hope the business will grow, so it needs to scale.
Short term they don't want to invest in an HA multi-storage device pool (ie, 2+ bricks with dual-controller iSCSI SAN "EqualLogic"-style gear). The storage needs are small enough that a Ceph cluster does not make sense (ie, they hope for 2 physical Proxmox nodes plus 1 physical PBS server that also acts as the 3rd cluster quorum member for the 'cluster up status sanity check vote'). The actual resources needed are approx. 60 GB RAM for VM guests, 1 TB of disk storage for VMs, and 24 CPU cores.

I have the impression right now my options are
(a) use ZFS replication between the proxmox1<>2 nodes. This will be async only, as there appears to be no realtime CoW/sync replication support? ie, we can only schedule it (replicate data every 5? or 1? minutes, for example). If there is a way to do 'realtime' sync replication with ZFS between nodes and I'm just missing it somehow, please do kick me along in the right direction.

(b) use Ceph with a highly constrained node setup. ie, I have to push the client for 3x physical Proxmox nodes with sufficient disk to make it functional, plus dedicated 10gig for the Ceph storage interconnect. Maybe we have a modest-size HW RAID1 mirror for the base Proxmox volume on each node, then approx 4 x SSD drives in each Proxmox node, for example 4 x 1 TB SSD. Then with Ceph we might get a total of 3-4? TB usable space (2N+1 copies of data spread over 3 nodes) for the Ceph HA storage pool. (I generally have the feeling that Ceph wants to run on 'bigger, more serious' deployments, ie, 6-10+ nodes, where you want multi-multi TBs of storage, need lots and lots of VMs, etc, ie, not such a great fit for "yeah, we have maybe 8-ish VMs and ~=<1 TB of storage footprint needed maybe kinda". But maybe I'm not being proper-fair in my assessment?)

(c) use DRBD. This is not really an option I want to explore; I tested it a version or two back on Proxmox and the general feeling I had was that it is kind of painful to get working reliably and is a bit more fragile than I like.

(d) just forget about true HA storage and go with a 'robust, modest, decently fault tolerant' QNAP NFS filer as the shared storage target (redundant power and disks, but basically a non-redundant controller). So we have a single point of failure on the QNAP, which is not great, but generally these are solid and run well, so outages will be few. Regular backups to PBS mean that, worst case, we can do a bare-VM restore from PBS onto local storage if the backing QNAP storage device fails and we want to get production VMs back online 'quickly, with human intervention, but clearly not an HA-style auto-recovery' sort of thing.

(e) use an external ZFS HA storage solution of some kind, let it deal with the HA storage and let Proxmox just use that as the 'trusted storage tank'. I recently was given a heads up about this ZFS stack: https://haossoft.com/requirements/ - it appears to be pure standard ZFS under the hood, with the team on this project putting together the 'glue' to 'make it work'. But they clearly say, 'hey, this is not production ready, we are testing / asking for feedback'. So I am not positive I should 'test, deploy' in production. :)

Anyhoo. Figured I would put this out there in case anyone wanted to give me a kick in the right direction, or remind me of obvious things I am overlooking.

Many thanks!

With ZFS replication you can go down to a 1 minute interval. If that is enough for your client, then I would highly suggest this solution, as VM IO is bound to the local node and does not need any network. Once the initial replication is done, each run should finish within a few seconds if the guests don't write a lot each minute.
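
To put a number on what "async" means here, a rough sketch of the worst-case data loss window follows. It assumes a simple model that is not taken from the Proxmox documentation: a snapshot is taken every interval, and the copy on the other node only becomes usable once the transfer has finished. The function name and the sample durations are made up for illustration.

```python
def worst_case_loss_seconds(interval_s: float, run_duration_s: float) -> float:
    """Upper bound on how much recent data can be lost if the source node dies.

    Assumed model: a snapshot is taken every `interval_s` seconds and needs
    `run_duration_s` seconds to arrive on the target. If the node fails just
    before a transfer completes, the target still holds the previous
    snapshot, which is then one interval plus one transfer old.
    """
    return interval_s + run_duration_s


if __name__ == "__main__":
    # Hypothetical schedules: (interval, transfer time), both in seconds.
    for interval, duration in [(60, 5), (300, 10), (900, 20)]:
        loss = worst_case_loss_seconds(interval, duration)
        print(f"every {interval:>3}s, transfer {duration:>2}s -> up to {loss:.0f}s of writes lost")
```

With a 1 minute schedule and short transfers, the exposure stays a little over a minute, which is the figure to check against what the client can tolerate.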

Ceph can work fine in a 3-node cluster as well. You can lose 1 node and the cluster will stay functional. If you don't need to expand the cluster in the foreseeable future, you can use a full-mesh network for Ceph and skip the switch.

Ceph does redundancy on the node level, that means, each node needs to have the same amount of storage available. By default, it will create 3 copies of the data and will continue working in read and write mode if 2 copies are still available.
That means that each node should have the same number of disks present and that will be your storage size.
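
As a back-of-the-envelope check for the 3 nodes with 4 x 1 TB SSD mentioned in the question, the sketch below divides raw capacity by the number of copies. The 0.85 fill ratio is an assumption added here: it mirrors Ceph's default near-full warning threshold and leaves some headroom, and it is not a hard limit. The function name is made up for illustration.

```python
def ceph_usable_tb(nodes: int, disks_per_node: int, disk_tb: float,
                   replicas: int = 3, fill_ratio: float = 0.85) -> float:
    """Rough usable capacity of a replicated Ceph pool in TB.

    raw capacity / number of copies, scaled by how full you are willing to
    run the pool (fill_ratio is an assumption, see the text above).
    """
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb / replicas * fill_ratio


if __name__ == "__main__":
    # Numbers from the question: 3 nodes, 4 x 1 TB SSD each, 3 copies.
    print(f"theoretical: {ceph_usable_tb(3, 4, 1.0, fill_ratio=1.0):.1f} TB")
    print(f"practical:   {ceph_usable_tb(3, 4, 1.0):.1f} TB")
```

So the "3-4 TB usable" estimate in the question is in the right range: with 3 copies on 3 nodes, the usable space is roughly what a single node holds.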

Any storage that is exposed as a network share will need a decently fast network as well.

A few other considerations: Add more RAM than just what the VMs need. ZFS and Ceph want RAM for their services, and ZFS especially can benefit a lot from extra RAM as it is used as read cache. For example, I personally run a small 2 node + QDevice cluster with ZFS replication between the 2 nodes.
Each node has 128GB of RAM. The guests use about half of that, which leaves plenty of RAM for the system and the ZFS cache (ARC), which varies between 30 and 45GB in size depending on the usage.

For the 3rd vote you can just install the corosync-qnetd on the PBS backup node and not a full PVE. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support
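
Why the third vote matters can be shown with the usual majority rule (more than half of all expected votes). This is only a sketch of the arithmetic, assuming one vote per node and one for the QDevice; the function names are made up for illustration and are not part of any Proxmox or corosync API.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def has_quorum(votes_present: int, total_votes: int) -> bool:
    """True if the surviving members can still form a majority."""
    return votes_present >= votes_needed(total_votes)


if __name__ == "__main__":
    # 2 nodes only: losing one node leaves 1 of 2 votes.
    print("2 nodes, one down:", has_quorum(1, 2))
    # 2 nodes + QDevice on the PBS box: losing one node leaves 2 of 3 votes.
    print("2 nodes + QDevice, one node down:", has_quorum(2, 3))
```

With only two votes, a single node failure leaves the survivor without a majority, so HA cannot act. The extra vote on the PBS box is what lets the remaining node keep quorum.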

Hi Aaron! Thanks for this reply. This all sounds good. Kind of about what I had thought might be the case, more or less, based on reading I've done so far. I'm pretty sure for this deployment a 2-node setup with ZFS @ 1 minute replication might be simpler, with the 3rd vote on the PBS box.

I must admit I am not sure I knew ZFS might want as much as 64 GB of RAM. I think I had read some stuff in the past about ZFS needing 1 GB per TB being managed, but maybe that was assuming the use of SSD drives dedicated for cache and not using RAM as cache (which I think is what you are saying can be done, and is presumably better for performance than SSD cache?)

Client is still firming up the budget/requirement details. We may end up just going with the 'simple' model of
2 x proxmox nodes
shared storage (NFS on a qnap via 2x10gig ports trunk port set / for bandwidth and redundancy)
PBS server for 3rd vote in the HA Cluster
In the unlikely event that the shared QNAP storage blows up, we plan on doing a restore from PBS onto the remaining alive Proxmox node (ie, a manual failover intervention), with the PBS server doing backups hourly, for example. That way any 'disaster recovery restore from backup' will be 'pretty current, maybe not 100% current, but good enough'.

None of the VM servers are really going to be doing heavy Disk IO. So this should be doable.

But it is great to know the 3-node ceph / or the ZFS minute-level model - are also viable.


ZFS does not need that much RAM, but if you have it, it can help to speed up read operations.

By default, ZFS will use up to 50% of RAM if free. If read operations can be served from the ARC (cache in RAM), they will be obviously much faster than accessing the actual disk and will still be quite a bit faster than an SSD cache (L2ARC). Even an L2ARC will need a bit of additional RAM to hold the index.

The measure for the ARC is the hit ratio. If you have some performance monitoring, you can add this as well. On the 2 node cluster I mentioned, I get a hit ratio close to 100% during normal operation, usually 99.xx%. For other operations, like backups or installing software / updates, it will drop considerably for a short time.

The tool arcstat will print you the current size of the ARC and how many misses it had (inverse of the hit ratio). man arcstat will give a detailed description of each column.
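
If you want the hit ratio in your own monitoring, a minimal sketch is below. It assumes OpenZFS on Linux, where the raw counters are exposed as `name type data` rows in /proc/spl/kstat/zfs/arcstats, and it reports the ratio since boot rather than per interval like arcstat does. The function names and the sample text are made up for illustration.

```python
def parse_arcstats(text: str) -> dict:
    """Parse 'name type data' rows into {name: int}, skipping header lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_hit_ratio_percent(stats: dict) -> float:
    """Percentage of ARC lookups served from RAM (hits vs. hits + misses)."""
    total = stats["hits"] + stats["misses"]
    return 100 * stats["hits"] / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample in the assumed kstat layout; on a real node you would
    # read /proc/spl/kstat/zfs/arcstats instead.
    sample = """\
13 1 0x01 2 544 1 2
name                            type data
hits                            4    9950
misses                          4    50
"""
    print(f"ARC hit ratio: {arc_hit_ratio_percent(parse_arcstats(sample)):.2f}%")
```

Sampling the counters twice and taking the difference would give the ratio over a time window, which is closer to what arcstat shows.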

Besides ZFS replication or not, I would still opt to use ZFS instead of a HW RAID controller. It gives more flexibility in the features it has and if you have a hardware failure and need to recover data, you can attach the disks to another machine that can handle ZFS without being bound to a specific RAID controller model & firmware.

Thanks for the added detail!

In this case the client is the sort who prefers to pay for extended warranty and 3-hour onsite service from Dell. ie, they are not the kind who will ever 'attach disks to a different box that is not precisely the same make and model as the original box'.

IMHO the model of 'hardware agnostic' is a very nice one, very versatile, you can basically deploy your proxmox on one standard box, and if you have a failure of the node, you can swap disks over to 'similar but not mandatory to be identical' box - or even fairly dissimilar - and things will 'just work' (ie, no hardware raid dependency).

However, at least for this project, the client is setting their preferences (ie, we pay extra for Dell warranty, we expect to always run this stack on precisely the same boxes for the life cycle of the project, etc). They are not interested in 'optimizing budget' by having 'more flexibility' or 'not paying for vendor warranty uplift'. Not my choices / not my preferences. Just the way the project will run.

Also very good to know: the RAM cache is a 'nice optional' feature in ZFS, ie, not mandatory. (I'm pretty sure they will not want to allocate 64 GB of RAM for ZFS if they did go the ZFS route. So using a dedicated SSD cache volume will likely be sufficient for their needs, which are ironically fairly modest, other than a major desire for 'best uptime they can manage with the designated budget'.)

So .. I'm still not sure which use case is going to be best for them. (ie, ZFS or Ceph or Shared Storage). But at least we have 3 choices / scenarios which are possible / to address their core requirements. So client budget allocation and general preference will be the deciding factors I am guessing.


