Considering building a 3-node Proxmox cluster with ZFS replication

offerlam

Renowned Member
Dec 30, 2012
Denmark
Hi all,

We are considering a setup where nodeA holds the production data and nodeB and nodeC receive a replica of production every one or two minutes.

The idea is for VMs on nodeA to fail over automatically to nodeB and, if nodeB fails, to fail over automatically to nodeC.

My question is about ZFS. I would need it to do this, but I have never worked with it before.

From our sizing it looks like we need NVMe drives attached to PCI Express cards to get the storage we need. Is this a problem? We intend to buy enterprise drives, since I see a lot of advice against using consumer ones. That advice was about SSDs, but I guess it goes for NVMe as well.

Also, what RAID level would you recommend? I see that RAIDZ can even do triple parity.

Basically, I'm asking about dos and don'ts here.

Any advice would be greatly appreciated.
 
Hi,
my five cents.

ZFS isn’t a cluster filesystem → it’s single-host by design. It can’t natively do “live shared storage” across multiple nodes (like Ceph or GlusterFS).
For replication/failover setups like you describe (A → B → C every 1–2 min), you’d use:
zfs send / zfs receive (native replication, often wrapped by tools like syncoid or zrep).
Or something like Pacemaker/Corosync + ZFS + replication tool to manage failover.
Failover logic (moving VMs to another node) is not built into ZFS. You’d need clustering software (Proxmox HA, Pacemaker, or a hypervisor with HA logic).
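
To make the zfs send / zfs receive step concrete, here is a minimal sketch of what one incremental replication round looks like. It only builds the command strings and does not run anything. The dataset name, the snapshot names and the host nodeB are hypothetical placeholders, and it assumes root SSH access between the nodes. Tools like syncoid add snapshot bookkeeping and error handling on top of this basic idea.

```python
import shlex


def snapshot_cmd(dataset: str, snap: str) -> str:
    """Command that creates a new snapshot of the source dataset."""
    return shlex.join(["zfs", "snapshot", f"{dataset}@{snap}"])


def incremental_send_pipeline(dataset: str, prev_snap: str, new_snap: str,
                              target_host: str, target_dataset: str) -> str:
    """Shell pipeline that ships only the changes between two snapshots."""
    # -i sends the delta from prev_snap to new_snap instead of a full stream.
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # -F rolls the target back to its latest snapshot before receiving.
    recv = ["zfs", "receive", "-F", target_dataset]
    remote = shlex.quote(shlex.join(recv))
    return f"{shlex.join(send)} | ssh {shlex.quote(target_host)} {remote}"


# Hypothetical names, for illustration only.
ds = "rpool/data/vm-100-disk-0"
print(snapshot_cmd(ds, "rep2"))
print(incremental_send_pipeline(ds, "rep1", "rep2", "nodeB", ds))
```

The previous snapshot (rep1 here) must still exist on both sides, otherwise the incremental stream cannot be applied and a full send is needed again.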

ZFS works with vdevs → you build pools out of groups of disks.
RAIDZ (single, double, triple parity):

RAIDZ1 (like RAID5) is not recommended anymore for large drives.
RAIDZ2 (like RAID6) is the minimum I’d suggest for production.
RAIDZ3 (triple parity) is good if you have many disks per vdev or very large capacity drives.

ZFS mirrors (RAID10 equivalent) give the best IOPS for VMs.
For VM workloads, mirrors are generally better than RAIDZ because random I/O is much faster.
So: if this is a VM backend, mirrored vdevs are the go-to. RAIDZ is more suited for archive/backup/cold storage.
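
The trade-off can be put into rough numbers. The sketch below compares striped mirrors with a single RAIDZ vdev for the same set of disks. It is a back-of-the-envelope calculation only: it ignores ZFS metadata, padding and the free space you should keep, and it uses the vdev count as a crude stand-in for random IOPS (a common rule of thumb, not a measurement). The function names and the 6 x 4 TB example are my own assumptions.

```python
def mirror_pool(n_disks: int, disk_tb: float, way: int = 2) -> dict:
    """Striped mirrors: n_disks split into mirror vdevs of `way` disks each."""
    if way < 2 or n_disks < way or n_disks % way:
        raise ValueError("n_disks must be a positive multiple of the mirror width")
    vdevs = n_disks // way
    return {
        "usable_tb": vdevs * disk_tb,    # one disk's worth of capacity per mirror
        "vdevs": vdevs,                  # more vdevs -> more random IOPS (rule of thumb)
        "guaranteed_failures": way - 1,  # worst case: all failures hit the same mirror
    }


def raidz_pool(n_disks: int, disk_tb: float, parity: int) -> dict:
    """One RAIDZ vdev with `parity` parity disks (1, 2 or 3)."""
    if parity not in (1, 2, 3) or n_disks <= parity:
        raise ValueError("need parity in 1..3 and more disks than parity")
    return {
        "usable_tb": (n_disks - parity) * disk_tb,
        "vdevs": 1,
        "guaranteed_failures": parity,
    }


print("mirrors:", mirror_pool(6, 4))
print("raidz2: ", raidz_pool(6, 4, parity=2))
```

With six 4 TB disks this gives 12 TB across 3 vdevs for mirrors versus 16 TB in 1 vdev for RAIDZ2: RAIDZ2 wins on capacity and on guaranteed fault tolerance, mirrors win on the number of vdevs serving random I/O.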

With replication every 1–2 min, you’re not doing synchronous HA, but you’re getting a low RPO: after a failover you lose at most roughly the writes made since the last completed replication run.
Tools like syncoid can automate near-continuous replication.
Failover requires something to orchestrate starting the VMs on the target node from the replicated ZFS datasets.
For automatic failover, something like Proxmox VE + ZFS replication + HA could fit your needs out of the box.
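
To put a number on the data loss window, here is a small sketch. It assumes snapshot-based asynchronous replication on a fixed schedule, where a replica only becomes usable once its transfer has finished. The function names and the example figures (2 minute interval, 15 second transfer) are made up for illustration.

```python
def worst_case_data_loss_s(interval_s: float, transfer_s: float) -> float:
    """Upper bound, in seconds of writes, lost if the source dies at the worst moment.

    The newest usable replica can be one full interval old, plus the time the
    following snapshot was still in flight when the source failed.
    """
    if interval_s <= 0 or transfer_s < 0:
        raise ValueError("interval must be positive and transfer time non-negative")
    return interval_s + transfer_s


def keeps_up(interval_s: float, transfer_s: float) -> bool:
    """True if each run finishes before the next one is due."""
    return transfer_s < interval_s


# Hypothetical figures: replicate every 2 minutes, each run takes 15 seconds.
print(worst_case_data_loss_s(120, 15), keeps_up(120, 15))
```

If the transfers regularly take longer than the interval, the schedule cannot be kept and the real window grows, so it is worth measuring how long a run takes under production write load.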
 
"zfs send / zfs receive (native replication, often wrapped by tools like syncoid or zrep)."
If the respective admin knows these tools (and their possible pitfalls) they will work fine. Both tools are well established and tested.

But for a new user of the PVE ecosystem (with regard to replication) I highly recommend staying away from third-party apps at first. Use the mechanism PVE offers: replication is built in :-)
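
As a rough illustration of the built-in route, the sketch below only assembles the pvesr commands that would create one replication job per target node. Nothing is executed. The VM ID 100, the node names and the "*/2" schedule (every two minutes, in PVE's calendar event format) are assumptions for this example, so check them against the pvesr man page for your version. The same jobs can also be created in the GUI under the VM's Replication tab.

```python
import shlex


def replication_job_cmd(vmid: int, job_num: int, target_node: str,
                        schedule: str) -> list[str]:
    """argv for a PVE storage replication job; job IDs look like <vmid>-<number>."""
    job_id = f"{vmid}-{job_num}"
    return ["pvesr", "create-local-job", job_id, target_node,
            "--schedule", schedule]


# Hypothetical setup: VM 100 lives on nodeA and is replicated to nodeB and nodeC.
targets = ["nodeB", "nodeC"]
cmds = [replication_job_cmd(100, i, node, "*/2") for i, node in enumerate(targets)]

for cmd in cmds:
    print(shlex.join(cmd))
```

The printed lines would be run on the node that currently hosts the VM (nodeA in this example).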

Only if you find the built-in capabilities insufficient should you look for an alternative/add-on solution. And be aware that you may quickly leave officially supported ground.

Just my personal 2 €¢...