Considering building a Proxmox 3-node cluster with ZFS replication

offerlam

Renowned Member
Dec 30, 2012
Denmark
Hi all,

We are considering a setup where nodeA holds the production data and nodes B and C receive a replica of it every one or two minutes.

The idea is to have the VMs from nodeA automatically fail over to nodeB, and if nodeB fails, fail over to nodeC.

My question is about ZFS, which I would need for this, but I have never played around with it before.

It looks from our setup that we need to go with NVMe drives attached to PCI Express cards to get the storage we need. Is this a problem? We intend to buy enterprise drives, since I see a lot of warnings about not using consumer drives. That advice was about SSDs, but I guess it applies to NVMe as well.

Also, what RAID level would you recommend? I see that you can even do RAIDZ with triple parity.

Basically I'm asking about the do's and don'ts here.

Any advice would be greatly appreciated.
 
Hi,
my five cents.

ZFS isn’t a cluster filesystem → it’s single-host by design. It can’t natively do “live shared storage” across multiple nodes (like Ceph or GlusterFS).
For replication/failover setups like you describe (A → B → C every 1–2 min), you’d use:
zfs send / zfs receive (native replication, often wrapped by tools like syncoid or zrep).
Or something like Pacemaker/Corosync + ZFS + replication tool to manage failover.
Failover logic (moving VMs to another node) is not built into ZFS. You’d need clustering software (Proxmox HA, Pacemaker, or a hypervisor with HA logic).
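To make that concrete, here is a minimal sketch of the send/receive cycle that tools like syncoid automate, assuming a pool called tank, a dataset vmdata and a target host nodeB (all placeholder names):

```bash
# One replication cycle from nodeA to nodeB (names are placeholders).
# First run: full send of an initial snapshot.
zfs snapshot tank/vmdata@repl-1
zfs send tank/vmdata@repl-1 | ssh root@nodeB zfs receive tank/vmdata

# Subsequent runs: take a new snapshot and send only the delta since the last one.
zfs snapshot tank/vmdata@repl-2
zfs send -i tank/vmdata@repl-1 tank/vmdata@repl-2 | ssh root@nodeB zfs receive -F tank/vmdata
```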

ZFS works with vdevs → you build pools out of groups of disks.
RAIDZ (single, double, triple parity):

RAIDZ1 (like RAID5) is not recommended anymore for large drives.
RAIDZ2 (like RAID6) is the minimum I’d suggest for production.
RAIDZ3 (triple parity) is good if you have many disks per vdev or very large capacity drives.

ZFS mirrors (RAID10 equivalent) give the best IOPS for VMs.
For VM workloads, mirrors are generally better than RAIDZ because random I/O is much faster.
So: if this is a VM backend, mirrored vdevs are the go-to. RAIDZ is more suited for archive/backup/cold storage.
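As an illustration (device names are examples; in practice you would use stable /dev/disk/by-id paths), the two layouts for four drives would look roughly like this:

```bash
# Four drives as two mirrored vdevs striped together ("RAID10"):
# best random I/O for VM workloads.
zpool create tank \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1

# The same four drives as one RAIDZ2 vdev: more usable capacity and any
# two drives may fail, but noticeably slower for random VM I/O.
# zpool create tank raidz2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
```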

With replication every 1–2 min you're not doing synchronous HA, but you get a very short RPO (a small data-loss window).
Tools like syncoid can automate near-continuous replication.
Failover will require orchestrating the VMs plus importing the replicated ZFS datasets on the target node.
For automatic failover, something like Proxmox VE + ZFS replication + HA could fit your needs out of the box.
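If you did go the syncoid route, a sketch of near-continuous replication could be as simple as a cron entry like the one below (dataset, target host and install path are assumptions, adjust for your own setup):

```bash
# /etc/cron.d/zfs-repl (sketch): push tank/vmdata to nodeB every 2 minutes.
*/2 * * * * root /usr/sbin/syncoid tank/vmdata root@nodeB:tank/vmdata
```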
 
  • Like
Reactions: UdoB
zfs send / zfs receive (native replication, often wrapped by tools like syncoid or zrep).
If the respective admin knows these tools (and their possible pitfalls) they will work fine. Both tools are well established and tested.

But for a new user of the PVE ecosystem (with regard to replication) I highly recommend staying away from third-party tools at first. Use the mechanism PVE offers - replication is built in :-)

Only if you find the built-in capabilities insufficient should you look for an alternative/add-on solution. And be aware that you may quickly leave officially supported ground.
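For completeness, the built-in storage replication is configured per guest (or at datacenter level) in the GUI under Replication, or with the pvesr CLI; a rough sketch, where the VM ID, job ID, node name and schedule are example values:

```bash
# Replicate VM 100's disks to nodeB every two minutes (job ID "100-0",
# the node name and the '*/2' calendar-event schedule are examples).
pvesr create-local-job 100-0 nodeB --schedule '*/2'

# Inspect configured jobs and their last run / status.
pvesr list
pvesr status
```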

Just my personal 2 €¢...
 
@UdoB : I agree with you. I just noted the important points; he can decide how to proceed. When you want to use free software in a semi-professional or professional setting, it is normal to take on some risk, especially if you are not strong at administration. We have given the guidelines; if he runs into trouble we can help, and we will do so as much as we can.
 
  • Like
Reactions: UdoB
Hi guys,

Thanks so much for your inputs!
Don't hold back :)

It was our intention to run with the built-in solution in Proxmox, not adding more outside features.

The idea being KIS - keep it simple. We don't want the complexity of storage clusters, and we can live with one to two minutes of potential data loss.

We do need the failover to be automatic, which is why we want to introduce HA, and therefore a 3-node setup.

I do feel, though, that the HA in Proxmox is very stable and proven, so I don't see the same complexity here. Am I wrong?

What about disks in this kind of setup?
I read that for SSDs you should only go enterprise, since ZFS does more writes than other file systems and you risk killing the disk sooner than expected. I also read somewhere about reduced IOPS if you use consumer drives.

We may have to use NVMe. Is the same true for these as for SSDs? I was thinking of mounting 4 NVMe drives on a PCIe 3.0 x16 card and doing RAID10. Is this completely bonkers?
 
  • Like
Reactions: Johannes S and UdoB
I read that for SSDs you should only go enterprise, since ZFS does more writes than other file systems and you risk killing the disk sooner than expected. I also read somewhere about reduced IOPS if you use consumer drives.
Yes. And yes. It has been discussed so many times... I won't repeat the arguments.

We may have to use NVMe. Is the same true for these as for SSDs?
Yes.

I must admit that in my homelab all of my NVMe drives lack PLP (while nearly all of my SSDs have it). For my $Dayjob cluster I would never do that.

Yes, NVMe has much higher IOPS (and bandwidth) than SATA, but the stress factor is the same. Currently I have a dying ("Percentage Used: 101%", but still working...) 2 TB mirror of consumer NVMe drives after only 10 months in use. While there are several unproblematic VMs running on it, only one VM was continuously writing monitoring data. This wouldn't happen with "good" devices.
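For reference, that "Percentage Used" value comes from the drive's own NVMe health log and can be read with nvme-cli or smartmontools (the device path is just an example):

```bash
# Wear indicator from the NVMe health log; values above 100% mean the
# rated endurance has been exceeded (the drive may still keep working).
nvme smart-log /dev/nvme0

# Roughly the same data via smartmontools.
smartctl -a /dev/nvme0
```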

I was thinking of mounting 4 NVMe drives on a PCIe 3.0 x16 card and doing RAID10. Is this completely bonkers?
Sorry, with no actual experience of that setup I cannot answer.
 