[SOLVED] HA with ZFS

kenneth_vkd · Jun 3, 2021

Hi
We are looking into hardening our infrastructure to better handle outages caused by network or hardware failures.
We use OVH dedicated physical servers for our infrastructure and currently have a 3-node PVE cluster. Each node is configured with 2x4TB NVMe drives. One of the nodes is restricted primarily to Windows VMs and has only 1 CPU to keep licensing costs down, since only a low volume of clients request services that have to run on Windows.
All servers have access to the OVH vRack system.

Currently all servers have the disks configured as a ZFS mirror, but we are looking into configuring HA in Proxmox. We can however see that HA works best (only?) with shared storage (network share) or distributed storage (Ceph).
As our monitoring of storage health is currently based on the emails generated by zfs-zed, we would prefer not to redo our tooling and were therefore thinking of alternative ways to handle this.

We therefore have the following questions, which we hope that someone can help us find the answers for:
- Can you implement HA by enabling replication of the required VMs between nodes in the same HA group and still benefit from the online/live migration or do we need to implement something like Ceph?
- If we have to implement Ceph, can it then be done on top of the ZFS pools for us to keep current storage monitoring tools or would we need to bring up Ceph directly on the bare disks and implement new tooling for storage health monitoring?
- If it is necessary to bring up Ceph to get working HA, can we bring up new nodes in our existing cluster, deploy Ceph only across the new nodes, and then migrate VMs from the current non-HA setup in the same cluster?

fiona · Jun 7, 2021

Hi,

kenneth_vkd said:
We therefore have the following questions, which we hope that someone can help us find the answers for:
- Can you implement HA by enabling replication of the required VMs between nodes in the same HA group and still benefit from the online/live migration or do we need to implement something like Ceph?

Shared storage is better, but HA (and online migration with replicated disks) also works with replicated ZFS nowadays (qemu-server >= 6.1-9, pve-ha-manager >= 3.1-1). In case of a node failure, the data since the last replication will be lost, so it's best to choose a tight enough replication schedule.
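As a rough sketch of what a tight replication schedule could look like in practice, the helpers below build the `pvesr` and `ha-manager` command lines for one VM. The VM ID `100`, the target node name `pve2`, and the function names are hypothetical; the `*/5` schedule (every five minutes) is only an example value, and the commands would have to be run on a cluster node.

```python
import subprocess


def replication_job_cmd(vmid, target_node, schedule="*/15", job_num=0):
    """Build a `pvesr create-local-job` command replicating a guest to target_node.

    The job ID has the form <vmid>-<job_num>; the schedule uses the Proxmox
    calendar event syntax, e.g. "*/5" for every five minutes.
    """
    return [
        "pvesr", "create-local-job", f"{vmid}-{job_num}", target_node,
        "--schedule", schedule,
    ]


def ha_add_cmd(vmid):
    """Build an `ha-manager add` command registering the VM as an HA resource."""
    return ["ha-manager", "add", f"vm:{vmid}"]


def apply(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical VM 100 replicated to hypothetical node "pve2" every 5 minutes.
plan = [replication_job_cmd(100, "pve2", schedule="*/5"), ha_add_cmd(100)]
```

Calling `apply(plan)` on a node would create the job and the HA resource. With a five-minute schedule, up to roughly five minutes of writes (plus the time a replication run takes) could be lost if the source node fails.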

kenneth_vkd said:
- If we have to implement Ceph, can it then be done on top of the ZFS pools for us to keep current storage monitoring tools or would we need to bring up Ceph directly on the bare disks and implement new tooling for storage health monitoring?

No, Ceph needs control over its disks.

kenneth_vkd said:
- If it is necessary to bring up Ceph to get working HA, can we bring up new nodes in our existing cluster, deploy Ceph only across the new nodes, and then migrate VMs from the current non-HA setup in the same cluster?

Yes, you can configure HA groups to ensure that VMs are only migrated to nodes where the shared storage is actually available.

kenneth_vkd · Jun 7, 2021

Thank you for the reply.
Are there any recommended tools that can help monitor disk health when Ceph has control over the disks?
With zfs-zed, our system administrators get a notification when a disk failure is detected.

fiona · Jun 8, 2021

See here and here. I'm not sure there's anything for email notifications out of the box, but a simple script checking the cluster health might do the job.
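A minimal sketch of such a script, assuming it runs from cron on a node with the `ceph` CLI and a local SMTP relay. The JSON keys used here (`status`, `checks`, `summary.message`) reflect what recent Ceph releases print for `ceph health --format json`, but that layout is an assumption to verify on your version; the addresses and function names are placeholders.

```python
import json
import smtplib
import subprocess
from email.message import EmailMessage


def summarize_health(raw_json):
    """Return (overall status, one line per reported health check)."""
    data = json.loads(raw_json)
    status = data.get("status", "UNKNOWN")
    lines = []
    for name, check in sorted(data.get("checks", {}).items()):
        severity = check.get("severity", "")
        message = check.get("summary", {}).get("message", "")
        lines.append(f"{name} [{severity}]: {message}")
    return status, lines


def build_alert(raw_json, sender, recipient):
    """Return an EmailMessage if the cluster is not HEALTH_OK, else None."""
    status, lines = summarize_health(raw_json)
    if status == "HEALTH_OK":
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Ceph health: {status}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines) or "No details reported.")
    return msg


def main():
    # Query the cluster; raises CalledProcessError if the ceph CLI fails.
    raw = subprocess.run(
        ["ceph", "health", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Placeholder addresses; replace with real ones.
    alert = build_alert(raw, "ceph@example.invalid", "admins@example.invalid")
    if alert is not None:
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(alert)
```

Calling `main()` every few minutes from cron would give behaviour roughly comparable to the zfs-zed emails: nothing is sent while the cluster reports `HEALTH_OK`, and a mail listing the failing checks (for example a down OSD) is sent otherwise.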

kenneth_vkd · Jun 10, 2021

Thank you for the replies.
We will be looking into how we can use Ceph. Some initial testing shows that it gives us the necessary HA, so we just need to figure out the monitoring of disks.
