[Pathfinder4DIY] PVE and high-performance ZFS-over-iSCSI storage with high availability (active/passive) via TCP or RDMA on commodity HW

floh8

Renowned Member
Jul 27, 2021
!!! 1st hint: This thread is not for Ceph fanboys. Go away! !!!

Occasion:
In the Thread "Vote for Feature in ZFS-over-ISCSI" my one (the word "one" is always related to a human being, not a matter or person or number) had to put question marks in the row "
own build with free software possible" of the added PDF in the beginning because my one didnt know ad hoc how to make this possible in a simple way. The problem is that one cant define once a iscsi device+lun for the cluster and it works for ever because the iscsi block devices and iscsi luns will create during operation, every time one create a vdisk with zfs-over-iscsi on the PVE. RSF-1 support all this out of the box but RSF-1 is based on its own cluster stack. Therefor one is extremely limited in tuning the storage solution for own use case. And of course one payed license costs to run it. Also, for a DIY way of this solution there are no Howtos or Manuals in the internet. So the challenge was accepted.

IMPORTANT INFORMATION:
This is only a base manual for building such a solution; it is not ready for production and has not been tested in a production environment. For production one has to use a different STONITH device, a hardware watchdog and more resource agents, and of course has to tune the Pacemaker/STONITH/watchdog timeouts for your environment. This solution is also not aimed at lightweight admins: the learning curve for technologies like Pacemaker and STONITH is steep. My one started this challenge with zero knowledge of Corosync, Pacemaker and targetcli, so my one needed some days to get this simple and small test environment to work error-free. Once you have found the right settings, building such a solution for a small pool takes just a few minutes. But one needs deeper know-how to repair such a cluster in strange failure situations.


Use case for such a solution:
This solution is especially for companies that want a highly available, full-featured, cheap and extremely fast storage solution used with PVE and ZFS-over-iSCSI. Of course, one can also use this solution without RDMA support.


Available functions:
  • all ZFS functions like thin provisioning, compression, deduplication, SLOG and L2ARC
  • High-Availability Storage (active/passive)
  • SAS multipathing support
  • SLOG and L2ARC on SSDs or NVMe drives
  • full-flash SSD and NVMe pools
  • PVE VM snapshots
  • RDMA support possible
  • direct LUN access without network-filesystem- and qemu-file-overhead
  • separate LUN for every vdisk (no concurrent iSCSI access)
  • individual volblocksize for every vdisk
  • Web-UI
  • expandable to an active/active storage cluster when combined with NFS, SMB or NVMe-oF
Some hints for a full-flash NVMe storage solution:

ZFS-over-iSCSI uses, as the name already says, iSCSI as the access protocol. That means the iSCSI protocol is used for the network connection and the cluster node has to translate iSCSI commands into NVMe commands, because NVMe drives only speak the NVMe protocol (similar to Ceph, which does not use the NVMe protocol for the connection between cluster nodes). This results in higher latency and lower bandwidth. Unfortunately, there is no zfs-over-nvme plugin for PVE yet to use the full power of NVMe drives, but your one can vote for it: https://bugzilla.proxmox.com/show_bug.cgi?id=6339.
If one really wants to unleash the real power of NVMe with PVE, one can either use "local ZFS" or this solution here in combination with shared LVM-thick on top of NVMe-oF. Both options have their own disadvantages compared to ZFS-over-iSCSI; your one can find a summary of these in the attached PDF file of this thread. How to build such a full-speed NVMe HA storage for PVE, your one can read here in my last thread of this series.

Missing functions:
My one did not find a simple solution for pool load balancing and therefore scale-out the way RSF-1 can do it. For this, one would have to change the Pacemaker config at runtime. The best way to make this happen would be to edit the ZFS-over-iSCSI plugin of PVE and add the corresponding pacemaker and targetcli commands. This is hard scripting work and my one is neither a good bash scriptwriter nor a developer; it would need too much time. Therefore there is a Bugzilla enhancement request for a Proxmox storage server product, especially for ZFS-over-iSCSI and a lot more functions. It would be nice if your one would vote for that. If one configures a second ZFS pool with another access protocol than iSCSI, like NFS or NVMe-oF, then of course one can have a load-balanced active/active HA storage cluster with this solution. One can combine these different access protocols in one single ZFS storage cluster (active/active), but one needs different pools for that.


Key configuration of this solution:
The iSCSI target configuration is made available to the currently active node via shared storage (it lives on the ZFS pool itself).
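As a minimal sketch of this idea, assuming a pool named zpool1 and the Debian targetcli-fb paths used later in this guide, the targetcli/rtslib save file simply lives on the pool and fails over together with it:

# zfs create zpool1/iscsi-target
# rm -r /etc/rtslib-fb-target && ln -s /zpool1/iscsi-target /etc/rtslib-fb-target
(a "saveconfig" in targetcli then writes /zpool1/iscsi-target/saveconfig.json, which the passive node can load after importing the pool)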


HW-Requirements:
  • 2 nodes, each with 2 high-bandwidth network ports in a bond (optionally with RDMA), 1 management port and an HBA SAS controller with external ports
  • 1 dual-controller JBOD shelf (my advice for production: use 3 JBOD shelves and stripe a 3-way mirror pool over them; this way one gets higher HW availability)
To admins who prefer a storage cluster without shared JBODs: choosing shared shelves combined with a clever JBOD connection design, e.g. 2 or 3 single-controller shelves with a ZFS pool spread over all of them, not only saves money but also increases reliability.

SAN Design could look like this:

JBOD-Design.png



Operation System:
My test environment was built on Debian, but for production my advice is to use a Red Hat clone distro like Rocky. For storage it's important to use the most stable Linux distro with Corosync, Pacemaker and RDMA support that your one knows of. If your one likes FreeBSD: this is not a good choice, because the corosync and pacemaker packages have no maintainer at the moment, and the PVE ZFS-over-iSCSI plugin only supports the user-space iSCSI target of FreeBSD, so your one does not get the best performance out of the iSCSI connections.


My test environment base setup:
Node1:
  • Hostname: d1
  • Storage Net IP: 192.168.3.2
  • MGMT Net IP: 192.168.56.101
  • Cluster Net IP: 192.168.2.2
Node2:
  • Hostname: d2
  • Storage Net IP: 192.168.3.3
  • MGMT Net IP: 192.168.56.103
  • Cluster Net IP: 192.168.2.3
Cluster IPs:
  • Storage cluster IP: 192.168.3.1
  • MGMT cluster IP: 192.168.56.11

Additionally used packages:
targetcli-fb, pacemaker, corosync, pcs, zfsutils-linux, network-manager, sbd
In Red Hat clone distros some packages are named differently, e.g. "targetcli" instead of "targetcli-fb".
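On Debian the listed packages can be installed in one go (a sketch; depending on the distro, resource-agents and fence-agents may additionally be needed for the ZFS/IPaddr2 agents and fence_sbd used below):

# apt install targetcli-fb pacemaker corosync pcs zfsutils-linux network-manager sbd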


Needed resource agents:
  • ZFS
  • IPaddr2
  • systemd

Fencing solution:
In my test environment softdog + SBD is used, but for production environments it's better to use a HW watchdog and a fencing agent like "fence_scsi".
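A minimal sketch of the softdog + SBD part used in the test environment, assuming /dev/sdd is a small shared LUN on the JBOD that both nodes can see (the same device appears in the stonith commands below):

# echo softdog > /etc/modules-load.d/softdog.conf
# modprobe softdog
# sbd -d /dev/sdd create     (initialize the shared SBD device once, from one node only)
Then set in /etc/default/sbd on both nodes:
SBD_DEVICE="/dev/sdd"
SBD_WATCHDOG_DEV="/dev/watchdog"
SBD_STARTMODE="always"
# systemctl enable sbd       (sbd is started/stopped together with the cluster stack)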


Configuration:
  • create your zfs-pool
--> following on both nodes:
  • add both nodes to /etc/hosts
  • disable the rtslib-fb-targetctl service at system boot (Pacemaker will start it on the active node)
  • create a symlink from /etc/rtslib-fb-target to the ZFS pool, e.g.: # ln -s /zpool1/iscsi-target /etc/rtslib-fb-target
  • set the password of the hacluster user (used by "pcs host auth"): # passwd hacluster
  • configure sbd and activate the softdog kernel module via /etc/modules-load.d/
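A condensed sketch of these per-node steps (the name resolution below assumes the storage-net IPs from the test setup; the symlink was already shown under "Key configuration" and the sbd/softdog part under "Fencing solution"):

# cat >> /etc/hosts <<EOF
192.168.3.2  d1
192.168.3.3  d2
EOF
# systemctl disable rtslib-fb-targetctl     (Pacemaker will start it later as systemd:rtslib-fb-targetctl)
# passwd hacluster                          (same password on both nodes)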
--> following on one cluster node:
  • create the iSCSI target with initiator ACLs, the storage cluster IP as portal IP, and enable RDMA (iSER) if used (a targetcli sketch follows after the command listing below)
  • create base corosync cluster
# pcs host auth d1 d2
# pcs cluster destroy (in Debian one must first delete the default cluster-config)
# pcs cluster setup zfs-cluster d1 addr=192.168.3.2 addr=192.168.2.2 d2 addr=192.168.3.3 addr=192.168.2.3
# pcs cluster start --all
  • pacemaker config with
# pcs property set no-quorum-policy=ignore
# pcs property set stonith-enabled=true
# pcs resource defaults update resource-stickiness=200
# pcs resource create res_zpool1 ZFS pool="zpool1" importargs="-d /dev/" op start timeout="90" op stop timeout="90"
# pcs resource create res_cluster-ip IPaddr2 ip=192.168.3.1 cidr_netmask=24 iflabel=cluster
# pcs resource create res_cluster-ip_MGMT IPaddr2 ip=192.168.56.11 cidr_netmask=24 iflabel=cluster
# pcs resource create res_iscsi_target_load systemd:rtslib-fb-targetctl
# pcs resource group add grp_zfs-cluster res_zpool1 res_cluster-ip res_cluster-ip_MGMT res_iscsi_target_load (for ordering and colocation)
# pcs stonith create stonith_d1 fence_sbd devices=/dev/sdd pcmk_host_list=d1
# pcs stonith create stonith_d2 fence_sbd devices=/dev/sdd pcmk_host_list=d2
# pcs constraint location stonith_d1 avoids d1
# pcs constraint location stonith_d2 avoids d2
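For the iSCSI target bullet above, a hedged targetcli sketch could look like the following; the target IQN and initiator IQN are placeholders, only the portal IP (the storage cluster IP) comes from this setup. With the symlinked config directory, "saveconfig" stores everything on the pool:

# targetcli
/> /iscsi create iqn.2003-01.org.linux-iscsi.zfs-cluster:target1
/> /iscsi/iqn.2003-01.org.linux-iscsi.zfs-cluster:target1/tpg1/portals delete 0.0.0.0 3260
/> /iscsi/iqn.2003-01.org.linux-iscsi.zfs-cluster:target1/tpg1/portals create 192.168.3.1 3260
/> /iscsi/iqn.2003-01.org.linux-iscsi.zfs-cluster:target1/tpg1/acls create iqn.1993-08.org.debian:01:pve1
/> saveconfig
/> exit

Afterwards # pcs status should show the whole grp_zfs-cluster group running on one node and each stonith resource on the opposite node.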
PVE configuration:
Your one uses the ZFS-over-iSCSI storage plugin and selects the LIO provider. The other fields are filled in as usual (for the IP address -> the storage cluster IP, etc.) -> see the sources section.
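A hedged example of how the resulting entry in /etc/pve/storage.cfg might look (storage ID, target IQN and blocksize are placeholders; portal, pool and provider come from this setup):

zfs: zfs-cluster
        pool zpool1
        portal 192.168.3.1
        target iqn.2003-01.org.linux-iscsi.zfs-cluster:target1
        iscsiprovider LIO
        lio_tpg tpg1
        blocksize 8k
        sparse 1
        content images

Keep in mind that the plugin also needs passwordless SSH access from the PVE nodes to the portal IP (key files under /etc/pve/priv/zfs/), since it creates the zvols and LUNs remotely.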

Tuning:
  • For better testing of cluster failures my one changed the resource default values with # pcs resource defaults update migration-threshold=2 failure-timeout=10s
  • For fast failover in case of a network incident one can set # pcs resource update res_cluster-ip op monitor interval=2s timeout=5s
RDMA support:
PVE host: As mentioned in this post, one can edit the ZFS-over-iSCSI plugin to use RDMA (iSER).
Storage node: Set enable_iser to true on the portal in your targetcli configuration.
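A hedged sketch for the storage-node side, using the example target/portal names from above:

# targetcli /iscsi/iqn.2003-01.org.linux-iscsi.zfs-cluster:target1/tpg1/portals/192.168.3.1:3260 enable_iser true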

Web-UI:
The pcs daemon (pcsd) ships with a Web UI for configuring and monitoring Corosync + Pacemaker. A Cockpit plugin also exists. For Debian one has to build it from the pcs sources; the Red Hat clone distros already ship it. For Cockpit there are also ZFS plugins, such as the one from 45Drives (Rocky 8+9). napp-it offers a ZFS GUI for Debian.
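On the Red Hat clones the Web UI is typically reachable as soon as pcsd runs (a sketch, assuming default settings):

# systemctl enable --now pcsd
then browse to https://<mgmt-IP-of-a-node>:2224 and log in as the hacluster user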

Errors and solutions:
Some Pacemaker errors showed up, hence the tuning tips above. As already mentioned, Debian ships a default Corosync configuration that has to be deleted first.

Tested failure scenarios:
  1. Cluster network card breakdown
  2. Full Node breakdown
Unfortunately a JBOD failure on one node was also planned, but fsfreeze does not work with ZFS. Please give some hints on how to achieve this!
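A hedged sketch of how such scenarios can be driven and observed manually (the interface name is a placeholder):

# crm_mon -r                          (watch the resources during the tests)
# pcs node standby d1                 (graceful failover: grp_zfs-cluster should move to d2)
# pcs node unstandby d1
# ip link set <cluster-nic> down      (scenario 1: cluster network card breakdown on the active node)
# echo b > /proc/sysrq-trigger        (scenario 2: hard reset to simulate a full node breakdown)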

Used information sources:
Extended projects:
  • Can one have a highly available PBS solution based on this storage solution? -> yes, look here
  • Can one use it as a Samba-share NAS with an Active Directory connection, for example for a big media division?

Everyone can also send a direct message (no PMs, floh8 is not a person or matter and floh8 is not located on a ship) to floh8 if there are requests regarding this solution.

AddOn:

Comparison between this DiY storage solution and RSF-1

Feature | DIY HA ZFS-over-iSCSI | RSF-1 HA ZFS-over-iSCSI
Base OS | Red Hat, Rocky, Alma Linux, Oracle Linux, Debian | Rocky, Alma Linux, Debian, Solaris, OmniOS, FreeBSD
High availability | yes (active/passive) | yes (active/active)
Scale up / out | yes / no | yes / yes
Load balancing | no | yes (pool dependent)
RDMA support | yes | yes
SAS multipathing | yes | yes
iSCSI multipathing | no | no
Usage as SMB/NFS NAS with Active Directory | possible | yes
Commodity HW | yes | yes
GUI support | possible | yes
Customizable (timeouts, intervals) | yes, highly | no
Professional support | with Red Hat license | yes
Admin know-how level | high | middle
Cluster SW build time | some days or weeks (depending on know-how) | max. 2 h
Price | free | 1st year: 599 $ (lifetime license, unlimited storage space)