[Pathfinder4DIY] PVE and high performance Full NVME Storage with High-Availability (Active/Active) via TCP or RDMA on commodity HW

floh8

Renowned Member
Jul 27, 2021
!!! 1st hint: This thread is not for Ceph fanboys. Go away! !!!

Occasion:

In the thread about the zfs-over-iscsi HA storage solution I mentioned that zfs-over-iscsi is not the best protocol stack for NVMe devices. So the question arose what a good NVMe HA storage solution for PVE would look like, because there is no zfs-over-nvme plugin for PVE that could use the full power of NVMe drives (you can vote for one here: https://bugzilla.proxmox.com/show_bug.cgi?id=6339).
Also, NVMe devices have reached very attractive prices, so SMBs can build extremely fast storage, for example for database VMs. There are already various vendors on the market offering full-NVMe appliances with high availability (mostly active/passive), but these are very expensive solutions and mostly do not offer the full ZFS feature spectrum. The simplest way to get a fast solution with PVE is of course local ZFS. Although one only loses the data of a small time window if a PVE failover happens, local ZFS is no option for bigger SMBs with a high data change rate in their data vdisks. Ceph is often mentioned as an alternative, but Ceph does not exhaust the full write performance of NVMe because it does not use the NVMe protocol for its internal data exchange. One can mitigate this by configuring RDMA, but compared to local ZFS or an NVMe appliance Ceph is quite sluggish. For some use cases this is acceptable, but power users want to get the most speed out of their storage investment and prefer to give up some features rather than be slower. Which alternatives compatible with PVE are faster than Ceph while keeping most of the features of Ceph and local ZFS? A HA ZFS storage combined with a shared JBOD and consumed as shared LVM over NVMe-oF would be an answer.


IMPORTANT INFORMATION:
This solution is based on my zfs-over-iscsi HA storage solution, so the information there is also relevant for this solution. I tested it only in a test environment, not in production. An extra hint regarding NVMe and NVMe-oF: just as one can optimize the iSCSI target and client configuration for more throughput or lower latency, there are many dependencies with NVMe usage like CPU tuning, NVMe queue tuning and various other parameters. NVMe-oF also has some issues compared to iSCSI that you should be aware of; these are well explained in this article. More information can be found under the subtitle "Limitations".


Use case for such a solution:
This solution is especially for companies that want to get the maximum performance out of their NVMe devices from a highly available storage solution with ZFS and a shared JBOD shelf, or that would accept a slower Ceph but whose PVE nodes cannot be upgraded to be NVMe compatible.


Available functions:
  • all ZFS functions like thin provisioning, compression, deduplication, SLOG and L2ARC
  • High-availability storage (active/active)
  • NVMe multipathing (when using dual-controller shelves)
  • Storage snapshots
  • RDMA support
  • direct LUN access without network-filesystem and qemu-file overhead
  • Web UI
  • Pool-dependent load balancing and scale-out

Limitations:
The main disadvantages of thick LVM on top of NVMe-oF can be read in the attached PDF about the PVE storage type comparison in this thread and in this great article from Blockbridge. So the question is how to work around these limitations:

1. Performance degradation with the chained-qcow2 snapshot technology

Do not activate this feature when creating the LVM thick storage in PVE.

2. No snapshot support (because of point 1 above)

There are two ways to deal with this. First: why does one need snapshots at all when a backup solution like PBS is in place? Even auto-snapshot-like behaviour with an interval of 5 minutes is possible with a short backup task interval. Using the backup solution also has another advantage over snapshots: at the moment PVE does not support excluding individual vdisks from a VM snapshot, so if, for example, you only want to snapshot the OS vdisk before a regular OS update, that is not possible.
The second solution is to use the storage snapshot feature of ZFS, but this means one has to create a zvol, the pacemaker configuration and the PVE LVM and storage configuration for every new vdisk (a minimal sketch follows below). In my eyes this only makes sense if the fast NVMe storage is used for just a handful of vdisks, for example for DB VMs.
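As an illustration of the second path, here is a minimal sketch; the pool name zpool1 and the zvol name vm-101-db are examples, and the commands run on the storage node that currently owns the pool:

Code:
## dedicated zvol that gets exported as its own namespace for exactly one vdisk
# zfs create -V 200G zpool1/vm-101-db
## storage-side snapshot and, if needed, rollback of exactly this vdisk
# zfs snapshot zpool1/vm-101-db@before-os-update
# zfs rollback zpool1/vm-101-db@before-os-update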

3. Linux does not implement multi-host starvation controls (“Dynamic Queue Limits”) for NVMe

This is a big problem for companies with a high write load on a single big shared LUN. This was also a problem for iSCSI in the past, and admins solved it with a simple trick: to lower the write load per LUN, one configures not one big LUN but many smaller ones. With LVM on top of them, one can span a single big volume group across all of these LUNs and combine them into one big LVM thick storage, which is the same thing VMFS offers for block storage. A sketch of this follows below.
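A minimal sketch of that trick on the PVE side; the device names /dev/nvme1n1 to /dev/nvme1n4 and the volume group name vg_nvme are examples and depend on how the namespaces show up on your hosts:

Code:
## every smaller namespace appears as its own block device after the NVMe-oF connect
# pvcreate /dev/nvme1n1 /dev/nvme1n2 /dev/nvme1n3 /dev/nvme1n4
## one volume group spanning all of them becomes the single big LVM thick storage
# vgcreate vg_nvme /dev/nvme1n1 /dev/nvme1n2 /dev/nvme1n3 /dev/nvme1n4
# vgs vg_nvme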


HW-Requirements:
  • 2 nodes with many PCIe v4 slots, dual-socket CPUs with full bandwidth to the PCIe lanes, an NVMe controller with external ports, 2 high-bandwidth network ports in a bond with RDMA support, 1 management port
  • 1 dual-controller JBOD shelf (my advice for production: use 3 single- or dual-controller JBOD shelves and stripe a 3-way mirror pool over them; that way you get higher hardware availability)
NVMe disk shelves are very expensive, especially the dual-controller ones. Admins who would rather build a storage cluster without a shared JBOD should note that choosing shared shelves combined with a clever JBOD connection design, for example 2 or 3 single-controller shelves with a ZFS pool spread over all of them, not only saves money but also increases reliability (see the graphic in the HA zfs-over-iscsi project thread).


Operating System:
The same as in the HA zfs-over-iscsi project. So have a look there.


My test environment base setup:
The same as in the HA zfs-over-iscsi project. So have a look there.


Additionally used packages:
pacemaker, corosync, pcs, network-manager, sbd, nvme-cli, zfsutils-linux
In Red Hat clone distros some of these packages are named differently.


Needed resource agents:
  • ZFS
  • IPaddr2
  • nvmet-subsystem
  • nvmet-namespace
  • nvmet-port

Fencing solution:
The same as in the HA zfs-over-iscsi project. So have a look there.


Configuration:
The configuration steps for the cluster build and the pacemaker configuration for ZFS and the cluster IP are all the same as in the HA zfs-over-iscsi project, so have a look there.
Only the resource agent for the iSCSI service and the resource group are not necessary. The additional steps for the HA NVMe configuration follow below.

---> all of the next steps have to be done on both nodes
  • install nvme-cli and load the NVMe target (nvmet) module at boot
Code:
# apt install nvme-cli
# echo "nvmet" > /etc/modules-load.d/nvmet.conf
--> the next steps only on one node
  • create a temporary pacemaker configuration file and load it
Code:
# pcs cluster cib nvmet_config
# pcs -f nvmet_config resource create res_nvme-subsystem nvmet-subsystem nqn=nvme-nqn0
# pcs -f nvmet_config resource create res_nvme-namespace nvmet-namespace nqn=nvme-nqn0 namespace_id=10 backing_path="/dev/zd0" uuid=aea2016e-7ebc-4be3-ba44-789d4cb4d17c nguid=a004f44f-a0bb-46d5-bec3-20407700cca1
# pcs -f nvmet_config resource create res_nvme-port nvmet-port port_id=0 type=tcp addr_fam=ipv4 svcid=4420 addr=192.168.3.1 nqns=nvme-nqn0
# pcs -f nvmet_config resource group add grp_nvme res_zpool1 res_cluster-ip res_cluster-ip_MGMT res_nvme-subsystem res_nvme-namespace res_nvme-port
# pcs cluster cib-push nvmet_config --config
Info: use the uuidgen utility to generate the uuid and nguid values for the namespace.
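To verify that the resource group is running and that the target answers on the cluster IP, something like the following can be used (pcs status on a cluster node, the discover command from any host that can reach the cluster IP; address and port match the nvmet-port resource above):

Code:
# pcs status
# nvme discover -t tcp -a 192.168.3.1 -s 4420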

What a cluster status could look like:

[attached screenshot: NVME-Cluster-status.png]

PVE configuration:
--> on every PVE Host
  • install nvme-cli with # apt install nvme-cli
  • load the nvme_tcp module at boot with # echo "nvme_tcp" > /etc/modules-load.d/nvme_tcp.conf
  • make the NVMe target connection persistent with # echo "-t tcp -a 192.168.3.1 -s 4420 -p" | tee -a /etc/nvme/discovery.conf
  • enable the NVMe auto-connect service at boot with # systemctl enable nvmf-autoconnect
--> on 1 PVE host
  • create the LVM volume group in menu /Host/PVE/Disks/LVM
  • create the thick LVM storage with the options "Shared" and "Wipe Removed Volumes" enabled in menu /Datacenter/Storage (a CLI equivalent is sketched after this list)
  • a good guide for PVE and thick LVM can be found on Blockbridge's site (the multipathing part can be skipped)
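For reference, a possible CLI equivalent of the storage step; the storage name nvme-ha and the volume group name vg_nvme are just examples, and "Wipe Removed Volumes" corresponds to the saferemove option:

Code:
# pvesm add lvm nvme-ha --vgname vg_nvme --shared 1 --content images,rootdir --saferemove 1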

RDMA support:

--> on every PVE Host
  • install the driver and modules for your RDMA network card
  • Cluster: in the pacemaker command for the resource res_nvme-port, change the parameter type=tcp to type=rdma.
  • PVE host: change the parameter -t tcp to -t rdma in the persistent discovery entry (see the sketch after this list).
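A rough sketch of those changes, assuming the resource and discovery entry from above (the RDMA-capable modules must of course be available on both sides):

Code:
## on the storage cluster: switch the port resource from TCP to RDMA
# pcs resource update res_nvme-port type=rdma
## on every PVE host: load nvme_rdma at boot and replace the tcp discovery line with an rdma one
# echo "nvme_rdma" > /etc/modules-load.d/nvme_rdma.conf
# echo "-t rdma -a 192.168.3.1 -s 4420 -p" > /etc/nvme/discovery.conf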


Tested failure scenarios:
  1. Cluster network card breakdown
  2. Full Node breakdown

Load balancing configuration:

Because no namespace configuration for the block devices is needed during normal operation, this solution can also be used with load balancing. For that, one has to create a second ZFS pool with its zvols and the whole NVMe-oF pacemaker configuration for a second NVMe subsystem. Then each ZFS pool and subsystem must be pinned to one of the two cluster nodes (a sketch follows below).
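Pinning could look roughly like this, assuming a second group grp_nvme2 that was built the same way as grp_nvme (the group name and the node names node1/node2 are examples):

Code:
# pcs constraint location grp_nvme prefers node1=100
# pcs constraint location grp_nvme2 prefers node2=100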

Used information sources:
The same as in the HA zfs-over-iscsi project, so have a look there, plus the following:

Extended projects:
  • I did not find a configuration that uses the native multipathing feature of the Linux kernel NVMe/TCP and NVMe/RDMA stack, but some internet posts show that this is possible with current stable Linux kernels (a quick check is sketched below). Why protocol-level multipathing is always better than LAG load balancing (OSI layer 2) can be read in various internet articles about iSCSI multipathing.
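A quick, purely informational check on a PVE host to see whether native NVMe multipathing is enabled in the running kernel (not a full multipath configuration):

Code:
## prints Y if native NVMe multipathing is enabled, N otherwise
# cat /sys/module/nvme_core/parameters/multipath
## lists the NVMe subsystems and the controllers/paths that belong to them
# nvme list-subsys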
Anyone can also send a direct message to floh8 if there are requests regarding this solution.
 