!! Vote for the feature request for zfs-over-iscsi storage !!

floh8

Only a small group of PVE admins have recognised that zfs-over-iscsi beats all the other PVE storage connection options in functionality and flexibility. See the attachment to this post for a detailed storage comparison. The only disadvantages of zfs-over-iscsi at the moment are the missing RDMA and multipathing support. The reason is that, at the time, the PVE developers decided to use QEMU's integrated iSCSI implementation instead of open-iscsi in order to keep the programming effort low.
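For context, this is roughly what the plugin hands to QEMU today. This is a simplified sketch only: the LUN lookup and URL lines come from the path() sub visible in the patch further down this thread, while the surrounding lines are reconstructed and may not match the current source exactly.

Code:
# simplified sketch of path() in PVE/Storage/ZFSPlugin.pm
# QEMU opens this URL with its built-in libiscsi initiator, so the LUN
# never appears as a block device on the PVE host and the kernel's
# open-iscsi/multipath stack is bypassed entirely
sub path {
    my ($class, $scfg, $volname, $storeid, $snapname) = @_;

    my ($vtype, $name, $vmid) = $class->parse_volname($volname);

    my $portal = $scfg->{portal};
    my $target = $scfg->{target};

    my $guid = $class->zfs_get_lu_name($scfg, $name);
    my $lun  = $class->zfs_get_lun_number($scfg, $guid);

    # a single TCP path: no iSER (RDMA) and no multipathing
    my $path = "iscsi://$portal/$target/$lun";

    return ($path, $vmid, $vtype);
}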

To change this, vote for the feature request, ideally just with a short comment like "+1": https://bugzilla.proxmox.com/show_bug.cgi?id=6638
You will also find additional information there.

Also disappointing is that only one vendor product, RSF-1, can offer zfs-over-iscsi HA with RDMA support at the moment.
If you think Proxmox should change this, then also vote for: https://bugzilla.proxmox.com/show_bug.cgi?id=6250

Thanks a lot. Together we can make PVE the most flexible and fastest virtualization environment on the market.
 

This is not a new discussion, and I'm not convinced that the reasons against such a feature have changed much since then: https://forum.proxmox.com/threads/shared-remote-zfs-storage.159929/
The link you posted is about TrueNAS and zfs-over-iscsi, I think, but this thread is about RDMA and multipathing support for the LIO implementation of zfs-over-iscsi. The decision of Proxmox not to support individual storage vendors is right in my eyes. TrueNAS can develop a functional plugin of their own, like Blockbridge did, if they are really interested in PVE.
 
ZFS over iSCSI already works… what are you talking about.

- Share drives over iSCSI to VM
- Create ZFS pool for boot/data drives in guest

Note that most iSCSI appliances already provide RAID of some sort, making the need to put ZFS on top of them kind of unnecessary; you can typically pass iSCSI virtual drives directly to the VM and use a ‘lighter’ disk management system like LVM.
 
what are you talking about.
Have you read what the OP wrote?
The only disadvantages of zfs-over-iscsi at the moment are missing RDMA and multipathing support.
This is sadly true.
The multipath support, however, should be brought up in the upstream QEMU project.

Running everything by hand is a lot of work. I implemented a POC years ago with ZFS-over-FC and tried to handle everything myself, which is a pain, yet it worked.
 
You mean iSCSI-exporting-(insert ZFS implementation here), not ZFS-over-iSCSI. iSCSI doesn't have any filesystem integration; it is just SCSI over Ethernet, so the question/description does not make sense. I know some YouTuber probably calls it that, but it is inaccurate and needs to be banished.

If you want the Proxmox devs to build a ZFS storage appliance that integrates, they would be entering a market that already has a number of solutions, and they already have a superior solution in Ceph.

You're looking for a storage plugin for <insert storage appliance here>; that is for the storage appliance maker or some enthusiast to do. Many appliance makers now have Proxmox plugins, some with official support, some experimental/use at your own peril.
 
Only a small group of PVE admins have recognised that zfs-over-iscsi beats all the other PVE storage connection options in functionality and flexibility.
That depends on how you slice the data. If a "PVE admin" is just the infrastructure admin, storage is provided by the storage team. If it's a home user, I'm not sure that what they recognize is of particular importance.

The only disadvantages of zfs-over-iscsi at the moment are missing RDMA and multipathing support.
Not from my viewpoint; those are "nice to haves". What makes zfs-over-iscsi a poor choice is the complete lack of storage PRODUCTS that support it, meaning that deploying it requires you to roll your own storage, with no provision for HA, BC, or maintainable support. It's not DIFFICULT, it just greatly expands the scope of administration. With regards to RDMA: have you actually benchmarked it both ways and are convinced it would solve some problem, or are you just chasing hero numbers? As for multipathing: yeah, that's a weakness in the implementation. I'd imagine it would be fixable.

that only 1 vendor product RSF-1 could offer zfs-over-iscsi with RDMA support at the moment.
This is the first time I'm hearing of them. I would love to see real-world reviews.
 
The problem with iSCSI is that it's a SPOF. Not sure which PVE admins you are talking about, but Ceph is far superior in flexibility, reliability and scale-out performance. RDMA isn't really necessary for anything below 100 Gbps these days, but iSER is generally not available on 'el cheapo' iSCSI NAS boxes (you generally also need an expensive fabric).
 
RDMA should be easy to implement in the qemu initiator, it's just iser:// instead of iscsi:// in the URL.


Direct LUN access through the kernel is not easy, because Proxmox is distributed, and currently, if you remove a LUN, resize it, or replace it (delete/add), the other nodes don't know about it. (So maybe a rescan/update of the specific LUN should be done at VM start, but that's not implemented currently.)
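A rough sketch of the kind of per-node rescan this would require; the helper below is hypothetical and not existing PVE code, it just wraps the standard open-iscsi rescan command:

Code:
# hypothetical helper, NOT existing PVE code: a kernel/open-iscsi based
# implementation would have to run something like this on every node at
# VM start, so that a LUN that was resized or re-created on the target
# is picked up before QEMU opens the block device
use PVE::Tools qw(run_command);

sub rescan_iscsi_sessions {
    # ask open-iscsi to rescan all logged-in sessions for new/changed LUNs
    run_command(
        ['iscsiadm', '-m', 'session', '--rescan'],
        errmsg => 'iscsi session rescan failed',
    );
}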

Also, it needs to scale when you have 1000~2000 LUNs exposed in the kernel.

Not sure if we get any benefit vs. bonding for zfs over iscsi until we have multi-master ZFS nodes. (If it's master-slave nodes, a simple VIP failover + bond should be enough.)

NVMe over TCP could be interesting in the future, as it is natively multipath, but qemu doesn't have a native implementation currently.
 
The problem with iSCSI is that it's a SPOF. Not sure which PVE admins you are talking about, but Ceph is far superior in flexibility, reliability and scale-out performance. RDMA isn't really necessary for anything below 100 Gbps these days, but iSER is generally not available on 'el cheapo' iSCSI NAS boxes (you generally also need an expensive fabric).
I don't comprehend what you guys are talking about. RSF-1 offers an HA zfs-over-iscsi solution with load balancing. See my storage comparison in the attached PDF. Ceph's latency is much higher than that of the direct iSCSI that zfs-over-iscsi uses. The RSF-1 HA zfs-over-iscsi solution can be installed and configured on commodity hardware in less than 30 minutes. RSF-1 itself is extremely cheap. In the PDF you can see that zfs-over-iscsi also beats Ceph in functionality.
The opening post already says that.
 
RDMA should be easy to implement in the qemu initiator, it's just iser:// instead of iscsi:// in the URL. [...]
Yeah, indeed, qemu uses libiscsi and therefore supports iSER. According to the qemu documentation, one has to add the iscsi option "transport=iser" to the qemu iscsi command. The disappointing thing is that one can't find any examples, experiences or feedback for this configuration on the internet, so I think it's not well supported. There is certainly more than one way to implement this, and each has its advantages and disadvantages. libiscsi is fully implemented in userspace and doesn't offer multipathing, so it's slower than open-iscsi, but maybe it has other advantages. Maybe someone in the Proxmox community could test such an iSER config with qemu/libiscsi and give feedback. Multipathing could be replaced with a LAG if needed.
 
The RSF-1 docs already specify how to set up Proxmox.

Also, RSF-1 is not "really" HA: it makes a backup to a second node every so often, and if one node fails you may be missing whatever your snapshot delta is. That may be fine for some data storage like archives, but it is rarely acceptable for a production VM backend.

Latency depends on your network and hardware. If you had synced writes to 3 independent nodes, not 15-minute replication windows, then you'd also have high latency; you basically need 3 independent iSCSI targets in RAID1 to make a fair comparison.
 
Yeah, indeed, qemu uses libiscsi and therefore supports iSER. [...] Maybe someone in the Proxmox community could test such an iSER config with qemu/libiscsi and give feedback.
https://patchew.org/Libvirt/2020051...at.com/20200515034606.5810-4-hhan@redhat.com/
https://github.com/qemu/qemu/commit/e0ae49871ae697b5d1a8853e79cbee35fda2145b

if you want to test:

Code:
diff --git a/src/PVE/Storage/ZFSPlugin.pm b/src/PVE/Storage/ZFSPlugin.pm
index 99d8c8f..9041d61 100644
--- a/src/PVE/Storage/ZFSPlugin.pm
+++ b/src/PVE/Storage/ZFSPlugin.pm
@@ -291,7 +291,7 @@ sub path {
     my $guid = $class->zfs_get_lu_name($scfg, $name);
     my $lun = $class->zfs_get_lun_number($scfg, $guid);
 
-    my $path = "iscsi://$portal/$target/$lun";
+    my $path = "iser://$portal/$target/$lun";
 
     return ($path, $vmid, $vtype);
 }
@@ -308,7 +308,7 @@ sub qemu_blockdev_options {
 
     return {
         driver => 'iscsi',
-        transport => 'tcp',
+        transport => 'iser',
         portal => "$scfg->{portal}",
         target => "$scfg->{target}",
         lun => int($lun),
 
In the PDF you can see that zfs-over-iscsi also beats Ceph in functionality.
I can tell you about my 2-node Solaris ZFS setup that died 15 years ago, with 3 days of downtime because the ZIL crashed on 2 disks at the same time, if you want ;)
I have never had a single downtime with Ceph since 2015.

What about rolling back to a snapshot 2 versions back on ZFS and creating new snapshot branches?
What about cloning a specific ZFS snapshot?
What about horizontal scalability when your storage is full?
What about when your ZFS filesystem is corrupted? (Do you have realtime replication to another ZFS filesystem, with transparent failover?)
What about migrating your full ZFS array to newer disks on another host?

Sure, you can't reach the same write latency currently (but a persistent local cache is possible with rbd),
but don't compare pears with apples.
 
I can tell you about my 2-node Solaris ZFS setup that died 15 years ago, with 3 days of downtime because the ZIL crashed on 2 disks at the same time, if you want
In production with a shared JBOD it's better to go with mirror3c.
What about rolling back to a snapshot 2 versions back on ZFS and creating new snapshot branches?
Explain!
What about horizontal scalability when your storage is full?
See the PDF. Scale up and out with a shared JBOD.
What about when your ZFS filesystem is corrupted? (Do you have realtime replication to another ZFS filesystem, with transparent failover?)
ZFS has built-in corruption repair (self-healing). For production it's better to have mirror3c.
What about migrating your full ZFS array to newer disks on another host?
Yes, of course you can do that.
 
Also, RSF-1 is not "really" HA: it makes a backup to a second node every so often, and if one node fails you may be missing whatever your snapshot delta is. [...]
Read the PDF or the RSF-1 documentation and stop writing bullshit. "Shared JBOD" is mentioned there. These trolls are everywhere.
 
Sure, you can't reach the same write latency currently (but a persistent local cache is possible with rbd),
but don't compare pears with apples.
Thanks for the hint. The persistent write cache is a good pointer. But you shouldn't forget that Ceph is a distributed filesystem, so there is always additional latency. And there are limitations in the use cases. There is also very little information on how to integrate this. Here one can read a good summary: https://static.opendev.org/docs/ope...2023.1/config-persistent-write-log-cache.html

The PDF will be updated for Ceph.
 
It took me a while to comprehend that the rbd persistent write and read cache is client-side only. So this is not the same as zfs-over-iscsi with a SLOG. If the Ceph client PVE host crashes, the locally cached data may not have been written to the Ceph cluster. When the VM restarts on a second PVE node, it would then work with older data.
This could be remedied if every PVE node were connected to a shared JBOD with RAID functionality (or similar), with a cluster filesystem like GFS2 on it for the rbd write cache. That of course adds additional latency, but it would speed up an HDD pool, for instance.
 
After looking into the Ceph cache-tiering feature, the PDF will be updated as well.
 
But you shouldn't forget that Ceph is a distributed filesystem, so there is always additional latency.
Technically, only CephFS is the filesystem, but I get what you want to point out. I haven't found any information on the RSF-1 webpage about how they ensure that ALL writes are consistent on the JBOD drives. Synchronous writes work via multiple JBOD SLOG devices, but asynchronous ones are not immediately written to disk, so acknowledged asynchronous writes can be lost if you pull the power on the machine. This is the main problem with every ZFS HA implementation I've ever seen, and I'm curious how they solved this, if at all. ZFS has its limitations, because it's a local filesystem and was never designed for any clustered setup. RSF-1 seems to use a similar system to this one on GitHub.

You will also have higher latencies in a failover case than with other storage solutions I've seen, e.g. enterprise SANs, in which the cache itself is also shared between the controllers. In the failover case, you need to import a crash-consistent pool and replay the changes from the SLOG devices. Have you tested this thoroughly?

I read in your PDF that you rate ZFS-over-iSCSI as faster than Ceph (fast as in throughput or as in IOPS?), which may be true with SAS/SATA, but not with NVMe. With preferred reads from local disks on Ceph, you will outperform the network, which is the bottleneck in any setup, especially with iSCSI when you don't have multiple 400G links. You'll saturate 25Gb with ONE NVMe OSD per node (25 Gbit/s is roughly 3 GB/s, which a single modern NVMe drive can deliver on its own). Are there multi-chassis/multi-port NVMe drives available yet? If not, NVMe Ceph will always easily outperform any SAS/SATA storage via iSCSI ... I know, apples and oranges, but NVMe Ceph is the norm nowadays.

You can also have 2/3 copies on Ceph if you use erasure-coded pools.

It would also be nice if you compared it to storage vendors providing all of these features (and more), like Blockbridge Storage for PVE.

The "filesystem" point in the PDF is also misleading, because zvols, ZFS-over-iSCSI and Ceph RBD are not filesystems; only ZFS datasets and XFS are filesystems. All the others are block storage, which is even better for virtualization, because you have fewer software layers between the storage and the actual guest VM.
 