!! Voting for the feature request for ZFS-over-iSCSI storage !!

I haven't found any information on the RSF-1 webpage about how ALL writes are kept consistent on the JBOD drives. Synchronous writes work via multiple JBOD SLOG devices, but asynchronous ones are not immediately written to disk, so acknowledged asynchronous writes can be lost if you pull the power on the machine. This is the main problem with every ZFS HA implementation I've seen, and I'm curious how they solved it, if at all.
Async writes are a function of the filesystem, not of RSF-1. RSF-1 adds load balancing and HA on top of ZFS; ZFS does the filesystem work. But you didn't comprehend what I mean by async writes: async writes are never persistent, on any filesystem, Ceph included. That's why ZFS offers a SLOG for sync writes. And of course the SLOG is HA-compatible.
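To illustrate the distinction in plain terms (a minimal sketch, not specific to ZFS or RSF-1; the file path is an arbitrary example): a successful write only means the data reached the in-memory cache, and it only becomes persistent once it is explicitly flushed.

use strict;
use warnings;
use IO::Handle;

# Minimal illustration: a successful print/write only means the data reached
# the OS page cache (the ARC on ZFS); it is not on stable storage until
# fsync() returns. The file path is an arbitrary example.
open my $fh, '>', '/tmp/async-demo.dat' or die "open: $!";
print {$fh} "acknowledged, but not yet persistent\n";  # async: lost if power is cut now
$fh->flush or die "flush: $!";
$fh->sync  or die "fsync: $!";   # sync point: on ZFS this is what goes through the ZIL/SLOG
close $fh  or die "close: $!";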
You will also have higher failover latencies than with any other storage solution I've seen, e.g. enterprise SANs, where the cache itself is also shared between controllers.
Failover time was also of interest to me. RSF-1's own statement is a maximum of 16 s; my tests with a shared-nothing RSF-1 cluster showed about 10 s, though without load. So RSF-1 is slower than a hardware vendor's solution, but the usual filesystem timeout is 30 s, so it stays within that limit.
I read in your PDF that you rate ZFS-over-iSCSI as faster than Ceph (faster in throughput or in IOPS?), which may be true with SAS/SATA, but not with NVMe. With preferred local reads configured in Ceph, you will outperform the network, which is the bottleneck in any setup, especially with iSCSI and without multiple 400G links.

With Ceph on NVMe and local read tuning you can of course be faster than the network, but you need expensive NVMe for the whole storage. Write performance, however, is always slower because of the Ceph design: every write has to be acknowledged by every Ceph node. You could configure a local persistent write cache, but to make it highly available you need technology Ceph doesn't offer, or additional IT know-how to implement it, and with an HDD pool you get no read cache. With ZFS-over-iSCSI you can have both a write cache and a read cache even for an HDD pool, and that at full speed over iSCSI, possibly with RDMA if that works with libiscsi. Even if you compare NVMe Ceph with a highly available write cache and read tuning against ZFS-over-iSCSI with an NVMe read/write cache, both will perform about the same, but in functionality ZFS-over-iSCSI still beats Ceph. See block size and deduplication.
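As an example of the per-volume tuning meant by "block size and deduplication" (a hedged sketch; the pool name, volume name and 32G size are placeholders, not values from this thread), a zvol backing a ZFS-over-iSCSI LUN can get its own block size and dedup setting at creation time:

use strict;
use warnings;

# Hedged sketch: create a zvol for a ZFS-over-iSCSI LUN with an explicit
# block size and deduplication enabled. Names and size are placeholders.
my @cmd = ('zfs', 'create',
    '-V', '32G',
    '-o', 'volblocksize=16k',   # match the guest/application I/O size
    '-o', 'dedup=on',           # per-dataset deduplication
    'tank/vm-101-disk-0');
system(@cmd) == 0 or die "zfs create failed: $?\n";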
Are there multi-chassis/multi-port NVMe available yet?
Yes, but expensive

NVMe CEPH is the norm nowadays
NVMe is the norm for Ceph, but Ceph on NVMe is not the norm for storage in general.
You can also have 2/3 copies on ceph, if you use erasure coded pools.
This is stupid with 3 Ceph nodes, because if a replicated node fails you only have 1 copy of the data left.
 
Following the hint from @spirit about the RDMA hack: I searched for the file on PVE and found it at /usr/share/perl5/PVE/Storage/ZFSPlugin.pm. There you only have to change the line beginning with "transport" to:

transport => 'iser',

Now every newly created vdisk uses RDMA for iSCSI. Really nice. Please add this as an option to the zfs-over-iscsi plugin, PVE developers!
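For context, a hedged sketch (explicitly not the real ZFSPlugin.pm code) of what the transport setting amounts to: the plugin hands QEMU/libiscsi a disk URI, and with 'iser' that URI uses the iSER (RDMA) transport instead of plain TCP, assuming your libiscsi build was compiled with iSER support. Portal, target and LUN below are placeholders.

use strict;
use warnings;

# Hypothetical sketch, not the actual plugin code: how the transport choice
# could end up in the disk URI that QEMU's libiscsi backend receives.
my $transport = 'iser';                            # 'iscsi' = plain TCP, 'iser' = RDMA
my $portal    = '192.168.1.10';                    # placeholder portal address
my $target    = 'iqn.2010-08.org.example:storage'; # placeholder target IQN
my $lun       = 0;
print "$transport://$portal/$target/$lun\n";       # e.g. iser://192.168.1.10/iqn.../0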

Here you can find a performance comparison of libiscsi iSCSI/TCP vs. iSCSI/RDMA: https://www.snia.org/sites/default/files/SDC/2016/presentations/storage_networking/Shterman-Grimberg_Greenberg_Performance Implications Libiscsi_ RDMA_V6.pdf

This is a huge benefit.
 
But a shared ZFS storage backend (not unique to RSF) is still a SPOF. I thought you were talking about HA storage. Stop moving the goalposts to make comparisons. If we are talking about a 3-way HA solution, the consistency latency is >1-15 minutes, which is ridiculously long compared to the micro-to-milliseconds of Ceph or other scale-out storage, unless, as I said, you emulate a 3-way RAID1 over the network, which would triple your latency and be worse than Ceph too.

If the argument is cost, then perhaps, but I’ve noticed the maintenance and licensing costs are well beyond the cost of the physical storage.

@floh: that presentation is almost 10 years old; it considers 100k IOPS "high end". RDMA makes sense on very high-end systems. Back then that was 25/40G; today 100G is the "standard" in new datacenters, and modern CPUs and cheap NICs can push those rates without RDMA. RDMA has always had a niche, but only if your current network/disk/CPU designs are being challenged (today RDMA is used primarily on dual 200, 400 or even 800G links to feed 8-way or larger GPU systems; iSCSI is too complex there, and NVMe-oF has RDMA).
 
  • Like
Reactions: Johannes S
But a shared ZFS storage backend (not unique to RSF) is still a SPOF. I thought you were talking about HA storage.
A shared JBOD with 2 controllers is HA. The only non-redundant part that could fail is the backplane, but that is extremely rare because it is a passive component. Every dual-controller storage has the same design. Would you say NetApp is not HA?

If we are talking about a 3-way HA solution, the consistency latency is >1-15 minutes, which is ridiculously long compared to the micro-to-milliseconds of Ceph or other scale-out storage, unless, as I said, you emulate a 3-way RAID1 over the network, which would triple your latency and be worse than Ceph too.
You don't comprehend the difference in the data path between a standard dual-node storage and a distributed storage system. My advice: ask the internet or ChatGPT to get it. Not JohannesS.


that presentation is almost 10 years old; it considers 100k IOPS "high end". RDMA makes sense on very high-end systems. Back then that was 25/40G; today 100G is the "standard" in new datacenters, and modern CPUs and cheap NICs can push those rates without RDMA. RDMA has always had a niche, but only if your current network/disk/CPU designs are being challenged (today RDMA is used primarily on dual 200, 400 or even 800G links to feed 8-way or larger GPU systems; iSCSI is too complex there, and NVMe-oF has RDMA).
1. That this PDF is old is not relevant. It only shows the relative speedup from iSCSI/TCP to iSCSI/RDMA; the exact numbers always depend on the hardware you use and are irrelevant for this point.
2. You don't comprehend the difference between bandwidth and latency. RDMA lowers the latency. A network card with 100 Gb/s has more bandwidth than one with 40.
Therefore one should also pay attention to the latency of a storage switch, not only the price.
 
A shared JBOD with 2 controllers is HA. The only non-redundant part that could fail is the backplane, but that is extremely rare because it is a passive component. Every dual-controller storage has the same design. Would you say NetApp is not HA?

The entire chassis is not redundant; the backplane electrically is likely going to be redundant: if you use SAS drives, you can reach the disk through "two channels". Backplanes are not passive components; they have controllers and, depending on the model, likely a management chip if not a switch. However, any failure in any component brings down the entire thing. Yes, I've seen backplane failures; in that case the entire row of 4 disks went down, and there are backplanes with 24 drives. I've seen power failures pull down the redundant power supply, and I've seen 1 bad disk pull down the entire bus (it was just sending garbage signals causing continuous bus resets). Redundancy in network/controller/power is for external issues and maintenance reasons, not for internal issues.

A single NetApp disk box is not redundant/HA; plenty of VMware clusters keel over because of a single failure, like a firmware bug. NetApp does offer scale-out storage solutions that are similar to Ceph. They have expensive Fibre Channel network solutions with redundant FC fabric switches for that reason, and even "metro" sync, which is not something you get "for free".

You don't comprehend the difference in the data path between a standard dual-node storage and a distributed storage system. My advice: ask the internet or ChatGPT to get it. Not JohannesS.

Dual-node storage and redundant controllers are a huge difference, an entire chassis worth of difference. Distributed storage is an evolution of that early-2000s design to reduce the issues of having a SPOF. That is why Ceph even has rack-distribution and power-distribution awareness, because, you know, racks catch fire, which means fire-sprinkler damage, or someone does something stupid in them.

1. That this PDF is old is not relevant. It only shows the relative speedup from iSCSI/TCP to iSCSI/RDMA; the exact numbers always depend on the hardware you use and are irrelevant for this point.
2. You don't comprehend the difference between bandwidth and latency. RDMA lowers the latency. A network card with 100 Gb/s has more bandwidth than one with 40.
Therefore one should also pay attention to the latency of a storage switch, not only the price.
I do comprehend the difference between latency and bandwidth. My point is that with modern hardware, the tricks we used 10 years ago to get better latency are a moot point: you get better latency today on a quality 100G Ethernet fabric than you got with a "low latency" fabric like InfiniBand a decade ago. There are modern "RDMA optimized" fabrics (we are currently testing an Arista fabric), but they are not what you think of when you say "switch"; they are basically ASICs that emulate just enough of the Ethernet fabric to move packets, the concepts of TCP/IP and even ARP/collision domains lose all meaning, and those switches understand RDMA natively. With the NVIDIA DPUs they are effectively PCIe switches rather than Ethernet switches.

As far as iSCSI/iSER goes, the RDMA portion only applies to the data transfer, not the "command protocol", which is still "the same" (and thus has the same latency) as regular iSCSI. So first-byte latency is the same; latency drops when you can actually use RDMA to do bulk transfers. But again, Ceph/NVMe-oF on 100G with large MTU sizes will beat iSCSI "today" when talking about the same redundancy guarantees.
 
So JohannesS forgot to like the last post.
What's your goal, guruevi? This is not a thread to discuss which storage solution has more availability, whether 99.99% or 99.999%. It's time for the Ceph fanboys to grow up, be a man, and accept the truth that ZFS-over-iSCSI, just like Ceph, doesn't fit every use case, but ZFS-over-iSCSI beats Ceph in functionality.
 
but ZFS-over-iSCSI beats Ceph in functionality.
That's an interesting take. For someone who derides others for being fanboys, that statement shows an astounding lack of self-awareness.

Ceph is a scale-out filesystem with multiple API ingress points. ZFS is a traditional filesystem and not multi-initiator aware. The fact that you CAN kludge together a method to facilitate that doesn't make it of the same scope and focus. I have no interest in engaging on the finer merits of why YOU prefer that method, just pointing it out.
 
The entire chassis is not redundant; the backplane electrically is likely going to be redundant: if you use SAS drives, you can reach the disk through "two channels". Backplanes are not passive components; they have controllers and, depending on the model, likely a management chip if not a switch. However, any failure in any component brings down the entire thing. Yes, I've seen backplane failures; in that case the entire row of 4 disks went down, and there are backplanes with 24 drives. I've seen power failures pull down the redundant power supply, and I've seen 1 bad disk pull down the entire bus (it was just sending garbage signals causing continuous bus resets). Redundancy in network/controller/power is for external issues and maintenance reasons, not for internal issues.
This argument is missing a description of a highly available JBOD design. To protect against a shared JBOD going offline completely, one can of course build a multi-JBOD solution. That means you use a minimum of 2 JBODs and create a striped-mirror zpool spread over both chassis, so 1 JBOD can go offline completely without interruption. This can also save money, because you can go with only one controller per chassis and only need single-port devices. In other words, you could build a design similar to Ceph but get more features, like a read cache for HDD pools, etc.
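A hedged sketch of that layout (device paths and the pool name are placeholders): each mirror vdev pairs one disk from JBOD A with one from JBOD B, so either enclosure can go offline without taking the pool down.

use strict;
use warnings;

# Hypothetical sketch of the striped-mirror layout described above.
# Each mirror vdev pairs one disk from JBOD A with one from JBOD B.
# Device paths and the pool name are placeholders; use /dev/disk/by-id paths.
my @jbod_a = qw(/dev/disk/by-id/jbodA-disk1 /dev/disk/by-id/jbodA-disk2);
my @jbod_b = qw(/dev/disk/by-id/jbodB-disk1 /dev/disk/by-id/jbodB-disk2);

my @vdevs;
for my $i (0 .. $#jbod_a) {
    push @vdevs, 'mirror', $jbod_a[$i], $jbod_b[$i];
}
system('zpool', 'create', 'tank', @vdevs) == 0
    or die "zpool create failed: $?\n";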
 
But you didn't comprehend what I mean by async writes.
Yes, we're not talking about the same thing. It's not ONLY the filesystem, it's also the backend storage (where there is no filesystem), and that's what I'm talking about. Unless you have a SHARED write cache, as good enterprise SAN storage solutions do, with a fast bus and cache-consistent writes across multiple nodes, you will have the same drawbacks in ZFS HA as you would in Ceph if you want consistency.

Imagine this real-world scenario:

PVE writes a block over ZFS-over-iSCSI to the storage, which acknowledges the write, so the client thinks the data has been written correctly. On the storage side, ZFS takes the write as an async write and stores it in its ARC to be committed later. If the node fails before the data has reached the disk, the data is lost and the data on disk is inconsistent with the state the application thinks the data is in, because the ZFS write cache (ARC) is not synchronously written to the other node. If RSF-1 has a solution for that, please show me the documentation and/or the code; that specifically would be really interesting. This is a VERY hard problem in computer science, so I explicitly ask for its solution.
If, on the other hand, the ZFS storage takes it as a sync write, it will be written to the SLOG and consistency is ensured, but you will be limited by the SLOG speed, just as you are in Ceph. Ceph needs to write the data 3 times; your SLOG only has to write it once, so Ceph will be slower on writes. In the async case, Ceph also writes it directly as a sync write, so you will not have the consistency problem there, but everything will be "consistently slow" all the time. Any enterprise storage I know writes it to the write cache of the storage (replicated or accessible from both nodes) and backed by batteries or big capacitors, so that the data stays consistent through any normal interruption like a power outage or a controller failure.
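For reference, a hedged sketch of how the storage side can be forced onto that sync-write path (the dataset name is a placeholder): setting sync=always on the zvol that backs the iSCSI LUN makes every acknowledged write go through the ZIL/SLOG, trading latency for consistency.

use strict;
use warnings;

# Hedged sketch: force synchronous semantics on the zvol backing the
# ZFS-over-iSCSI LUN so every acknowledged write hits the SLOG.
# The dataset name is a placeholder.
my $zvol = 'tank/vm-100-disk-0';
system('zfs', 'set', 'sync=always', $zvol) == 0
    or die "zfs set sync=always failed: $?\n";
system('zfs', 'get', 'sync', $zvol) == 0       # verify the property took effect
    or die "zfs get sync failed: $?\n";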

I would really love to hear your arguments (documentation, code) on how this is solved. Without that, I cannot endorse even using this.


In my experience with customer projects, the big difference between ZFS-over-iSCSI (again, I love this technology too) and Ceph is availability and scalability, and for all the customers I know that is the bigger selling point. If you have a power surge in your two-node ZFS HA setup and it shreds e.g. both SLOGs, or the backplane, or corrupts your SAS bus, or causes any other problem others have already mentioned, you have your SPOF. And that is besides the cache-coherence issue. It will also be limited by the available and supported cable lengths from the chassis to the compute nodes, and due to its non-fiber nature it will not be galvanically isolated or possible to locate in different fire compartments; even with multi-JBOD setups, the connection will most probably always be non-fiber. In Ceph, all of that is possible and you can lose whole nodes without interrupting anything. Not having a SPOF is a hard requirement for most projects, so, sadly, ZFS-over-iSCSI is not a solution for those projects.

In the end I think that comparing ZFS-over-iSCSI with Ceph is like comparing apples and oranges. You will always find features where one is better than the other, but it has to fit your requirements. If you only compare single-rack solutions with each other, and the price point is right and the cache-coherent-write problem does not exist, you will have a working alternative to a single HA storage like a Fujitsu DX60/100 or a Dell ME5024. Do you have a quote for the cost of a ZFS HA system of, say, 20 TB net in 12G dual-port SAS drives? How does it compare to something like Blockbridge's PVE storage?


but you need expensive NVMe for the whole storage.
Maybe relevant for the home labber, but in the enterprise realm NVMe is cheaper than 12 Gb SAS SSDs if you consider the performance difference. Most enterprise storage systems nowadays also provide LUNs via NVMe-over-Fabrics, so iSCSI is too slow to keep up.


This is stupid with 3 Ceph nodes, because if a replicated node fails you only have 1 copy of the data left.
Of course it's stupid, but you can still do it. Some will argue that a 3-node Ceph cluster is already stupid, because it cannot self-heal. In ZFS you would want at least copies=2 set to do the same. You can also make the same "only one copy left" argument for any RAID5/RAIDZ1. It depends on what you want.
I just ask to add it to the summary, because it can also be done with Ceph.
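For completeness, a hedged sketch of the ZFS side of that comparison (the dataset name is a placeholder): copies=2 stores every newly written block twice within the pool, which protects against bad blocks but, unlike Ceph replication, not against losing the whole pool or node.

use strict;
use warnings;

# Hedged sketch: keep two copies of every block of this dataset inside the
# same pool. The dataset name is a placeholder; the property only applies
# to data written after it is set.
my $dataset = 'tank/important-data';
system('zfs', 'set', 'copies=2', $dataset) == 0
    or die "zfs set copies=2 failed: $?\n";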