Shared Remote ZFS Storage

Just because a vendor does it doesn't mean it's a good idea. Vendors sell some shady stuff, and it works, most of the time. Most VMware clusters, even relatively large ones, have a storage design that basically resembles DRBD, mdraid, or Proxmox ZFS+replication, and most vendors will use some or all of the same open source software. I recently found out that mdraid runs inside some expensive "proprietary" external RAID controllers that won't accept unbranded disks; without the proprietary controller, mdraid can recover those arrays with any disk, albeit with some data loss, because the cache battery that made it go vroom was a faulty SPOF.

Most small to medium VMware clusters have a single point of failure at the storage layer. Even larger systems like vSAN will not guarantee data safety beyond the last snapshot/backup in case of a node outage, while ZFS or Ceph out of the box are safe even in case of disastrous failure. The vendors simply run into the same problems ZFS or Ceph do: "I've got NVMe and it goes no faster than my network link, why is this so slow?" Okay, so let's remove the safety, and now it works at NVMe speeds, with sufficient hardware redundancy that it only catastrophically fails during power outages or disastrous node/software failures. Those are rare enough that most people will have clusters with 10 years or more of uptime, until they don't, and lose minutes' worth of data.
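The "remove the safety" knob described above has a direct ZFS equivalent. A hedged sketch, assuming a hypothetical dataset named tank/vmstore; with sync=disabled, ZFS acknowledges synchronous writes from RAM before they reach stable storage, which is exactly the seconds-to-minutes-of-data trade being made:

```shell
# Hypothetical pool/dataset name; run as root on a ZFS host.
# Default: synchronous writes are committed to the ZIL before being acknowledged.
zfs get sync tank/vmstore

# The "go fast" knob: acknowledge sync writes before they hit stable storage.
# NVMe-like speeds, but a power loss can drop recently acknowledged writes.
zfs set sync=disabled tank/vmstore

# The "slow but safe" knob for important data: treat every write as synchronous.
zfs set sync=always tank/vmstore
```

This is the same trade the proprietary boxes make internally; ZFS just makes you flip the switch yourself.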

No vendor will support/repair data corruption, though. They'll say sorry and maybe give you a discount at renewal, or point you at the docs, which will say: do not use for databases, or enable the 'slow' feature for important data. If you want to make 'safe' go 'fast', expect to pay accordingly; look up the cost of a pair of 200G Ethernet switches plus optics.
 
Just because a vendor does it doesn't mean it's a good idea. Vendors sell some shady stuff, and it works, most of the time.
I am really curious about this. Can you point to an "it" a specific vendor does that is not a good idea in your view? More pointedly, what would be an example of "shady stuff"?

Most VMware clusters, even relatively large ones, have a storage design that basically resembles DRBD, mdraid, or Proxmox ZFS+replication
Interesting. I wasn't even aware that such solutions exist. Would you be so kind as to point me to a vendor that produces such a solution? I think you even pointed out earlier in the thread that traditional SAN is most common, so which is it?
I recently found out that mdraid runs inside some expensive "proprietary" external RAID controllers that won't accept unbranded disks; without the proprietary controller, mdraid can recover those arrays with any disk, albeit with some data loss, because the cache battery that made it go vroom was a faulty SPOF.
Please share who; such information would be most invaluable to the forum.
even larger systems like vSAN will not guarantee data safety beyond the last snapshot/backup in case of node outage
Sure it can, if you're willing to sacrifice usable capacity. TINSTAAFL. Most operators are fine operating degraded for short periods, but you CAN have additional replication if you so choose.

while ZFS or Ceph out of the box are safe even in case of disastrous failure,
No more or less than any other solution utilizing the same mechanism at the same parity level (replication, EC).

No vendor will support/repair data corruption, though. They'll say sorry and maybe give you a discount at renewal, or point you at the docs, which will say: do not use for databases, or enable the 'slow' feature for important data. If you want to make 'safe' go 'fast', expect to pay accordingly; look up the cost of a pair of 200G Ethernet switches plus optics.
Fast and slow are relative. Design the solution for the need, not the other way around. If you NEED 20GB/s through a single link, then 200G switches are a prerequisite regardless of price; alternatively, why are you pricing them if you don't NEED them?
 
@alex:
- Dell 'managed' IT: they'll sell Kubernetes (RHEL OpenShift) with an Emulex FibreChannel SPOF proprietary RAID. Ceph is available upon request, but the per-TB licensing cost is (much) higher. I've gotten similar constructions from just about anyone trying to sell me a small cluster.

- Classic file storage like Synology (business line), Nexenta, TrueNAS and a few more: that's just BTRFS/ZFS with a jacket, and the replicated storage is just a snapshot that is replicated every minute or so. There are a host of them sold as VMware and other hypervisor plugin storage, which means they could lose up to a minute or more of data if using file-level storage. Where possible there may be protocol-level replication (the S3 implementation/proxy replicates to both sides, for example), and most of the hypervisor plugins effectively create a software RAID1 solution between two iSCSI targets.
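The snapshot-shipping design these products wrap in a GUI can be sketched in a few lines. The names here (dataset tank/vm, replication peer nas2) are hypothetical; the essential point is that anything written after the last shipped snapshot is lost on failover:

```shell
# Hypothetical names: local dataset tank/vm, replication peer host nas2.
# Ship an incremental snapshot; run from cron every minute for "near-sync" replication.
prev=$(zfs list -t snapshot -o name -s creation -H tank/vm | tail -1)
now="tank/vm@rep-$(date +%Y%m%d%H%M%S)"
zfs snapshot "$now"
zfs send -i "$prev" "$now" | ssh nas2 zfs receive -F tank/vm
# Failing over to nas2 rolls back to $now: up to a minute of writes is gone.
```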

- Modern stuff like VAST, HPE GreenField, Dell ObjectScale and a few other "object storage" systems are just plain Intel boxes with RHEL/Debian "controllers", using Docker, Kubernetes or similar containerization with some special sauce to manage existing open source software; protocols like NFS or iSCSI/NVMe-oF are bolted on top of that. Most object storage is eventually consistent: Dropbox, Netflix etc. all had to work around this on AWS with stuff like S3mper or Snitch, and we're running into that issue right now with one of those vendors, where an NFS write followed by a read does not provide consistent metadata between multiple clients. I don't know how much I can tell about a few of those, but they have a write cache on every node, meaning the node becomes a SPOF.

Most of these proprietary vendors (including IBM, NetApp, etc.) will recommend unsafe practices to get the customer to attain the marketing numbers.

My point was: yes, it is possible, but you run into the same issues that people complain about here, hence why this topic exists. Ceph is too slow and expensive, I want something cheaper and faster. You can make Ceph run just as cheap and fast if you sacrifice the guarantee of your data, just like you can run ZFS with forced async. And then people will say "well, it worked on vSAN or with my existing SAN". Yes, because you don't understand the underpinnings of that storage. Your storage cannot magically go from point A to point B faster than the interface between A and B, or else it is lying to you. If your vSAN goes at NVMe speeds over 10-25G networking, something is seriously wrong.
 
I think I understand your point, but I don't subscribe to the viewpoint that "if I can do it myself, there is no justification for someone else to get paid."

There is nothing wrong, in my view, with a Nexenta or TrueNAS using ZFS internally; ZFS is tried and true and works well. When you buy a storage solution from these vendors, they provide a product that does a thing. They spent time building it, debugging it, continuing its support, and engineering the next generation/improvements/etc. They also provide documentation and engineering staff to help their customers achieve their goals. If you don't want to pay for that and would rather provide those services on your own, that is an option available to you. (I didn't mention Synology, QNAP, etc., as I'm not too familiar with their range of enterprise offerings, but it's safe to assume they would fall into this category as well.) The same goes for the larger, broader offerings (Dell, IBM, etc.): they operate at a different scale and cater to a larger type of shop that wants their services. The technologies employed are not particularly relevant as long as they meet their contractual obligations.

For ANY deployment, you have the choice of engineering a solution yourself or buying it from someone else. CapEx or OpEx, you'll pay one way or the other.
but you run into the same issues that people complain about here, hence why this topic exists
Oh, for sure. The difference is that one way you have to figure them out yourself, and the other way the vendor does so for you. I'm not advocating the superiority of either, just saying there are customers for both.
 
Why do you need external help? When the external auditors come (Grant Thornton, KPMG, etc.), the first thing they ask is "what if you get hit by a bus?" Who will support this and that? That is why you need to have external maintenance contracts.
 
@alex: I agree, but it is often sold as "Highly Available". People have to be more precise with language about what they mean by HA (marketing vs. reality).

But when people here say that something isn't fully HA because there is a gap, and then ask "why not, my proprietary stuff does it": well, it doesn't, not really. Or they say that e.g. mdraid across iSCSI is a bad thing. It is, really bad, because there are some serious failure scenarios, but it is not uncommon in proprietary "HA" storage today.
 
people have to be more precise with language what they mean with HA (marketing vs reality).
Ah, this is a sore point for me. The ONUS of decision-making is on the purchaser. If the purchaser doesn't understand what he is buying (for any reason short of outright fraud), then the fault lies there too. NO technology is without edge cases, and 99.999% uptime still means over 5 minutes of outage per year ON AVERAGE. Marketing can IMPLY a lot, but anyone who has the responsibility to make the decisions HAS TO READ THE FINE PRINT TOO.
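The five-nines arithmetic is easy to check:

```shell
# Allowed downtime per year at 99.999% availability, in minutes.
awk 'BEGIN { printf "%.2f\n", 365.25 * 24 * 60 * (1 - 0.99999) }'
# prints 5.26
```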

but when people here say that something isn't fully HA because
Like Vincent told Mia, "they talk a lot, don't they..." Why should this be of any relevance? "People" don't have any requirement to know what they are talking about, or whether it's true 100% of the time, or at all. On the other hand, they're probably right within the framework of what they have experienced. So what?

It is, really bad because there are some serious failure scenarios, but it is not uncommon in proprietary "HA" storage 'today'.
Seems like the type of thing "people" say. It's not true in my experience in any meaningful ratio. HA ≠ infallible in any case, which is why any serious operator gets the best of the possible solutions available to them AND THEN deploys a multilayer backup/DR/BC strategy anyway.