Shared Remote ZFS Storage

Just because a vendor does it doesn’t mean it’s a good idea. Vendors sell some shady stuff, and it works - most of the time. Most VMware clusters, even relatively large ones, indeed have a storage design that basically resembles DRBD, mdraid or Proxmox ZFS+replication, and most vendors will use some or all of the same open source software. I found out recently that mdraid runs inside some expensive “proprietary” external RAID controllers that won’t take unbranded disks - and as a result, plain mdraid without the proprietary controller can recover those arrays with any disk, with some data loss, because the cache battery that made it go vroom was a faulty SPOF.

Most small to medium VMware clusters have a single point of failure at the storage layer. Even larger systems like vSAN will not guarantee data safety beyond the last snapshot/backup in case of a node outage, while ZFS or Ceph out of the box are safe even in case of disastrous failure. The vendors simply run into the same problems ZFS or Ceph do: “I’ve got NVMe and it goes no faster than my network link, why is this so slow?” Okay, so let’s remove the safety - and now it works at NVMe speeds, with sufficient hardware redundancy that it only catastrophically fails during power outages or disastrous node/software failures. Those are rare enough that most people will have clusters with 10 years or more of uptime - until they don’t, and lose minutes’ worth of data.
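To put a number on those “minutes’ worth of data”: with snapshot-based replication, the worst-case data-loss window (RPO) is roughly the replication interval plus the transfer time. A minimal sketch, with illustrative numbers only:

```python
# Worst-case data-loss window (RPO) for snapshot-based replication.
# The interval and transfer time below are hypothetical examples.

def worst_case_rpo(interval_s: float, transfer_s: float) -> float:
    """Writes made just after a snapshot is taken remain unreplicated
    until the *next* snapshot is taken and fully shipped to the replica."""
    return interval_s + transfer_s

# Replicating every 60 s with ~5 s of send time: up to ~65 s of writes
# can be lost if the primary dies at the wrong moment.
print(worst_case_rpo(60, 5))  # 65
```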

No vendor will support/repair data corruption though. They’ll say sorry and maybe give you a discount at renewal, or point you at the docs, which will say: do not use this for databases, or enable the ‘slow’ feature for important data. If you want to make ‘safe’ go ‘fast’, expect to pay accordingly - look up the cost of a pair of 200G Ethernet switches plus optics.
 
Just because a vendor does it, doesn’t mean it’s a good idea. Vendors sell some shady stuff and it works, most of the time.
I am really curious about this. Can you point to an "it" a specific vendor does that is not a good idea in your view? More pointedly: what would be an example of "shady stuff"?

Most VMware clusters, even relatively large ones indeed have a storage design that basically resembles DRBD, mdraid or Proxmox ZFS+replication
Interesting. I wasn't even aware that such solutions exist. Would you be so kind as to point me to a vendor that produces such a solution? I think you even pointed out earlier in the thread that traditional SAN is most common - so which is it?
I found out recently mdraid runs some expensive “proprietary” external RAID controllers that won’t take unbranded disks, and mdraid without the proprietary controller as a result can recover those with any disk, with some data loss because the cache battery that made it go vroom were a faulty SPOF.
Please share who. such information would be most invaluable to the forum.
even larger systems like vSAN will not guarantee data safety beyond the last snapshot/backup in case of node outage
Sure it can, if you're willing to sacrifice usable capacity. tinstaafl. Most operators are fine operating short periods degraded but you CAN have additional replication if you so choose.

while ZFS or Ceph out of the box are safe even in case of disastrous failure,
No more or less than any other solution utilizing the same mechanism at the same parity (replication, EC).

No vendor will support/repair data corruption though. They’ll say sorry and maybe give you a discount at renewal or point you at the docs that will say, do not use for databases or enable the ‘slow’ feature for important data. If you want to make ‘safe’ go ‘fast’ expect to pay according, look up the cost of a pair of 200G ethernet switches + optics.
Fast and slow are relative. Design the solution for the need, not the other way around. If you NEED 20 GB/s through a single link, then 200G switches are a prerequisite regardless of price; alternatively, why are you pricing them if you don't NEED them...
 
@alex:
- Dell 'managed' IT - they'll sell Kubernetes (RHEL OpenShift) with an Emulex FibreChannel SPOF proprietary RAID - Ceph is available upon request, but the per-TB licensing cost is (much) higher. I've gotten similar constructions from just about anyone trying to sell me a small cluster.

- Classic file storage like Synology (business line), Nexenta, TrueNAS and a few more - that's just BTRFS/ZFS with a jacket; the replicated storage is just a snapshot that is replicated every minute or so. There are a host of them sold as VMware and other hypervisor plug-in storage, which means they could lose up to a minute or more of data if using file-level storage. Where possible, there may be protocol-level replication (the S3 implementation/proxy replicates to both sides, for example), and most of the hypervisor plugins effectively create a software RAID1 solution between 2 iSCSI targets.

- Modern stuff like VAST, HPE GreenField, Dell ObjectScale and a few other "object storage" systems are just plain Intel boxes with RHEL/Debian "controllers" using Docker, Kubernetes or similar containerization, with some special sauce to manage existing open source software; protocols like NFS or iSCSI/NVMe-oF are bolted on top of that. Most object storage is eventually consistent - Dropbox, Netflix etc. all had to build workarounds for AWS's eventual consistency with tools like S3mper or Snitch - and we're running into that issue right now with one of those vendors, where an NFS write followed by a read does not provide consistent metadata between multiple clients. I don't know how much I can tell about a few of those, but they have a write cache on every node, meaning the node becomes a SPOF.
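The read-after-write problem described above can be sketched with a toy model of an eventually consistent store - purely illustrative, not any vendor's actual implementation:

```python
import time

class EventuallyConsistentStore:
    """Toy model of an eventually consistent object store: a write lands on
    one node immediately and propagates to the other node after a delay.
    Illustration only; real systems have far more complex replication."""

    def __init__(self, propagation_delay: float = 0.05):
        self.nodes = [{}, {}]
        self.delay = propagation_delay
        self.pending = []  # (apply_at, key, value, target_node)

    def put(self, key, value, node=0):
        self.nodes[node][key] = value  # visible locally right away
        self.pending.append((time.monotonic() + self.delay, key, value, 1 - node))

    def get(self, key, node=0):
        now = time.monotonic()
        still_pending = []
        for t, k, v, n in self.pending:  # apply replication that has "arrived"
            if t <= now:
                self.nodes[n][k] = v
            else:
                still_pending.append((t, k, v, n))
        self.pending = still_pending
        return self.nodes[node].get(key)

store = EventuallyConsistentStore()
store.put("meta", "v2", node=0)
print(store.get("meta", node=0))  # 'v2'  - the writer sees its own write
print(store.get("meta", node=1))  # None  - a second client still sees stale state
time.sleep(0.1)
print(store.get("meta", node=1))  # 'v2'  - eventually consistent
```

This is exactly the window in which "write then read from another client" returns stale metadata.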

Most of these proprietary vendors (including IBM, NetApp etc) will recommend unsafe practices to get the customer to attain the marketing numbers.

My point was: yes, it is possible, but you run into the same issues that people complain about here, hence why this topic exists - Ceph is too slow and expensive, I want something cheaper and faster. You can make Ceph run just as cheap and fast if you sacrifice the guarantee of your data, just like you can run ZFS with forced async. And then people will say "well, it worked on vSAN or with my existing SAN" - yes, because you don't understand the underpinnings of that storage. Your storage cannot magically go from point A to point B at higher speeds than the interface between A and B, or else it is lying to you. If your vSAN goes at NVMe speeds over 10-25G networking, something is seriously wrong.
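The interface argument is plain line-rate arithmetic. The NVMe figure below is an assumed typical PCIe 4.0 drive speed, not a measured one:

```python
# Line-rate ceiling of a network link vs. local NVMe throughput.
# Numbers are illustrative; protocol overhead is ignored.

def link_GBps(gbit: float) -> float:
    """Raw line rate of a link in GB/s (gigabits divided by 8)."""
    return gbit / 8

nvme_GBps = 7.0  # assumed sequential speed of a typical PCIe 4.0 NVMe SSD
for gbit in (10, 25, 100, 200):
    verdict = "slower" if link_GBps(gbit) < nvme_GBps else "faster"
    print(f"{gbit}G link: {link_GBps(gbit):.2f} GB/s ({verdict} than one NVMe drive)")
```

A 10G or 25G link caps out well below a single modern NVMe drive, so "NVMe speeds" over such a link imply the array is acknowledging writes before they are safely replicated.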
 
I think I understand your point, but I don't subscribe to the viewpoint that "if I can do it myself there is no justification for someone else to get paid."

There is nothing wrong, in my view, with a Nexenta or TrueNAS using ZFS internally; ZFS is tried and true and works well. When you buy a storage solution from these vendors, they provide a product that does a thing. They spent time building it, debugging it, continuing its support, and engineering next generations/improvements/etc. They also provide documentation and engineering staff to help their customers achieve their goals. If you don't want to pay for that and provide those services on your own - that is an option available to you. (I didn't mention Synology, QNAP, etc. as I'm not too familiar with their range of offerings in the enterprise, but it's safe to assume they would fall into this category as well.) Same goes for the larger, broader offerings (Dell, IBM, etc.): they operate at a different scale and cater to a larger type of shop that wants their services. The technologies employed are not particularly relevant as long as they meet their contractual obligations.

For ANY deployment, you have the choice of engineering a solution yourself or buying it from someone else. Capex or Opex, you'll pay one way or the other.
but you run into the same issues that people complain about here, hence why this topic exists
Oh, for sure. The difference is that one way you have to figure them out yourself, and the other is to have the vendor do so. I'm not advocating the superiority of either, just saying there are customers for both.
 
Why do you have external help? When an external audit comes (Grant Thornton, KPMG, etc.), the first thing they ask you is "what if you get hit by a bus" - who will support this and that. That is why you need to have external maintenance contracts.
 
@alex: I agree, but it is often sold as "Highly Available" - people have to be more precise with language about what they mean by HA (marketing vs reality).

But when people here say that something isn't fully HA because there is a gap, and then "why not, my proprietary stuff does it" - well, it doesn't, not really. Or that e.g. mdraid across iSCSI is a bad thing. It is - really bad, because there are some serious failure scenarios - but it is not uncommon in proprietary "HA" storage today.
 
people have to be more precise with language what they mean with HA (marketing vs reality).
Ah, this is a sore point for me. The ONUS of decision-making is on the purchaser. If the purchaser doesn't understand what he is buying (for any reason short of outright fraud), then the fault lies there too; NO technology is without edge cases, and 99.999% uptime still means over 5 minutes of outage per year ON AVERAGE. Marketing can IMPLY a lot, but anyone who has the responsibility to make the decisions HAS TO READ THE FINE PRINT TOO.
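The 5-minutes figure follows directly from the definition of availability; a quick sketch:

```python
# Allowed downtime per year for a given availability fraction.

def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * 365 * 24 * 60

print(round(downtime_minutes_per_year(0.99999), 2))  # 5.26  minutes ("five nines")
print(round(downtime_minutes_per_year(0.999), 1))    # 525.6 minutes (~8.8 hours, "three nines")
```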

but when people here say that something isn't fully HA because
Like Vincent told Mia, "they talk a lot, don't they..." Why should this be of any relevance? "People" don't have any requirement to know what they are talking about, or whether it's true 100% of the time, or at all. On the other hand, they're probably right within the framework of what they have experienced - so what?

It is, really bad because there are some serious failure scenarios, but it is not uncommon in proprietary "HA" storage 'today'.
Seems like the type of thing "people" say. It's not true in my experience in any meaningful ratio. HA ≠ infallible in any case, which is why any serious operator gets the best of the possible solutions available to them AND THEN deploys a multilayer backup/DR/BC strategy anyway.
 
Most of these proprietary vendors (including IBM, NetApp etc) will recommend unsafe practices to get the customer to attain the marketing numbers.
That is libelous nonsense. Unless you can share some names and show us the proof.
Disclosure: a NetApp employee, but I would have said this anyway.
 
That is libelous nonsense. Unless you can share some names and show us the proof.
Disclosure: a NetApp employee, but I would have said this anyway.
For data to be safe, it has to be able to survive 2 failures. There are virtually no systems that keep a 3-way replica of every piece of data. NetApp specifically: the write cache is redundant in (up to) 2 places, so when one controller is down (upgrades), the data is no longer redundant in cache. NetApp claims you can continue operations despite the fact that sh*t happens: data can get silently corrupted in cache, disks fail at inopportune times, upgrade windows stretch from minutes to hours.

Unless NetApp fixes that, you can’t compare it to the data safety of ZFS or Ceph where data is explicitly redundant at all times, however, as you know, that inherently introduces cost and latency and sacrifices.
 
> For data to be safe, it has to be able to survive 2 failures.

Wrong.

> There are virtually none that will have 3-way replica of every piece of data.

Maybe that's because that isn't necessary?
- 3-way replica isn't needed because arrays can reliably function with one controller.
- 2-way replica may be desirable in some cases (e.g. Elasticsearch, Splunk), but also works fine with a single healthy controller.

> NetApp specifically - write cache is redundant in (up to) 2 places, so when one controller is in downtime (upgrades) the data is no longer redundant in cache.

Wrong.

- When a controller goes down, the remaining one writes through and there's no write caching. That's both documented and possible to see in the logs.
- On one specific product line, E-Series arrays, it is possible to disable write caching on selected volumes, so that writes are always pass-through. The main reason for this feature isn't to mitigate risks of failures due to pending writes, but to optimize performance for heavy write workloads that disproportionately benefit from read caching.
- Both ONTAP and E-Series have battery-backed cache. All documentation truthfully states when your data is protected and when it's not (which some users prefer - in HPC, for example, re-running a compute job that can generate the same result every time may be preferred to spending $1 million more on storage for a given performance level).
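The durability difference being argued here - acknowledged writes sitting in volatile cache versus being durable before acknowledgement - can be modeled with a toy sketch (purely illustrative, not any vendor's actual code path):

```python
class Disk:
    """Stand-in for durable media."""
    def __init__(self):
        self.blocks = {}

class WriteBackCache:
    """Toy model: in write-back mode, acknowledged writes sit in volatile
    cache until flushed; a crash before the flush loses them. Write-through
    makes data durable before the write is acknowledged."""

    def __init__(self, disk: Disk, write_through: bool = False):
        self.disk = disk
        self.write_through = write_through
        self.dirty = {}

    def write(self, key, value):
        if self.write_through:
            self.disk.blocks[key] = value  # durable before we acknowledge
        else:
            self.dirty[key] = value        # acknowledged, but not yet durable

    def flush(self):
        self.disk.blocks.update(self.dirty)
        self.dirty.clear()

    def crash(self):
        self.dirty.clear()                 # volatile cache contents are gone

disk = Disk()
wb = WriteBackCache(disk)
wb.write("row42", "committed?")            # the application sees success here
wb.crash()
print(disk.blocks.get("row42"))            # None - the "committed" write is lost

disk2 = Disk()
wt = WriteBackCache(disk2, write_through=True)
wt.write("row42", "committed?")
wt.crash()
print(disk2.blocks.get("row42"))           # 'committed?' - survives the crash
```

Battery backup and controller mirroring exist precisely to shrink the write-back loss window shown above; the debate is about what happens when that redundancy is temporarily reduced.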

Instead of speculating, you could simply ask or read the docs. Instead, you insult and defame across the board.

FYI, "recommending unsafe practices" is a serious violation of internal rules of conduct in all major IT companies.
Example for NetApp:

> Be truthful and accurate in all reports, statements, certifications, bids, proposals, and claims.
Source: https://www.netapp.com/media/8089-codeofconduct.pdf

It can happen, but to say "most" employees do it is laughable. Make a complaint to NetApp and have the offender(s?) punished (or even fired).
The same goes for others (Dell, IBM and "even" Oracle).
 
Perhaps you should read the docs:
https://kb.netapp.com/on-prem/ontap/OHW/OHW-KBs/During_an_ONTAP_upgrade_is_there_a_risk_of_losing_uncommitted_data_if_an_NVRAM_failure_occurs#:~:text=or SnapMirror Synchronous)-,Answer,is typically about 10 seconds.

When a node is rebooting with new software, the surviving node manages all workloads. A failure of this second controller during the upgrade would lead to a complete cluster outage. Very few systems are sold with 3 independent controllers; at best you get 2 controllers in 2 chassis, and in most cases it's 2 controllers in 1 chassis - effectively you are multi-pathing to 1 SPOF.

Clearly you do not understand your own products - after all, nobody can, because they're not open source. The cache does not automatically disable itself, as the above KB clearly states. And while you CAN disable it, that doesn't mean anybody DOES; as the KB clearly indicates, even NetApp will not, because the system needs to keep performing at advertised values. Nobody scales a proprietary RAID system for 50% of its performance, or with sufficient spindles to allow for disabling write caching - you wouldn't sell them if the fine print said "during upgrades you'd have less than half your performance and potentially lose 10 seconds' to all of your data". 10 seconds of data loss is significant for databases.

Battery backup of a single cache still makes it a SPOF. I've seen the batteries die in RAID controllers (LSI in this case) - lithium batteries that swell up inside the chassis so they can no longer be safely removed, causing the entire chassis to require downtime.

My point is not that it's not valid to sell them; my point is that it's less safe than Ceph or ZFS when configured according to the recommended settings, and if you were to configure it the more-safe way, you'd get the same performance constraints. NetApp or IBM don't magically make the hardware go faster; they recommend less safe configurations. And the failure rate per TB is not a great value, but with today's 12 PB of data in 4U, it does start to compound - even the underlying hardware has sufficient silent read or write failures that it has become a concern when you have literally 10s to 100s of TBs of 'cache' in effectively RAID1.

There is a reason proprietary systems have gone the way of the dodo and all the hyperscalers have implemented software solutions that scale and replicate data multiple times across many datacenters: the failure rate on any particular byte has not gone down on hard drives (or SSDs, for that matter) in about 3 decades, yet we have grown from measuring in MBs to measuring in TBs - at which point a "1 byte per 10 TB read" failure rate is now 4 bytes per complete disk read, or even a potential 1 byte per second/minute/hour for a total system.
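Taking the post's "1 byte per 10 TB read" figure at face value (real drive specs vary and are usually quoted per bit), the scaling works out like this:

```python
# Expected unrecoverable errors per full read, using the post's
# illustrative "1 byte per 10 TB read" rate.

TB = 1e12  # bytes

def expected_errors(bytes_read: float, bytes_per_error: float = 10 * TB) -> float:
    return bytes_read / bytes_per_error

print(expected_errors(40 * TB))      # 4.0    errors per full read of a 40 TB disk
print(expected_errors(12_000 * TB))  # 1200.0 across a 12 PB system read once
```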
 
> Battery backup of a single cache still makes it a SPOF

No, it doesn't.
In your scenario, another, prior and unaddressed, failure is required to make NVRAM (and/or other component(s)) a potential single point of failure.

It doesn't make it a SPOF during normal cluster operation, which is what the HA claim describes.
When the paired controller is unavailable (being rebooted, upgraded, serviced, etc.) the cluster cannot withstand another failure, but that doesn't make the cluster "not HA".

> if you were to configure it in the more-safe, you'd get the same performance constraints.

This thread wasn't about performance constraints, but about "shady claims" and (NetApp recommending) "unsafe practices".

Now it turns out the only thing you can highlight is a public NetApp KB article that transparently informs users about product design and behavior.

As far as public claims go, there's no ambiguity: the workings of HA and controller mirroring are clearly documented.
https://docs.netapp.com/us-en/ontap/concepts/high-availability-pairs-concept.html

Users who care about 2 minutes of exposure during an upgrade commonly schedule upgrades in off hours and back up their data before each upgrade.
As far as recommending unsafe practices is concerned, in reality it's the opposite: users who absolutely can't risk losing any writes are recommended to use MetroCluster, and the way it works is clearly documented for various failure cases and scenarios.
https://docs.netapp.com/us-en/ontap...mirroring-work-in-metrocluster-configurations

Other enterprise vendors are similar. I don't know of any major vendor who makes false claims in their documentation.

> There is a reason proprietary systems have gone the way of the dodo and all the hyperscalers have implemented software solutions that scale and replicate data multiple times across many datacenters

LOL... Dude, I'm sorry, but I'm not interested in your rants.
(FYI, all the major hyperscalers have also implemented ONTAP, e.g. https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/how-it-works-fsx-ontap.html).

I just wanted to know what "unsafe" practices were recommended by NetApp. After several back-and-forths, it turns out - none.
 
Even though I believe @somedude, I also agree with @guruevi that there are shady vendors. Maybe shady is a little bit extreme; let me rephrase that to "vendors I would not trust with my data".

Just like NTFS is IMHO a shady FS.

For example, I only recently learned about https://graidtech.com/about-us because Hardwareluxx uses them in their new server. No fu**ing way would I run my server RAID based on some NVIDIA-Quadro-based RAID from a 4-person company.
 
Maybe shady is a little bit extreme, let me rephrase that to "vendors I would not trust with my data".
There are vendors, and there are vendors. NetApp is first tier. The fact that my wife's nephew put together a NAS using gum and baling wire doesn't make him of the same caliber. As for trusting your data... on-prem storage exists precisely so you don't have to. If you meant trust in their technology - that's a good start. Make sure you have a backup strategy - and that's true with a first tier solution too.

Just like NTFS is IMHO a shady FS.
...why? If it's just your "feeling" driving your opinion, I would humbly suggest you don't trust your own opinion. NTFS is among the most stable and mature filesystems available today.

For example, I only recently learned about https://graidtech.com/about-us because Hardwareluxx uses them in their new server. No fu**ing way would I run my server RAID based on some NVIDIA-Quadro-based RAID from a 4-person company.
Just because it's made of gum and baling wire doesn't mean it can't be made use of. I have no idea who Hardwareluxx is, or why they decided to use that solution - but maybe it works for them in their specific use case. You wouldn't run it - neither would I. But here is a live use case; it would be worth looking at as an example of where our "absolute" rules might be wrong.
 
NetApp is first tier.
Sure. But first tier providers also sometimes do not-so-smart stuff. That argumentum ad verecundiam is not that strong IMHO. Some would probably even argue that Synology is first tier - and look what they do with BTRFS ;)
Simply because it has no checksums.
maybe it works for them in their specific use case. You wouldn't run it - neither would I. But here is a live use case
Or maybe it is just not that great? Maybe it works for now, until it doesn't anymore? Or maybe they simply don't know better? Could be, they benchmark SEQ1M Q8T64 for a webserver. Or maybe that benchmark was just for show, I have no idea.
our "absolute" rules might be wrong.
These are just my personal rules, feel free to ignore them :)
But yeah, I only trust ZFS and maybe BTRFS.
 
But first tier providers also sometimes do not so smart stuff. That argumentum ad verecundiam is not that strong IMHO.
I would retort that calling a solution shady because it's imperfect is an irrelevant argument to begin with. The difference between "first tier" and not ISN'T that they are perfect; it's that they have the engineering capacity and support staff to identify, document, and resolve issues in a timely manner. Expecting a "perfect" solution would necessarily relegate ALL solutions to "shady" status, and the distinction loses any meaning.

Simply because it has no checksums.
Again, your assertion was that this makes NTFS "shady." It is interesting that, with that as your metric, you mentioned NTFS but ignored ALL OTHER filesystems that predate data checksumming - ext, XFS, HFS, etc. If you're like me, I bet you've used those without issue for decades. It's not a rational position to equate "missing features not required at the time" with "shady."
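For readers following along: what data checksumming actually buys is detection of silent corruption on read, which none of the pre-checksumming filesystems mentioned can do. A minimal ZFS-style sketch (illustrative only, not ZFS's actual on-disk format):

```python
import hashlib

class ChecksummedStore:
    """Minimal sketch of ZFS-style block checksumming: store a checksum
    alongside each block and verify it on every read."""

    def __init__(self):
        self.blocks = {}  # key -> (mutable block data, checksum of original)

    def write(self, key, data: bytes):
        self.blocks[key] = (bytearray(data), hashlib.sha256(data).digest())

    def read(self, key) -> bytes:
        data, checksum = self.blocks[key]
        if hashlib.sha256(bytes(data)).digest() != checksum:
            raise IOError(f"checksum mismatch on {key}: silent corruption detected")
        return bytes(data)

store = ChecksummedStore()
store.write("block0", b"important data")
store.blocks["block0"][0][3] ^= 0xFF  # simulate bitrot flipping bits "on disk"
try:
    store.read("block0")
except IOError as e:
    print(e)  # a non-checksumming FS would happily return the corrupt bytes
```

In real ZFS the checksum lives in the parent block pointer, which additionally allows self-healing from a redundant copy; the sketch only shows the detection half.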

Or maybe it is just not that great? Maybe it works for now, until it doesn't anymore? Or maybe they simply don't know better? Could be, they benchmark SEQ1M Q8T64 for a webserver. Or maybe that benchmark was just for show, I have no idea.
I won't argue - I've tested this solution in the past and found no redeeming qualities. All I was pointing out is that, by your own testimony, someone does find merit in it.

But yeah, I only trust ZFS and maybe BTRFS.
You do realize that ZFS was Sun's attempt to make a NetApp, and for many years people would freak out that NetApp would sue ZFS operators for infringement... There are other, more established filesystems and solutions than either ZFS or BTRFS. Speaking of the latter, BTRFS was considered green for many years, and I had not considered it for production use until maybe a year ago. ZFS had massive teething problems and performance regressions for years as well. It takes a LONG TIME to mature a modern filesystem. I would sooner trust a 30-year-old filesystem such as NTFS over a 5-year-old ZFS.
 
you mentioned NTFS but ignored ALL OTHER filesystems that predate data checksumming- EXT, XFS, HFS, etc.
I did not ignore them. I did not mention them, because NTFS was only an example. You are spot on, I also don't trust ext4, xfs....
You do realize that ZFS was Sun's attempt to make a NetAPP, and for many years people would freak out that Netapp would sue zfs operators for infringement...
I do, yes.
Speaking of the latter, btrfs has been considered green for many years and I have not considered it for production use until maybe a year ago.
That is why I wrote "maybe" for BTRFS. For me personally, it is not battle tested enough. (as we saw with the RAID issue).
ZFS had massive teething problems and performance regressions for years as well.
Sure. And performance is still a small issue IMHO. But performance is a totally different metric. If I want performance, I go with something like the linked GRAID.
I would sooner trust a 30 year old filesystem such as NTFS over a 5 year old ZFS.
Me too. Thankfully that is nowhere close to our actual real world.
 
I would sooner trust a 30 year old filesystem such as NTFS over a 5 year old ZFS.
Luckily, both are not five years old: the first version of ZFS was released in 2006, and development started in 2001 if I recall correctly.

In contrast, the BSD filesystems as well as XFS or ext2/ext3/ext4 are basically decades old, but they lack checksumming/bitrot detection. So if that feature is important to you, the age of the filesystem doesn't help very much. It's even worse with non-journaling filesystems (like the older BSD ones and ext2): they are definitively mature, but the duration of their fsck run makes them unsuitable for most workloads today IMHO. Again, their maturity doesn't help you much in terms of trust. On the other hand, in my experience an ext3 or ext4 can often be recovered after a crash, which might not be as easily possible with more advanced filesystems. So pick your poison ;)
Personally, I still use ext4 on my notebook and Linux VMs, and ZFS on my Proxmox nodes. I'm considering migrating my notebook to BTRFS at some point though (for a single-disk install of Debian I don't want to use an out-of-tree FS like ZFS). And XFS is said to be more performant for large databases, so that's a possible option for my PBS VMs/vserver or SQL VMs.
 
Yeah, the new kid on the block is bcachefs, and I've been testing it for a few months. It has some really neat features, but similar to ZFS it is not in the kernel.