Shared Remote ZFS Storage

Just because a vendor does it doesn't mean it's a good idea. Vendors sell some shady stuff, and it works, most of the time. Most VMware clusters, even relatively large ones, have a storage design that basically resembles DRBD, mdraid, or Proxmox ZFS+replication, and most vendors will use some or all of the same open source software. I recently found out that mdraid runs inside some expensive "proprietary" external RAID controllers that won't accept unbranded disks; without the proprietary controller, mdraid can recover those arrays with any disk, albeit with some data loss, because the cache battery that made it go vroom was a faulty SPOF.

Most small to medium VMware clusters have a single point of failure at the storage layer. Even larger systems like vSAN will not guarantee data safety beyond the last snapshot/backup in case of a node outage, while ZFS or Ceph out of the box are safe even in case of disastrous failure. The vendors simply run into the same problems ZFS or Ceph do: "I've got NVMe and it goes no faster than my network link, why is this so slow?" Okay, so let's remove the safety, and now it works at NVMe speeds, with sufficient hardware redundancy that it only catastrophically fails during power outages or disastrous node/software failures. Those are rare enough that most people will have clusters with 10 years or more of uptime, until they don't, and lose minutes' worth of data.
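The "remove the safety" knob described above has a direct ZFS equivalent. A hedged sketch, assuming a hypothetical dataset named tank/vmstore; with sync=disabled, ZFS acknowledges synchronous writes from RAM before they reach stable storage, which is exactly the seconds-to-minutes-of-data trade being made:

```shell
# Hypothetical pool/dataset name; run as root on a ZFS host.
# Default: synchronous writes are committed to the ZIL before being acknowledged.
zfs get sync tank/vmstore

# The "go fast" knob: acknowledge sync writes before they hit stable storage.
# NVMe-like speeds, but a power loss can drop recently acknowledged writes.
zfs set sync=disabled tank/vmstore

# The "slow but safe" knob for important data: treat every write as synchronous.
zfs set sync=always tank/vmstore
```

This is the same trade the proprietary boxes make internally; ZFS just makes you flip the switch yourself.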

No vendor will support/repair data corruption, though. They'll say sorry and maybe give you a discount at renewal, or point you at the docs, which will say: do not use for databases, or enable the 'slow' feature for important data. If you want to make 'safe' go 'fast', expect to pay accordingly; look up the cost of a pair of 200G Ethernet switches plus optics.
 
Just because a vendor does it doesn't mean it's a good idea. Vendors sell some shady stuff, and it works, most of the time.
I am really curious about this. Can you point to an "it" a specific vendor does that is not a good idea in your view? More pointedly, what would be an example of "shady stuff"?

Most VMware clusters, even relatively large ones, have a storage design that basically resembles DRBD, mdraid, or Proxmox ZFS+replication
Interesting. I wasn't even aware that such solutions exist. Would you be so kind as to point me to a vendor that produces such a solution? I think you even pointed out earlier in the thread that traditional SAN is most common, so which is it?
I recently found out that mdraid runs inside some expensive "proprietary" external RAID controllers that won't accept unbranded disks; without the proprietary controller, mdraid can recover those arrays with any disk, albeit with some data loss, because the cache battery that made it go vroom was a faulty SPOF.
Please share who; such information would be most invaluable to the forum.
even larger systems like vSAN will not guarantee data safety beyond the last snapshot/backup in case of node outage
Sure it can, if you're willing to sacrifice usable capacity. TINSTAAFL. Most operators are fine operating degraded for short periods, but you CAN have additional replication if you so choose.

while ZFS or Ceph out of the box are safe even in case of disastrous failure,
No more or less than any other solution utilizing the same mechanism at the same parity level (replication, EC).

No vendor will support/repair data corruption, though. They'll say sorry and maybe give you a discount at renewal, or point you at the docs, which will say: do not use for databases, or enable the 'slow' feature for important data. If you want to make 'safe' go 'fast', expect to pay accordingly; look up the cost of a pair of 200G Ethernet switches plus optics.
Fast and slow are relative. Design the solution for the need, not the other way around. If you NEED 20GB/s through a single link, then 200G switches are a prerequisite regardless of price; alternatively, why are you pricing them if you don't NEED them?
 
@alex:
- Dell 'managed' IT: they'll sell Kubernetes (RHEL OpenShift) with an Emulex FibreChannel SPOF proprietary RAID. Ceph is available upon request, but the per-TB licensing cost is (much) higher. I've gotten similar constructions from just about anyone trying to sell me a small cluster.

- Classic file storage like Synology (business line), Nexenta, TrueNAS and a few more: that's just BTRFS/ZFS with a jacket, and the replicated storage is just a snapshot that is replicated every minute or so. There are a host of them sold as VMware and other hypervisor plugin storage, which means they could lose up to a minute or more of data if using file-level storage. Where possible there may be protocol-level replication (the S3 implementation/proxy replicates to both sides, for example), and most of the hypervisor plugins effectively create a software RAID1 solution between two iSCSI targets.
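The snapshot-shipping design these products wrap in a GUI can be sketched in a few lines. The names here (dataset tank/vm, replication peer nas2) are hypothetical; the essential point is that anything written after the last shipped snapshot is lost on failover:

```shell
# Hypothetical names: local dataset tank/vm, replication peer host nas2.
# Ship an incremental snapshot; run from cron every minute for "near-sync" replication.
prev=$(zfs list -t snapshot -o name -s creation -H tank/vm | tail -1)
now="tank/vm@rep-$(date +%Y%m%d%H%M%S)"
zfs snapshot "$now"
zfs send -i "$prev" "$now" | ssh nas2 zfs receive -F tank/vm
# Failing over to nas2 rolls back to $now: up to a minute of writes is gone.
```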

- Modern stuff like VAST, HPE GreenField, Dell ObjectScale and a few other "object storage" systems are just plain Intel boxes with RHEL/Debian "controllers", using Docker, Kubernetes or similar containerization with some special sauce to manage existing open source software; protocols like NFS or iSCSI/NVMe-oF are bolted on top of that. Most object storage is eventually consistent: Dropbox, Netflix etc. all had to work around this on AWS with stuff like S3mper or Snitch, and we're running into that issue right now with one of those vendors, where an NFS write followed by a read does not provide consistent metadata between multiple clients. I don't know how much I can tell about a few of those, but they have a write cache on every node, meaning the node becomes a SPOF.

Most of these proprietary vendors (including IBM, NetApp, etc.) will recommend unsafe practices to get the customer to attain the marketing numbers.

My point was: yes, it is possible, but you run into the same issues that people complain about here, hence why this topic exists. Ceph is too slow and expensive, I want something cheaper and faster. You can make Ceph run just as cheap and fast if you sacrifice the guarantee of your data, just like you can run ZFS with forced async. And then people will say "well, it worked on vSAN or with my existing SAN". Yes, because you don't understand the underpinnings of that storage. Your storage cannot magically go from point A to point B faster than the interface between A and B, or else it is lying to you. If your vSAN goes at NVMe speeds over 10-25G networking, something is seriously wrong.
 
I think I understand your point, but I don't subscribe to the viewpoint that "if I can do it myself, there is no justification for someone else to get paid."

There is nothing wrong, in my view, with a Nexenta or TrueNAS using ZFS internally; ZFS is tried and true and works well. When you buy a storage solution from these vendors, they provide a product that does a thing. They spent time building it, debugging it, continuing its support, and engineering the next generation/improvements/etc. They also provide documentation and engineering staff to help their customers achieve their goals. If you don't want to pay for that and would rather provide those services on your own, that is an option available to you. (I didn't mention Synology, QNAP, etc., as I'm not too familiar with their range of enterprise offerings, but it's safe to assume they would fall into this category as well.) The same goes for the larger, broader offerings (Dell, IBM, etc.): they operate at a different scale and cater to a larger type of shop that wants their services. The technologies employed are not particularly relevant as long as they meet their contractual obligations.

For ANY deployment, you have the choice of engineering a solution yourself or buying it from someone else. CapEx or OpEx, you'll pay one way or the other.
but you run into the same issues that people complain about here, hence why this topic exists
Oh, for sure. The difference is that one way you have to figure them out yourself, and the other way the vendor does so for you. I'm not advocating the superiority of either, just saying there are customers for both.
 
Why do you need external help? When the external auditors come (Grant Thornton, KPMG, etc.), the first thing they ask is "what if you get hit by a bus?" Who will support this and that? That is why you need to have external maintenance contracts.
 
@alex: I agree, but it is often sold as "Highly Available". People have to be more precise with language about what they mean by HA (marketing vs. reality).

But when people here say that something isn't fully HA because there is a gap, and then ask "why not, my proprietary stuff does it": well, it doesn't, not really. Or they say that e.g. mdraid across iSCSI is a bad thing. It is, really bad, because there are some serious failure scenarios, but it is not uncommon in proprietary "HA" storage today.
 
people have to be more precise with language what they mean with HA (marketing vs reality).
Ah, this is a sore point for me. The ONUS of decision-making is on the purchaser. If the purchaser doesn't understand what he is buying (for any reason short of outright fraud), then the fault lies there too. NO technology is without edge cases, and 99.999% uptime still means over 5 minutes of outage per year ON AVERAGE. Marketing can IMPLY a lot, but anyone who has the responsibility to make the decisions HAS TO READ THE FINE PRINT TOO.
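The five-nines arithmetic is easy to check:

```shell
# Allowed downtime per year at 99.999% availability, in minutes.
awk 'BEGIN { printf "%.2f\n", 365.25 * 24 * 60 * (1 - 0.99999) }'
# prints 5.26
```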

but when people here say that something isn't fully HA because
Like Vincent told Mia, "they talk a lot, don't they..." Why should this be of any relevance? "People" don't have any requirement to know what they are talking about, or whether it's true 100% of the time, or at all. On the other hand, they're probably right within the framework of what they have experienced. So what?

It is, really bad because there are some serious failure scenarios, but it is not uncommon in proprietary "HA" storage 'today'.
Seems like the type of thing "people" say. It's not true in my experience in any meaningful ratio. HA ≠ infallible in any case, which is why any serious operator gets the best of the possible solutions available to them AND THEN deploys a multilayer backup/DR/BC strategy anyway.