Just because a vendor does it doesn't mean it's a good idea. Vendors sell some shady stuff, and it works most of the time. Most VMware clusters, even relatively large ones, have a storage design that basically resembles DRBD, mdraid, or Proxmox ZFS+replication, and most vendors use some or all of the same open source software. I recently found out that some expensive "proprietary" external RAID controllers, the kind that won't take unbranded disks, actually run mdraid internally. As a result, plain mdraid without the proprietary controller can recover those arrays with any disk, albeit with some data loss, because the cache battery that made them go vroom was a faulty single point of failure.
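If a controller really is md under the hood, recovery can in principle look like standard mdadm assembly. A sketch, assuming the disks carry standard md superblocks (device names here are hypothetical, and whether a given vendor's format is readable this way is not guaranteed):

```shell
# Inspect each disk for md superblocks (read-only, safe to run).
mdadm --examine /dev/sdb /dev/sdc /dev/sdd

# Try to assemble the array from whatever metadata is found;
# --readonly avoids writing anything until you trust the result.
mdadm --assemble --readonly /dev/md0 /dev/sdb /dev/sdc /dev/sdd

# If a member is missing or stale, force assembly from the
# remaining disks. This is the step where data loss can occur:
# anything that only lived in the (dead) controller cache is gone.
mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc
```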
Most small to medium VMware clusters have a single point of failure at the storage layer. Even larger systems like vSAN will not guarantee data safety beyond the last snapshot/backup in case of a node outage, while ZFS or Ceph out of the box are safe even after a disastrous failure. The vendors simply run into the same problem ZFS and Ceph do: "I've got NVMe and it goes no faster than my network link, why is this so slow?" Okay, so let's remove the safety. Now it works at NVMe speeds, and with sufficient hardware redundancy it only catastrophically fails during power outages or disastrous node/software failures. Those are rare enough that most people will have clusters with 10 years or more of uptime, until they don't and lose minutes' worth of data.
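In ZFS terms, the "remove the safety" knob is synchronous write handling. A sketch of the trade-off (the pool/dataset name `tank/vms` is made up for illustration):

```shell
# Default (sync=standard): sync writes hit the intent log on stable
# storage before being acknowledged, so a crash loses no acked data.
zfs get sync tank/vms

# The fast-but-unsafe mode: acknowledge sync writes immediately and
# only flush with the regular transaction group commit. A power loss
# can now discard several seconds of already-acknowledged writes --
# exactly the "minutes worth of data" failure mode.
zfs set sync=disabled tank/vms

# The pay-for-it alternative: keep sync=standard and add a dedicated
# low-latency SLOG device so sync writes stop being the bottleneck.
zpool add tank log /dev/nvme0n1
```

This is the same trade the vendors quietly make; ZFS just makes you flip the switch yourself.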
No vendor will support/repair data corruption, though. They'll say sorry and maybe give you a discount at renewal, or point you at the docs, which will say: do not use for databases, or enable the 'slow' feature for important data. If you want 'safe' to also go 'fast', expect to pay accordingly; look up the cost of a pair of 200G Ethernet switches plus optics.