Shared Remote ZFS Storage

To me it looks like a Ceph cluster (whether hyperconverged on PVE or separate) generates trouble more often and more heavily - as seen in new threads here every week - than any NAS server, while an HA NAS like a NetApp/Isilon really only has a problem when the power goes out, which can be covered with a metro (second, coupled) installation and switches on emergency power. Over a span of 15 years I've seen 5 HA systems crash at one customer, and checking, servicing and repairing the data of such a system once it is in chaos takes a lot of time. Sometimes it would be better to stop the service, check, and then resume, rather than running into ever-growing data trouble while HA thinks it can keep doing its job but can't.
 
In those cases, waltar, people usually don't read enough, don't test before pushing to production, and then write to the forums with cries for help. Ceph is really battle-tested software, and I've worked with numerous companies that would never leave it. But on the other side, I've also worked with companies that had a really bad Ceph experience (the network part), and they usually moved to single-node PVE.
As always, get a good engineer (hypervisor + network) and you should alleviate all problems.
 
We actually have more trouble with our Isilon (now PowerScale) than with Ceph. Both have had 100% cluster availability over the past few years, but we regularly run into bugs on individual PowerScale nodes during provisioning. We're currently two weeks into a back-and-forth with Dell about their remote support feature not working properly, and we are talking to HPE about hardware because of the support issues. We've also had issues with non-standard NFS and SMB, performance problems that are unclear and undocumented, and of course hardware failures.

The problem with proprietary stuff is that *you* can't fix it. With Ceph, any issue can be fixed in hours: I can read the source code, understand the problem, and fix it myself if necessary. The Isilon is FreeBSD underneath, and a lot of it is readable Python or other code, but it is not running that BSD in any standard way and most of it is not documented.

As far as hardware failures go, they do happen. I've seen VMware clusters go under because a single proprietary SAN controller went down, and then the blame game began between several vendors. Luckily most VMware shops are now moving to KVM+Ceph in some fashion.
 
Yeah, the Isilon/PowerScale OS is a little funny: you can change NFS server options that are active, confirmed and functioning on the client ... but they reset to default again when the NFS service is restarted (all in the web UI) - who the hell programmed that, and what do you tell the customer?!? You're even expected to manipulate the customer's DNS service to integrate with an Isilon, when really the Isilon should integrate into the customer's network and not the other way around. Isilon file-service reads tend to be slow, and it isn't a common platform for virtualization. Special switch backend, and it's even special when growing by adding nodes.
HA NFS works really well on the Isilon, but in summary it's a special beast with pros and cons to deal with.
 
In those cases, waltar, people usually don't read enough, don't test before pushing to production, and then write to the forums with cries for help. Ceph is really battle-tested software, and I've worked with numerous companies that would never leave it. But on the other side, I've also worked with companies that had a really bad Ceph experience (the network part), and they usually moved to single-node PVE.
As always, get a good engineer (hypervisor + network) and you should alleviate all problems.
They also have empty wallets.....
 
Yeah, the Isilon/PowerScale OS is a little funny: you can change NFS server options that are active, confirmed and functioning on the client ... but they reset to default again when the NFS service is restarted (all in the web UI) - who the hell programmed that, and what do you tell the customer?!? You're even expected to manipulate the customer's DNS service to integrate with an Isilon, when really the Isilon should integrate into the customer's network and not the other way around. Isilon file-service reads tend to be slow, and it isn't a common platform for virtualization. Special switch backend, and it's even special when growing by adding nodes.
HA NFS works really well on the Isilon, but in summary it's a special beast with pros and cons to deal with.
I ran NetApp for years and never had an issue. You get what you pay for; with Ceph it feels more like you get half... you spend twice as much on network, compute and memory just to get it working correctly.
 
https://github.com/xrobau/zfs

That builds an NFS server from a standard Ubuntu machine. iSCSI makes it massively overcomplicated and far harder to back up and recover from disaster.

You can also tie it in with https://github.com/xrobau/zfs-replicate which sets up automatic replication jobs between servers.

This is not terribly complicated, which is why I suspect it's not a commercial product!
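
For anyone who hasn't set this up before, the mechanism underneath is plain zfs send/receive over SSH. A one-way replication job boils down to roughly the following - this is a simplified sketch, not the actual script from the linked repo, and the dataset and host names are just placeholders:

```bash
# Simplified sketch of one-way ZFS replication between two hosts.
# Dataset names (tank/vmdata), the replica name and the remote host
# are placeholders - adjust to your own pools.

SRC=tank/vmdata
DST=tank/vmdata-replica
REMOTE=root@backup-host
NOW=$(date +%Y%m%d-%H%M%S)

# Take a new snapshot on the source.
zfs snapshot "${SRC}@repl-${NOW}"

# Find the previous replication snapshot, if any.
PREV=$(zfs list -H -t snapshot -o name -s creation "${SRC}" | grep '@repl-' | tail -n 2 | head -n 1)

if [ -z "${PREV}" ] || [ "${PREV}" = "${SRC}@repl-${NOW}" ]; then
    # First run: full send.
    zfs send "${SRC}@repl-${NOW}" | ssh "${REMOTE}" zfs receive -F "${DST}"
else
    # Subsequent runs: incremental send from the previous snapshot.
    zfs send -i "${PREV}" "${SRC}@repl-${NOW}" | ssh "${REMOTE}" zfs receive -F "${DST}"
fi
```

Run from cron, something like that gives you periodic crash-consistent copies of the datasets on a second box - which is exactly the disaster-recovery-vs-HA distinction drawn in the next reply.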
 
That builds an NFS server from a standard Ubuntu machine.
That's never been an issue. Making an iSCSI host is relatively trivial, and using something like TrueNAS is quite beginner friendly.
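
To put some numbers behind "trivial": on a stock Debian/Ubuntu box with ZFS, exporting a zvol over iSCSI via targetcli (LIO) is only a handful of commands. A rough sketch - the pool/zvol name, target IQN and initiator IQN below are all made up:

```bash
# Rough sketch: export a ZFS zvol as an iSCSI LUN with targetcli (LIO).
# Pool/zvol name, target IQN and initiator IQN are placeholders.

apt install targetcli-fb                     # Debian/Ubuntu targetcli package

zfs create -V 200G tank/iscsi-vm-store       # zvol appears under /dev/zvol/

targetcli /backstores/block create name=vmstore dev=/dev/zvol/tank/iscsi-vm-store
targetcli /iscsi create iqn.2024-01.example.storage:vmstore
targetcli /iscsi/iqn.2024-01.example.storage:vmstore/tpg1/luns create /backstores/block/vmstore
targetcli /iscsi/iqn.2024-01.example.storage:vmstore/tpg1/acls create iqn.1993-08.org.debian:01:pvenode1
targetcli saveconfig
```

Whether that ends up simpler than NFS for shared PVE storage is the real argument here, of course.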

You can also tie it in with https://github.com/xrobau/zfs-replicate which sets up automatic replication jobs between servers.
That's good for disaster recovery, not so much for HA (high availability).

This is not terribly complicated, which is why I suspect it's not a commercial product!
Without guest quiescence and application management it's also not particularly dependable. Consider that anything that is in the guest's RAM cache and not yet committed to disk will not be in the sent snapshot, leaving files in an open/partial/outdated state. It's not a commercial product because it's not of commercially required quality.
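
For what it's worth, that gap can be narrowed (not closed) by freezing guest filesystems through the QEMU guest agent around the snapshot. A rough sketch on PVE, assuming the guest agent is installed and enabled in the guest; the VM ID and dataset name are placeholders:

```bash
# Rough sketch: turn a crash-consistent ZFS snapshot into a
# filesystem-consistent one for a single VM on PVE.
# Assumes the QEMU guest agent is running in the guest.
# VMID and dataset name are placeholders.

VMID=101
DATASET=tank/vmdata

# Ask the guest to sync and freeze its filesystems.
qm guest cmd ${VMID} fsfreeze-freeze

# Snapshot while frozen; keep this window as short as possible.
zfs snapshot "${DATASET}@repl-$(date +%Y%m%d-%H%M%S)"

# Thaw the guest again.
qm guest cmd ${VMID} fsfreeze-thaw
```

That only buys filesystem consistency; databases and the like still need their own dump/flush hooks, which is the application-management point above.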
 
Without guest quiescence and application management it's also not particularly dependable. Consider that anything that is in the guest's RAM cache and not yet committed to disk will not be in the sent snapshot, leaving files in an open/partial/outdated state.
As someone who *has* written and managed a real HA solution (I do VoIP), I assure you that you are 100% in the wrong market.

If you care about 'never losing a transaction regardless of what fails', you need to be looking at IBM Z-Series mainframes or the equivalent. What you want is (effectively) three VMs running in lockstep, writing the data to 4 different places, with every write call not returning until it's committed to at least three.

Luckily for us, most people in the real world are happy to get a 500 error and click 'refresh' to finish what they're doing. A great example of how this works in the real world is Netflix's "Chaos Monkey", which is more than happy to instakill machines and even whole clusters - without writing anything in memory to disk.

The reason behind using a standard distro is so that nothing is locked behind a confusing UI, and you're going to be running exactly the same packages a million other machines are using.
 
