Best storage for a cluster in a hostile environment?

Chentoa · Aug 31, 2021

What is the best storage choice for a Proxmox cluster in a hostile environment ? By hostile environment I mean an environment with frequent power outages... VERY frequent power outages ... Like 300 power outages in 2 years...

The best of course would be to eliminate these power outages by installing UPS, but I just can not count on that, and there's not much I can do about it. Sometimes there is a UPS, sometimes not, sometimes there is one but it isn't working...

The existing situation: 2 years ago I installed a standalone Proxmox server in a remote area of an African country. This server hosts a few containers (postfix, samba) and a VM (a database application). The server did very well over the last 2 years, surviving numerous power outages without a glitch (as I said, about 300 in 2 years). It's an HP ProLiant server with HW-RAID. Proxmox is installed on EXT4. I think one of the reasons it behaved so well despite the harsh conditions is the HW-RAID with its battery-backed cache.

Now we want to set up a 3-node cluster. The question is: given the electrical environment, what would you recommend for storage?

I'd appreciate your opinions on my analysis, which is:

- Software-based RAIDs like ZFS and Ceph should probably be avoided. I read on this forum a few posts about people having problems with ZFS and power failure (https://forum.proxmox.com/threads/zfs-pool-failure-after-power-outage.90400/ and https://forum.proxmox.com/threads/zfs-data-volume-lost-after-power-failure.28663/). I guess it's the same with Ceph, but I couldn't find anything relevant.
I think it's better to rely on the HW-RAID, at least we'll benefit from its battery-backed cache.

- Local EXT4 storage would of course work well, because that's what we have now. So that would be my default choice if nothing else works, but I would prefer shared storage.

- I think I'll give GlusterFS a try. How do you think HW-RAID + EXT4 + GlusterFS would behave with frequent power failures? Theoretically it should be as stable as plain HW-RAID + EXT4, no?

ness1602 · Aug 31, 2021

In my experience, I would stay away from shared storage if you have electricity problems; it is just not that stable, unfortunately.

Chentoa · Aug 31, 2021

@ness1602 : may I ask what kind of shared storage you used in your experience ?

t.lamprecht · Aug 31, 2021

If shared storage is a requirement, I'd recommend Ceph in general, as it's very resilient and (by default) requires that two out of three replicas are confirmed written before returning the OK for that write IO request to the application that submitted it.

It still must be coupled with disks that can flush all pending writes from the disk internal cache to its actual persistent storage cells, as that is something the file-system has no control over. Enterprise SSDs often use a capacitor for achieving that.
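
The "two out of three replicas" rule described above can be sketched as a toy model. This is only an illustration of the acknowledgement rule as stated in this post, not how Ceph is implemented; the `Replica` class, `replicated_write` function and `min_confirmed` parameter are hypothetical names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One storage node holding a copy of each object (hypothetical model)."""
    name: str
    online: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key: str, value: bytes) -> bool:
        # A replica that is down cannot confirm the write.
        if not self.online:
            return False
        self.data[key] = value
        return True


def replicated_write(replicas, key, value, min_confirmed=2):
    """Report success only if at least `min_confirmed` replicas confirmed."""
    confirmed = sum(1 for r in replicas if r.write(key, value))
    return confirmed >= min_confirmed


nodes = [Replica("node1"), Replica("node2"), Replica("node3")]
ok_all_up = replicated_write(nodes, "block-7", b"payload")

nodes[2].online = False  # one node lost power
ok_one_down = replicated_write(nodes, "block-8", b"payload")

nodes[1].online = False  # a second node lost power
ok_two_down = replicated_write(nodes, "block-9", b"payload")

print(ok_all_up, ok_one_down, ok_two_down)
```

With one node down the write is still acknowledged; with two nodes down it is not, so the application never gets an OK for data that exists on a single node only.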

Chentoa · Aug 31, 2021

OK, thanks. Given that I have plain HDDs, the best option would probably be to go with local storage.

So my assumption that a GlusterFS solution based on a battery-backed HW-RAID would prove more resilient seems to be wrong...

ness1602 · Aug 31, 2021

I've tested Ceph, Gluster and NFS, but when you have an unstable network, sometimes the network (and thus the storage) doesn't come up in a timely manner.

spirit · Aug 31, 2021

Maybe ZFS with a good small datacenter SSD (with supercapacitor) for the log journal.

Chentoa · Sep 1, 2021

@spirit : thanks for the idea, I've never heard of supercapacitor SSDs before. It's worth a try.

As a side note, I am surprised that there are people experiencing ZFS problems after power loss (https://forum.proxmox.com/threads/zfs-pool-failure-after-power-outage.90400/ and https://forum.proxmox.com/threads/zfs-data-volume-lost-after-power-failure.28663/), given that Oracle states clearly that "the file system can never be corrupted through accidental loss of power or a system crash".

https://docs.oracle.com/cd/E19253-01/819-5461/zfsover-2/ :
"ZFS is a transactional file system, which means that the file system state is always consistent on disk. Traditional file systems overwrite data in place, which means that if the system loses power, for example, between the time a data block is allocated and when it is linked into a directory, the file system will be left in an inconsistent state. (...)
With a transactional file system (...) the file system can never be corrupted through accidental loss of power or a system crash."

Reading those lines my conclusion could be that ZFS is exactly what I am looking for.
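
To make the transactional idea in the Oracle quote concrete, here is a minimal sketch of a copy-on-write commit. It is a toy model and not ZFS code; `CowStore` and `crash_before_root_update` are hypothetical names, and the sketch assumes that the disk really persists every write it acknowledges.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}   # block id -> contents, only ever appended to
        self.next_id = 0
        self.root = None   # id of the block holding the committed state

    def _alloc(self, contents):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = contents
        return block_id

    def commit(self, new_state, crash_before_root_update=False):
        # Step 1: write the new state to a fresh block; the old block stays intact.
        new_block = self._alloc(dict(new_state))
        # Simulated power loss between writing the data and switching the root.
        if crash_before_root_update:
            return
        # Step 2: switch the root pointer in one step.
        self.root = new_block

    def read(self):
        return self.blocks[self.root] if self.root is not None else {}


store = CowStore()
store.commit({"a.txt": "v1"})
store.commit({"a.txt": "v2"}, crash_before_root_update=True)
after_crash = store.read()
print(after_crash)
```

After the simulated crash the store still returns the complete old state, never a half-written mix of old and new.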

t.lamprecht · Sep 1, 2021

Chentoa said:
@spirit : thanks for the idea, I've never heard of supercapacitor SSDs before. It's worth a try.

FYI: That's the same thing I talked about.

Normally the filter for such SSDs is "PLP" or "power loss protection".

Chentoa said:
As a side note, I am surprised that there are people experiencing ZFS problems after power loss (https://forum.proxmox.com/threads/zfs-pool-failure-after-power-outage.90400/ and https://forum.proxmox.com/threads/zfs-data-volume-lost-after-power-failure.28663/), given that Oracle states clearly that "the file system can never be corrupted through accidental loss of power or a system crash".

As said, the FS has no control whatsoever over how the actual disk controllers manage their fast but ephemeral cache.
If the disk controller tells the OS that a write operation went through, but the data is only in the cache and thus still in flight (not yet written to the actual persistent layers), then a power loss can still leave a corrupt state. Enterprise SSDs, often also called "datacenter SSDs", avoid that in one of two ways: either capacitors provide enough emergency power to flush the caches on a power loss, or the drive ensures that every write has actually reached the persistent flash before telling the OS that it completed. The latter takes a performance hit, though, and it does not solve every power-loss-related issue.
See the following answer for a more in-depth explanation of data corruption in SSDs due to power loss: https://serverfault.com/a/925597
PLP gives you the best of both worlds, faster IOPS while not giving up any data safety.
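
The difference between a drive with and without PLP can be illustrated with a small simulation. This is a hypothetical model (the `Disk` class and its `has_plp` flag are made up for the example) and not a description of any real drive firmware.

```python
class Disk:
    """Toy disk with a volatile write cache (not a real driver)."""

    def __init__(self, has_plp: bool):
        self.has_plp = has_plp
        self.cache = {}       # fast, lost on power failure
        self.persistent = {}  # survives power failure

    def write(self, lba: int, data: bytes) -> bool:
        # The disk acknowledges as soon as the data sits in its cache.
        self.cache[lba] = data
        return True

    def flush(self) -> None:
        self.persistent.update(self.cache)
        self.cache.clear()

    def power_loss(self) -> None:
        # With PLP, capacitors give the drive time to drain the cache first.
        if self.has_plp:
            self.flush()
        self.cache.clear()


def acked_writes_lost(disk: Disk) -> int:
    """Count writes that were acknowledged but are gone after a power loss."""
    acked = {}
    for lba in range(4):
        data = f"block-{lba}".encode()
        if disk.write(lba, data):
            acked[lba] = data
    disk.power_loss()
    return sum(1 for lba, data in acked.items()
               if disk.persistent.get(lba) != data)


lost_consumer = acked_writes_lost(Disk(has_plp=False))
lost_plp = acked_writes_lost(Disk(has_plp=True))
print(lost_consumer, lost_plp)  # prints "4 0"
```

In both cases the OS was told every write succeeded; only the drive with PLP actually still has the data afterwards.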

Note that the above issue really needs some bad luck and consumer HW to actually happen, especially with transactional storage like ZFS, which handles power loss much better than most other FS (although it can only be as good as the underlying HW allows it to be).
But you are talking about hundreds of outages, which means there's quite a chance that you'd run into something like this sooner or later.
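
A quick back-of-the-envelope calculation shows why rare bad luck adds up over many outages. The per-outage probabilities below are purely hypothetical placeholders, not measured values, and the formula assumes that outages are independent of each other.

```python
def chance_of_at_least_one_failure(p_per_outage: float, outages: int) -> float:
    """P(at least one corruption) = 1 - (1 - p)^n, assuming independent outages."""
    return 1.0 - (1.0 - p_per_outage) ** outages


OUTAGES = 300  # the figure mentioned in this thread (about 300 in 2 years)

# Hypothetical per-outage corruption probabilities, for illustration only.
for p in (0.0001, 0.001, 0.01):
    risk = chance_of_at_least_one_failure(p, OUTAGES)
    print(f"p={p:.4f} per outage -> {risk:.1%} over {OUTAGES} outages")
```

Even if a single outage only has a 1-in-1000 chance of causing damage, 300 of them give roughly a 1-in-4 chance of hitting the problem at least once.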

Also note that, with ZFS being a checksumming file system, you actually notice such errors there, and you also have a higher chance of resolving them (automatically or at least manually, although the latter is quite involved).
Other FS may also break without even noticing that some data got corrupted in a minor or even major way, especially if the data isn't frequently in use and/or there are no higher-level checks from the applications.
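
The checksumming idea can be sketched in a few lines. This only illustrates the principle of storing a checksum next to the data and verifying it on read; SHA-256 is used here for convenience and is an assumption of the example, and the `store`/`verify` helpers are hypothetical, not part of any ZFS API.

```python
import hashlib


def store(block: bytes) -> dict:
    """Keep the data together with a checksum computed at write time."""
    return {"data": bytearray(block),
            "checksum": hashlib.sha256(block).hexdigest()}


def verify(record: dict) -> bool:
    """Recompute the checksum on read and compare it with the stored one."""
    return hashlib.sha256(bytes(record["data"])).hexdigest() == record["checksum"]


record = store(b"database page 42")
ok_before = verify(record)

record["data"][0] ^= 0x01  # simulate silent corruption: flip a single bit
ok_after = verify(record)

print(ok_before, ok_after)  # prints "True False"
```

A file system without such a check would hand the flipped data back to the application as if nothing had happened.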

IMO both ZFS and Ceph can be an OK fit, but with that number of power outages you need HW that can work around the issues they create. With both you can also "cheat" by using such HW only for the (log) journal while keeping cheaper HW for the rest of the system.

In any case: I'd still recommend some heavy testing and a tested backup strategy before deploying important systems on any setup.

Chentoa · Sep 1, 2021

t.lamprecht said:
FYI: That's the same thing I talked about. Normally the filter for such SSDs is "PLP" or "power loss protection".

Yep, you are right, I overlooked that one, sorry!

PLP SSDs really seem to be the right thing to have in an unstable electrical environment.

So, to sum it all up, the best choices would be :

- Either PLP SSD + local ZFS + PVE-zsync
- Or PLP SSD + shared Ceph/RBD

I'm still wondering how battery-backed HW-RAID + GlusterFS would do. GlusterFS can take advantage of HW-RAID (contrary to ZFS or Ceph, which need direct access to the disks), and the HW-RAID battery should ensure some level of consistency. @ness1602, in your test with GlusterFS, did you use some kind of HW-RAID? In your post you mention "unstable network ... thus storage", but I plan to use the 3 Proxmox nodes as shared storage, with a mesh network, so the network shouldn't be an issue.

In the end, given all the comments here, I plan to :
- Test HW-RAID + GlusterFS (mainly because it doesn't involve additional disks to buy)
- And if the result isn't satisfying, fall back on ZFS or Ceph, with PLP SSDs

With, obviously, frequent backups in place !
