Hi,
I'm looking at how best to use shared storage to keep things consistent. Preferably NFS, though iSCSI could also be used if that turns out to be safer.
We would run two NAS nodes that replicate to each other for redundancy, relying on the NAS's built-in replication mechanism to keep its internal filesystem consistent for the NFS shares.
On the NAS there would then be a qcow2 file for each VM, accessed over NFS. That file is the VM's storage, which means the qcow2 file contains the VM's own filesystem (for example LVM+XFS). The NAS knows nothing about that inner filesystem (it just sees a qcow2 file), so replication to the secondary NAS node will not take it into account; it will simply replicate according to its scheduled replication jobs.
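For concreteness, the setup I have in mind would look roughly like this in /etc/pve/storage.cfg (server address, export path and storage name are made up):

    nfs: nas-vmstore
        server 10.0.0.10
        export /export/vmstore
        path /mnt/pve/nas-vmstore
        content images
        options vers=4.2

    # each VM then gets a qcow2 image under that mount, e.g.
    # /mnt/pve/nas-vmstore/images/100/vm-100-disk-0.qcow2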
Am I understanding correctly that if such a replication snapshot is taken mid-write, the second NAS could end up with a corrupt filesystem inside the qcow2 file? I would guess that XFS journaling should, in theory, be able to repair that if you ever need to use the replicated copy?
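As a sanity check on the replicated copy, I was thinking of something like the following, strictly read-only and only ever on the secondary (paths, device and VG name are made up):

    qemu-img check /mnt/secondary/images/100/vm-100-disk-0.qcow2      # qcow2-level consistency
    modprobe nbd max_part=8
    qemu-nbd --read-only --connect=/dev/nbd0 /mnt/secondary/images/100/vm-100-disk-0.qcow2
    vgchange -ay guestvg                                               # activate the guest's LVM VG
    xfs_repair -n /dev/guestvg/root                                    # dry run, only reports problems
    mount -o ro,norecovery /dev/guestvg/root /mnt/inspect              # norecovery: inspect without replaying the XFS journal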
Then there is the matter of sync versus async NFS. I've found that sync mode hurts performance considerably. I'm leaning towards sync 'to be sure', but what kind of risk are we actually talking about here? If the NAS (or, more likely, the network) fails and we end up with an inconsistent XFS filesystem, should I expect journaling to bring us back to a consistent state? Or are there other risks I'm missing?
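For reference, this is the distinction I mean on the export side (paths and subnet are made up); the client-side -o sync mount option would be a separate knob again:

    # /etc/exports on the NAS
    /export/vmstore  10.0.0.0/24(rw,sync,no_subtree_check)     # ack writes only once they hit stable storage
    /export/scratch  10.0.0.0/24(rw,async,no_subtree_check)    # faster, but acked writes can be lost if the NAS crashes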
At the application level, the riskiest components are probably the databases, which have their own journal/WAL to (theoretically) recover from an abnormal shutdown. I'm less sure about the OS itself, though I would assume any modern OS (like RHEL) can cope with its own files being in an inconsistent state after a crash. Or would that be incorrect?
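Taking PostgreSQL purely as an example of what I mean by relying on the WAL, I would at least confirm durability hasn't been switched off:

    psql -c "SHOW fsync;"                  # must be 'on' for WAL-based crash recovery to be trustworthy
    psql -c "SHOW synchronous_commit;"     # 'off' risks losing the last few commits, but not corruption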
I've played around with iSCSI as well. One thing I noticed is that writes do not seem to be synchronous; performance is actually better than asynchronous NFS. If I wanted to make those writes synchronous, would that have to be configured on the SAN side, or is that something Proxmox could control?
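On the Proxmox side, the closest knob I've found is the per-disk cache mode (VMID, storage and volume names below are made up); I assume any write cache on the SAN target itself is a separate setting:

    qm set 100 --scsi0 san-lun:vm-100-disk-0,cache=none        # O_DIRECT: bypass the host page cache, pass guest flushes through
    qm set 100 --scsi0 san-lun:vm-100-disk-0,cache=directsync  # additionally treat every write as synchronous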
Regarding data consistency, there is no filesystem on the SAN side. Attaching the LUN directly to the VM means the VM's filesystem (for example XFS) is the only thing on it. Does that mean that a network or SAN interruption will, at worst, require a repair through the journal (and potentially at the application level as well)? Or are there other risks I'm missing?
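The direct-LUN setup I have in mind would be something like this (portal and target IQN are made up), handing the LUN straight to the VM rather than putting LVM or another filesystem on top:

    # /etc/pve/storage.cfg
    iscsi: san-lun
        portal 10.0.0.20
        target iqn.2024-01.com.example:vmstore
        content images        # 'use LUNs directly'; 'none' if it were only a base for LVM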
The one thing I would fear with the iSCSI approach is two initiators writing to the same filesystem. In practice that means two hosts running the same VM through misconfiguration or an HA split brain, which could cause substantial, unrecoverable corruption as older data gets overwritten... and journaling is not going to protect us from that. Thinking about it, this could also happen with the qcow2 file via NFS in a split-brain situation.
To prevent that scenario, would it make sense to implement some kind of locking mechanism on the storage, e.g. verifying (through a lockfile on the shared storage) that nobody else is running the same VM?
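Something like the sketch below is what I mean, just to illustrate the idea with flock on the shared mount (paths and VMID are made up). I realise Proxmox's cluster stack and QEMU's own image locking may already cover part of this, so consider it a rough extra safety net rather than a replacement:

    (
      flock --nonblock 9 || { echo "VM 100 appears to be running on another host"; exit 1; }
      qm start 100
      # the lock is only held while this subshell lives; a real version would keep fd 9 open for the VM's lifetime
    ) 9>/mnt/pve/nas-vmstore/locks/vm-100.lock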
Any advice would be appreciated