Cluster(-aware) file systems: Any official support planned?

kofik

Hi

I've landed in a job where we just recently renewed the SAN that serves both an RHV and a VMware cluster (the latter running specialized software that requires VMware). I've worked with standalone Proxmox and Ceph clusters before, but this place has had RHV for some time now and it has been OK so far.

With Red Hat shifting their focus towards OpenShift and pulling the plug on RHV by 2024, and a planned renewal of 3 RHV VM hosts in the upcoming months, I'm re-evaluating our situation. OpenShift Virtualization is way more expensive than Proxmox (with support, that is), RHV or vSphere if all we need is just 3 VM hosts. (It also looks like it is designed for larger environments; we're simply too small for them, it seems.)

Cluster-aware file systems are not officially supported by Proxmox, so on a SAN plain (thick) LVM is all it supports. That lacks snapshot support and uses more actual storage space. From what I've read, people have gotten things to work with GFS2 and the like, but it wouldn't be officially supported by Proxmox if I opened a ticket related to storage...
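For reference, the plain thick-LVM setup on a shared LUN boils down to roughly this; the volume group, storage and device names below are just placeholders:

Code:
# once, from a single node: put LVM on the (multipathed) SAN LUN
pvcreate /dev/mapper/mpatha
vgcreate vg_san /dev/mapper/mpatha

# /etc/pve/storage.cfg entry (or: pvesm add lvm san-lvm --vgname vg_san --shared 1)
lvm: san-lvm
        vgname vg_san
        content images,rootdir
        shared 1

Guest disks on such a storage are raw, thick-provisioned LVs without snapshot support, which is exactly the limitation I mean.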

If the storage appliance I'm stuck with supported NFS, the choice would be easy (a file-based storage gets me qcow2 images and thus snapshots, see the sketch below), but alas, it doesn't. Are there any plans to add support for cluster-aware shared storage anytime soon?
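A hypothetical NFS entry would look something like this (server, export and storage names are made up), and with qcow2 images on it snapshots would just work:

Code:
# /etc/pve/storage.cfg
nfs: nas-vmstore
        path /mnt/pve/nas-vmstore
        server 192.0.2.10
        export /export/vmstore
        content images
        options vers=4.2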

From the glimpses I've caught below the management layer of oVirt/RHV, their shared-storage implementation also has its complexities: one VM host has to become the "storage pool manager", which handles locking LVs so they are only in use by one host, plus a couple more things. I can actually understand why the Proxmox developers might want to avoid such things. ;-)
 
Instead of buying a new SAN, go for:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pveceph

As long as you follow the hardware recommendations for small HCI clusters (min. 3 nodes, 100 Gbit networking, NVMe for OSDs), you will get decent performance and HA storage with snapshot support and other cool Ceph features, all covered by our support teams.
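A rough sketch of the steps once the PVE cluster itself exists; the Ceph network range, device and pool names are just examples:

Code:
# on each of the 3 nodes
pveceph install
pveceph init --network 10.10.10.0/24   # only on the first node
pveceph mon create
pveceph osd create /dev/nvme0n1

# once, after all OSDs are up: create a pool plus the matching PVE storage entry
pveceph pool create vm-pool --add_storages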
 
Unfortunately the SAN is already here and in production. The problem is that the decision to renew the SAN was made around fall last year, based on the lifecycle of the previous storage system, which was reaching EoL and had started showing oddities over the last year. Given the current hardware availability situation, the new storage system was finally put into production about 2 months ago ... and then we got hit by the announcement from Red Hat.

Ripping out the new SAN isn't really going to fly then ... I wish the discussion about buying storage or going HCI had come up this year; obviously in that case I'd have gone with a 3-node HCI cluster with Ceph. ;-)
 
Yeah @kofik, I'm in the same boat. I also tried GFS2 and it crashed everything, so I dropped the requirement completely.

I did however create an HA VM in my cluster that uses a large chunk of storage and has a ZFS pool inside (SSD metadata, HDD storage from two different SANs), and it exports this data as ZFS-over-iSCSI storage back to PVE itself; on it I run the VMs that I need to snapshot. I came to terms with the fact that I cannot "provide snapshots for everything all the time", so I switched to "I can provide snapshots for pretty much anything, just not at the same time". If I need to test a specific VM (e.g. an update), I just move it online from the LVM SAN to the iSCSI ZFS, snapshot it, do my stuff and, if everything went smoothly, migrate it back to the LVM. Yes, it's not optimal and not always super fast, but it is one way to go.
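The round trip is plain qm tooling; roughly like this (the VM ID, disk and storage names are placeholders for my setup):

Code:
# move the disk online to the snapshot-capable ZFS-over-iSCSI storage
qm move_disk 101 scsi0 zfs-iscsi --delete 1

# snapshot, do the risky work, then drop the snapshot (or roll back)
qm snapshot 101 pre-update
qm delsnapshot 101 pre-update   # or: qm rollback 101 pre-update

# afterwards, move the disk back to the thick-LVM SAN storage
qm move_disk 101 scsi0 san-lvm --delete 1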
 
If you have the budget, and depending on the type of SAN you have, you could front-end the SAN with Blockbridge software and get shared storage, snapshots, HA, native Proxmox management, etc. It would be something close to the "SAS Cluster" architecture listed here:
https://www.blockbridge.com/architectures/


Blockbridge: Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
If you have the budget, and depending on the type of SAN you have, you could front-end the SAN with Blockbridge software and get shared storage, snapshots, HA, native Proxmox management, etc. It would be something close to the "SAS Cluster" architecture listed here:
https://www.blockbridge.com/architectures/
Just software or also the hardware endpoints? What is needed on the software side?
 
Just software or also the hardware endpoints? What is needed on the software side?
Hardware would also be needed.
For HA: two servers with reasonable specs. Since the SAN is most likely not NVMe, something like an AMD EPYC 7000-series CPU, 64G of RAM and the appropriate cards for SAN/network connectivity. It could be even lower, based on the overall details/requirements. The vote could be a VM or some anemic 1U server for better isolation. Existing hardware could be used to minimize the investment. We would have to review and "approve" it, since bad hardware becomes a support nightmare for all.

We also have customers using a "Solo" installation, which we highly discourage, but for some workflows (backup/archive) it may be OK.


 
Hardware would also be needed.
Most of the time this is a no-brainer (towards no). You normally don't buy a SAN only to put another layer in front of it to fix the shortcomings of your virtualization environment.

I could however see a use case for recycling an old SAN. We have our old-old SAN, which still works, lying around; it has 50U+ of disk shelves with 150+ disks (15k, 450 GB, dual-FC, 3.5"), which work via normal (4 Gbit) FC cards in a 4-way setup, yet we lack a good highly-available setup around it. We tried it as a proof of concept with one node and ZFS over all the multipathed disks, and it works flawlessly. Unfortunately, we now have a simple 2U HA SAN with 24x 1.2 TB SSDs, which is much faster and draws significantly less power (and takes less space).
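For anyone curious, that single-node proof of concept boils down to roughly this (multipath device, pool and storage names are placeholders, and the vdev layout is just an example):

Code:
# the FC LUNs show up as multipath devices under /dev/mapper
multipath -ll

# one ZFS pool across the multipathed disks
zpool create tank \
    raidz2 /dev/mapper/mpatha /dev/mapper/mpathb /dev/mapper/mpathc \
           /dev/mapper/mpathd /dev/mapper/mpathe /dev/mapper/mpathf

# register it as a (local) storage on that node
pvesm add zfspool tank-vm --pool tank --content images,rootdir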
 
Most of the time this is a no-brainer (towards no). You normally don't buy a SAN only to put another layer in front of it to fix the shortcomings of your virtualization environment.
We are absolutely on the same page. Front-ending old, often EOL, equipment to give it a new life is not something we want to support without a good technical reason. Once you put your software in front of it, you "own" it, including all of its oddities and performance implications. One case that comes to mind was front-ending an older all-flash SAN to present it to OpenStack via our Cinder driver, in order to carve it up on the fly into small chunks through automation.


 
Ripping out the new SAN isn't really going to fly then ... I wish the discussion about buying storage or going HCI had come up this year; obviously in that case I'd have gone with a 3-node HCI cluster with Ceph. ;-)
There is no need to go with only one technology. Configure your new servers with HCI in mind and use Ceph and the SAN in parallel.

PS: a 3-node Ceph cluster isn't that great; for production use I would prefer to have at least 4 nodes.
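Running both side by side is really just two entries in /etc/pve/storage.cfg, roughly like this (names are placeholders); VMs can then be moved between them online with qm move_disk:

Code:
# /etc/pve/storage.cfg
rbd: ceph-vm
        pool vm-pool
        content images,rootdir
        krbd 0

lvm: san-lvm
        vgname vg_san
        content images,rootdir
        shared 1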
 
