Shared Remote ZFS Storage

It's disappointing to see this kind of disagreement in the forum :( Please be considerate.

The reality is that the requested functionality could indeed be valuable to some end users, particularly those with home labs. However, the costs associated with development, testing, maintenance, and ongoing support will outweigh any incremental revenue gains by many multiples.

Please understand that the PVE team is very thoughtful in prioritizing features that make the most sense for the platform. Given that they already have a cost-effective solution, I can't see this being a top priority. Most people paying for support need more enterprise features and reliability, not less...


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Somehow it feels like your association with BB introduces some bias. BTW, why doesn't BB have an entry-level open solution???
 
I don't get it: if Ceph says the minimum is 3 nodes, why don't people just accept that it works (really great, sometimes) with just 3 nodes? You don't need a $15k initial investment, just an okay network and good drives. Next.
 
I don't get it: if Ceph says the minimum is 3 nodes, why don't people just accept that it works (really great, sometimes) with just 3 nodes? You don't need a $15k initial investment, just an okay network and good drives. Next.
But it doesn't... Ceph demands high-performance hardware all around, i.e. a 10GbE SAN, NVMe drives, and higher-performance CPUs that can handle both the compute and the extra communication demands.
 
  • Like
Reactions: Johannes S
I appreciate the DESIRE for a storage solution that is fast, highly available, and "entry level" (which I just read as cheap or free).

What you don't seem to be grasping is that the COST of fast and highly available is incompatible with that last requirement. But since you seem to believe you can already do it yourself, why aren't you offering your solution commercially?

Let's put it differently. I assume you have some method of making a living, and you charge money for it. What would you say to a prospective customer who asks you why you don't charge less (or do it for free)?
 
  • Like
Reactions: Johannes S
But people just don't understand that you can have that with Ceph; just try it with any enterprise disks and you will see. Yes, for a 25-node, petabyte cluster you will need everything jtreblay says, but for entry level, plain 10G with enterprise disks will suffice.
I have a customer with 2.5G, Samsung SSDs, and Ryzen CPUs. They started with 3 nodes and are now at four. The cluster works great for a mix of CTs, VMs, and one or two Windows servers. And that's it.

And I've worked as a consultant for more than 10 firms in the last year, so this comes straight from experience.
 
It's disappointing to see this kind of disagreement in the forum :( Please be considerate.

The reality is that the requested functionality could indeed be valuable to some end users, particularly those with home labs. However, the costs associated with development, testing, maintenance, and ongoing support will outweigh any incremental revenue gains by many multiples.

Please understand that the PVE team is very thoughtful in prioritizing features that make the most sense for the platform. Given that they already have a cost-effective solution, I can't see this being a top priority. Most people paying for support need more enterprise features and reliability, not less...


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
While I can appreciate your corporate loyalty, I don't think it's relevant to the integrity of the "ask". If supporting "ZFS over iSCSI" wasn't a good idea, it wouldn't be there in the first place. The fact that there isn't a fully supported solution in the form of an appliance only speaks to BB's hold on the partnership.
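For context, "ZFS over iSCSI" is already a built-in PVE storage type. The following is only a sketch of what such an entry in /etc/pve/storage.cfg can look like; the addresses, pool, and target names are placeholders, not a tested configuration:

```
zfs: remote-zfs
        portal 192.0.2.10
        target iqn.2003-01.org.example:tank
        pool tank
        iscsiprovider LIO
        lio_tpg tpg1
        blocksize 4k
        sparse 1
        content images
```

The idea is that the plugin creates a zvol per virtual disk on the remote pool (managed over SSH) and exports it through the target; the point of contention in this thread is not the plugin but the lack of a supported, HA-capable appliance behind it.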
But people just don't understand that you can have that with Ceph; just try it with any enterprise disks and you will see. Yes, for a 25-node, petabyte cluster you will need everything jtreblay says, but for entry level, plain 10G with enterprise disks will suffice.
I have a customer with 2.5G, Samsung SSDs, and Ryzen CPUs. They started with 3 nodes and are now at four. The cluster works great for a mix of CTs, VMs, and one or two Windows servers. And that's it.

And I've worked as a consultant for more than 10 firms in the last year, so this comes straight from experience.
This is the truth!
I've stumbled across all of the shortcomings pointed out in this article first hand... I'm just lucky enough to be in "sandbox" mode with a patient employer... You can't do a small cluster or SMB-sized deployment with a decent workload without cutting-edge, expensive gear... There are 100s if not 1000s of SMBs with racks full of 1U DL380 G9/G10 servers and Cisco 1G network gear just trying to survive... Yes, they also have old FC storage, i.e. NetApp or EMC, and yes, that is usable with Proxmox. I can put TrueNAS on an 80TB 2U server and use replication and PBS to satisfy the need to a degree, but a PSS from Proxmox, supported by Proxmox, that is 1000% OOBE compatible, well, that would be a game changer!
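For reference, the "cost-effective solution" mentioned earlier in the thread is presumably PVE's built-in storage replication, which does scheduled ZFS send/receive between nodes rather than shared storage. A rough sketch, with a placeholder VMID and node name:

```
# replicate the local ZFS disks of VM 100 to node pve2 every 15 minutes
pvesr create-local-job 100-0 pve2 --schedule "*/15"

# check configured jobs and their last run
pvesr list
pvesr status
```

That gives fast failover with bounded data loss (the replication interval) on exactly the kind of modest hardware described above, which is not the same thing as true shared-storage HA, but it is a lot cheaper.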
 
Last edited:
  • Like
Reactions: floh8
I appreciate the DESIRE for a storage solution that is fast, highly available, and "entry level" (which I just read as cheap or free).

What you don't seem to be grasping is that the COST of fast and highly available is incompatible with that last requirement. But since you seem to believe you can already do it yourself, why aren't you offering your solution commercially?

Let's put it differently. I assume you have some method of making a living, and you charge money for it. What would you say to a prospective customer who asks you why you don't charge less (or do it for free)?

You're really overthinking this; free isn't the bottom line. Left without a choice by Proxmox, we will be putting our money into iXsystems.
 
I didn't
@jt_telrite, you seem to have a soft spot for the DL380... https://forums.truenas.com/t/boot-partition-on-sd-card/27220/8
You didn't like TrueNAS's answer that free users are just beta testers and are not driving the development roadmap, so you decided to try your luck on the Proxmox forum? https://forums.truenas.com/t/boot-partition-on-sd-card/27220/13

Those who are unwilling to invest in their infrastructure need the most support and complain the loudest.
like the bad attitude towards the question for help, or the complete misunderstanding of it either... I was asking the community to assist in adapting the instructions to "chain boot" Proxmox from an SD card onto a mirrored pair of SSDs. They kept interpreting it as trying to run the boot drive from SD and shooting it down as a bad idea... It sucks when arrogance leads the community... I can't believe that TrueNAS is MORE active on its boot drive than Proxmox. But I wasn't asking to move the boot drive, just GRUB... I got it working, but I won't be sharing the info...
 
Last edited:
25/40/100G networking is not expensive, every single server I’ve purchased in the past 2 years has 25G standard at this point, and a 48-port 25G switch is available for under $8k. 10G switches are pretty much standard, and managed ones are available for under $1k; most vendors don’t even sell Gigabit anymore in their new lineups. 100G switches allow breakouts into 25G too, so I seriously don’t see why any SMB wouldn’t be able to afford this; any decent datacenter will have 10 or 25G standard today. You can get an 8-port 10G switch for home use for under $200.

The overhead from Ceph is really minimal on CPU/RAM as well; obviously if you want high performance you need high-performance hardware, but ZFS can also consume quite a bit of resources. The calculation of the CRUSH placement is not much different from the overhead of mirrors; ZFS uses a very similar algorithm, just located on a single node.

It is definitely a lot cheaper than the downtime from a Jenga tower of ZFS 3-way mirrors over iSCSI; you still need the same network bandwidth as with Ceph, so I don’t see where you would save any bandwidth with such a setup. If you want a single copy or a single node, you can do that with Ceph too: set the failure domain to one node and n+k drives.
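To make that last point concrete, here is roughly what it takes (pool and rule names are made up; this is a sketch, not a recommendation): a replicated CRUSH rule whose failure domain is the OSD rather than the host, so all copies may land on one node.

```
# replicated rule that chooses OSDs instead of hosts as the failure domain
ceph osd crush rule create-replicated osd-only default osd

# point an existing pool at the new rule
ceph osd pool set vm-pool crush_rule osd-only
```

Obviously that trades away the node-failure protection that is Ceph's main selling point; it only shows that the failure domain is a configuration choice, not a hardware requirement.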

I have a 14-node Proxmox + Ceph cluster over dual 10G, shared with everything else in the datacenter, and it works perfectly fine with ~200 VMs, primarily for VDI, because Ceph is a lot like a mesh: you get the benefit of aggregate bandwidth. I have another 7-node Ceph cluster over dual 100G for HPC; you don’t need 2x100G unless you want NVMe-level throughput to every single VM. If you need that kind of performance, you will pay regardless of whether you use ZFS, LVM, or Ceph; they all have their strengths and drawbacks.

As I said, nobody is preventing you from sharing your local ZFS pool over iSCSI or NFS and using it as a VM backend; just don’t be surprised when your VMs all hang because a node is down.
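For anyone who wants to try that route, a minimal sketch of the NFS variant, assuming a dataset tank/vmstore on a storage box at 192.0.2.10 (addresses, subnet, and names are placeholders):

```
# on the ZFS box: export the dataset over NFS
zfs set sharenfs="rw=@192.0.2.0/24,no_root_squash" tank/vmstore

# on a PVE node: register it as a storage for VM images
pvesm add nfs zfs-nfs --server 192.0.2.10 --export /tank/vmstore --content images
```

The caveat above stands: that storage box is now a single point of failure for every VM that lives on it.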
 
Last edited:
25/40/100G networking is not expensive, every single server I’ve purchased in the past 2 years has 25G standard at this point
Optionally; not included. 25G ports are not trivial in cost: ~$300-500 per port, both at the HBA and at the switch. For a proper Ceph deployment you really want AT LEAST 2 per host, so for a three-node minimum that's 6 host bus ports and 6 switch ports.

The overhead from Ceph is really minimal on CPU/RAM as well
"minimal" is an interesting way of describing 1 core and 4GB ram per daemon. and before the inevitable "my cluster doesnt use that much" do a top on an OSD node during rebalance, just for funsies.

Ceph is a great solution for many different use cases, but that doesn't mean there isn't room for the traditional 2-node, 1-store HA cluster that many small shops used with vSphere from the very beginning. A "few 10 thou" here or there may not mean much to you, but a WHOLE SOLUTION using a shared iSCSI store such as a Dell ME5024 plus two nodes would cost a FRACTION of what you're describing.

Here's the ultimate "roll your own"
Just FYI, DRBD was supported under PVE in the past. They dropped it because... well, I hope you don't find out. (Hint: split brain.)
 
You can buy a TrueNAS system and get iSCSI HA, or, as you said, buy a proprietary iSCSI system with dual controllers and get a small cluster going. The problem with HA is the HA part: TrueNAS does something proprietary to get ZFS to switch between nodes, because you have to make sure the other node is dead before you can import the pool. Any ZFS HA system requires a custom locking extension on top of ZFS; that is why it is generally not done. Instead you put a true HA system like Gluster or Lustre on top of ZFS and replicate the data.

If you want to stay within Proxmox, go with Fibre Channel or SAS shared storage and use LVM. Technically you could write your own script to take over the pool from the other side and hook into the Proxmox HA features, but ZFS is not an HA storage system. TrueNAS has a proprietary build with proprietary hardware that can guarantee something like a SCSI lock (so you can't use SATA disks), but you need something to prevent the inactive node from mounting the ZFS pool; if you don't, or you do it wrong, at best your pool won't mount and, more likely, your pool will be corrupted. ZFS has a safety feature which prevents this unless you force it, and there is a reason for that. Building something like that into Proxmox would require dedicated/specific hardware compatibility, because I know from experience that not every enclosure/drive/controller out there can or does enforce host locks through a bus reset.
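To make the fencing point concrete, a naive manual takeover between two heads on a shared enclosure boils down to something like this (a sketch only; the pool name is a placeholder and the fencing step stands in for whatever mechanism you actually trust, which is exactly the part TrueNAS solves with proprietary hardware):

```
# 1. Make ABSOLUTELY sure the other head is dead (power fencing, SCSI-3
#    persistent reservations, ...). This is the hard, hardware-dependent part.
power-off-the-other-head        # placeholder, not a real command

# 2. Force-import the pool on the surviving head. The -f flag overrides the
#    "pool was in use by another system" check, i.e. the ZFS safety feature
#    mentioned above.
zpool import -f tank
zfs mount -a
```

Skip or botch step 1 and both heads write to the same disks, which is how pools get corrupted.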

Not sure where you get your pricing. As I said, PowerEdge and Supermicro servers all have at least 2x25G options on their fiber network modules at little to no extra cost over the 4x10G copper. I know for a fact that 25G is standard on Dell and HPE; you have to ask for copper 10G.

You can buy an Intel E810 2-port 100G card for $350-400 on Amazon, and a 32-port 100G switch from Dell we buy for ~$11,500/pair; that is 2 switches, including the active copper cables for the rack. So your price per port is a bit off; 400G is the pricey one right now.

The 1c/4G figure is also outdated; current documentation and builds range, on modern CPUs, from 0.25-0.5 cores and 1-8G RAM per OSD depending on your workload and OSD size. ZFS likewise recommends 1-2G/TB (so ~100G for a small 100TB build). With modern density you are likely fitting everything into 6-12 OSDs per node, and I don’t even know how to configure a server with less than 256GB RAM (each CPU needs 4 modules, so 8x32 or 8x64G per server).
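Putting those per-OSD figures side by side, a back-of-envelope sketch (the node shape below is assumed for illustration, not taken from a real build):

```
# assume 8 OSDs per node, middle of the ranges quoted above
OSDS_PER_NODE=8
RAM_PER_OSD_GB=4
echo "Ceph RAM per node:   $(( OSDS_PER_NODE * RAM_PER_OSD_GB )) GB"   # -> 32 GB
echo "Ceph cores per node: $(( OSDS_PER_NODE / 2 ))"                   # -> 4 (at 0.5c/OSD)
# versus the ZFS rule of thumb of ~1G/TB: a 100TB pool wants ~100GB on a single node
```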

But again, what are you doing with Ceph that requires such throughput? You are not going to need 100G on all nodes simultaneously without 2000+ clients. I have benchmarks elsewhere: with 200G per server (dual 100G) and the cluster loaded to the max, we average 15-30% CPU on 2x24c/48t Xeon Golds.
 
Last edited:
Optionally; not included. 25G ports are not trivial in cost: ~$300-500 per port, both at the HBA and at the switch. For a proper Ceph deployment you really want AT LEAST 2 per host, so for a three-node minimum that's 6 host bus ports and 6 switch ports.


"Minimal" is an interesting way of describing 1 core and 4GB of RAM per daemon. And before the inevitable "my cluster doesn't use that much": run top on an OSD node during a rebalance, just for funsies.

Ceph is a great solution for many different use cases, but that doesn't mean there isn't room for the traditional 2-node, 1-store HA cluster that many small shops used with vSphere from the very beginning. A "few 10 thou" here or there may not mean much to you, but a WHOLE SOLUTION using a shared iSCSI store such as a Dell ME5024 plus two nodes would cost a FRACTION of what you're describing.


Just FYI, DRBD was supported under PVE in the past. They dropped it because... well, I hope you don't find out. (Hint: split brain.)
You totally understand!!! TY
 
A Dell ME5024 (or any PowerVault system) stuffed with disks would cost in the neighborhood of $75k in HA mode with FC or iSCSI once you include all the licenses and management. Not much different from buying a few extra nodes with disks.

There is nothing preventing you from doing that today; Proxmox supports it, so I don't see why you would need Proxmox to build a special interface. Just go to Datacenter -> Storage -> Add -> iSCSI.
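Roughly what those GUI steps end up writing to /etc/pve/storage.cfg, together with the shared LVM layer people usually put on top (portal, target, and volume group names are placeholders, and the VG is assumed to have been created on the exported LUN beforehand):

```
iscsi: me5-san
        portal 192.0.2.20
        target iqn.2001-04.com.example:me5.array1
        content none

lvm: me5-vmstore
        vgname vg_me5
        shared 1
        content images
```

With shared 1, every node in the cluster can use the same volume group, and PVE coordinates access with its own cluster-wide locks.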
 
  • Like
Reactions: Johannes S
realistically, how often do you see hardware failures...
A lot... it is just numbers, and if you build in a SPOF you will hit it, probably at a time when you have no time to spare or when your infrastructure is most critical (e.g. annual accounts). Every error that can happen will happen. Therefore you plan for it and add redundancy to whatever fails first.

Sure, you can run your whole infrastructure from a single NAS and everything is fine, as long as you get lucky and nothing breaks, or it's not critical. It boils down to what an hour of outage will cost you in wages, lost production, not being reachable, stress, etc. For most customers we have, the cost is at least 10k per hour, and that pays for a lot of redundancy options in your hardware.
 
  • Like
Reactions: UdoB and Johannes S