Storage best practices for large setups

dlmw89

Hi there,

Currently one of my customers plans to migrate an existing VMware environment to a KVM-based solution.
The VMware setup is pretty large, with about 2,000 CPU cores, ~50 TB of memory, and ~2,000 TB of all-flash SAN storage.
Even though we would set up several Proxmox clusters, we still need a high-performance storage solution that can handle around 750 TB.

So the question is: what type of storage backend is recommended for that kind of setup?
BTW, the new storage needs to support both snapshots and thin provisioning.
Since snapshots and thin provisioning are both needed, we can't just use iSCSI or FC.

- Is there any well-known storage solution which supports the Proxmox ZFS over iSCSI storage backend?
- Would it be feasible to use NFS over a high-speed network, like 100 GbE?
- What about GlusterFS? Is it stable enough to serve large amounts of storage in an enterprise setup?
- Any other ideas?



Thanks in advance for your thoughts & input
 
You may want to look at Blockbridge. They are active in this forum. We use their solution and it would be a great fit for your configuration requirements. I can write more later if you have questions (need to run for now).
 
> You may want to look at Blockbridge. They are active in this forum. We use their solution and it would be a great fit for your configuration requirements. I can write more later if you have questions (need to run for now).
Thanks a lot for that tip! Blockbridge sounds really interesting. You also write that you are actually using Blockbridge. Any notable requirements or things to consider when designing a solution with it? Which integration are you using in Proxmox for that storage?
 
For big deployments, in my case I only use Ceph, with redundant switches. Works flawlessly. But if you bought VMware-only storage, then this is maybe not feasible for you.
 
> For big deployments, in my case I only use Ceph, with redundant switches. Works flawlessly. But if you bought VMware-only storage, then this is maybe not feasible for you.
OK, what's a big deployment in your case? Also, how fast is the network connection you are using: 25 GbE, 100 GbE, or even faster?

I have lots of experience with Ceph, but only with a fraction of the capacity needed now, about 30-40 TB. With capacity needs that are 20 times higher, I'm fairly concerned about the CPU time needed, and also about performance and stability when scaling Ceph up to hundreds of TBs.
 
> OK, what's a big deployment in your case? Also, how fast is the network connection you are using: 25 GbE, 100 GbE, or even faster?
>
> I have lots of experience with Ceph, but only with a fraction of the capacity needed now, about 30-40 TB. With capacity needs that are 20 times higher, I'm fairly concerned about the CPU time needed, and also about performance and stability when scaling Ceph up to hundreds of TBs.
Big deployments, in my case, are Ceph setups > 100 TB, I think. We usually start with 10G, but are now moving to 40G, and probably next year to 100G, because the equipment is finally affordable. For Corosync, 1 Gb is enough, and everything is redundant through bonding.
Stability is okay. I don't usually go for more than 20 nodes; I tend to break clusters up, just in case.
 
> OK, what's a big deployment in your case? Also, how fast is the network connection you are using: 25 GbE, 100 GbE, or even faster?
>
> I have lots of experience with Ceph, but only with a fraction of the capacity needed now, about 30-40 TB. With capacity needs that are 20 times higher, I'm fairly concerned about the CPU time needed, and also about performance and stability when scaling Ceph up to hundreds of TBs.
I can second @ness1602's answer, though I'd go for 2x 25 GbE; they usually have better latency than 40 GbE NICs. And it is not unusual to see several hundred TB on a hyper-converged PVE+Ceph cluster. But at some point a move off of the hypervisor will be feasible/necessary to maintain client latency.
 
> I would not scale a hyper-converged Proxmox Ceph cluster to this storage capacity. Though Proxmox Ceph is convenient and very well integrated, Corosync (i.e. the PVE cluster per se) doesn't scale well.
In general I agree, but scaling a PVE cluster and storage density per node are two different things, IMO. It heavily depends on the use case (the age-old answer :)). Anyway, that's semantics. ;)

> The VMware setup is pretty large, with about 2,000 CPU cores, ~50 TB of memory, and ~2,000 TB of all-flash SAN storage.
Do you have any numbers on the IOPS requirements (overall & per-VM peak)?

> Even though we would set up several Proxmox clusters, we still need a high-performance storage solution that can handle around 750 TB.
This already sounds like you want dedicated storage for multiple clusters; do I understand correctly?
 
> Thanks a lot for that tip! Blockbridge sounds really interesting. You also write that you are actually using Blockbridge. Any notable requirements or things to consider when designing a solution with it? Which integration are you using in Proxmox for that storage?

We are using Blockbridge's NVMe-oF plugin for Proxmox, which has worked flawlessly. It is very easy to configure (we scripted it easily). Essentially, each virtual disk gets a virtual LUN created and mapped to QEMU automatically. All of the native VM storage functions of Proxmox work as you would expect (snapshots, storage live migrations, VM live migrations between hosts, backups with Proxmox Backup, etc.).

Blockbridge supports VMware and all of its features including VAAI, as well as compression, VVOLs, etc., but I wouldn't recommend anyone choose VMware for any new projects. For those environments that can't switch, Blockbridge is extremely fast storage for VMware workloads.

Ceph is also an option. We have quite a few Ceph clusters (10?), with petabytes of HDD storage as well as about a petabyte of SSD storage. That said, the big issue with Ceph is performance, specifically latency and write IOPS, especially for erasure-coded pools. It makes sense that it costs a lot in terms of latency, given the CPU usage and network round trips to compute the CRUSH rule(s) and replicate blocks, plus its use of an object store (RADOS) as the base for block storage. Bandwidth is plentiful with larger block sizes, as long as latency isn't a concern. Our Blockbridge systems have about 1/10th the latency of Ceph, so they are more suitable for high-performance workloads that require this level of latency. Ceph SSD storage is our go-to for "general purpose" storage (think boot drives or drives that are not used all that much).
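To make the write-amplification and fan-out trade-off a bit more concrete, here is a minimal illustrative sketch in Python. The pool parameters (3x replication vs. a hypothetical k=4, m=2 erasure-coded profile, 4 MiB objects) are assumptions for illustration, not figures from the setups discussed here:

```python
# Illustrative write fan-out for a single 4 MiB RADOS object (assumed sizes).

def replicated_write(object_mib: float = 4.0, size: int = 3):
    """A size=3 replicated pool writes 'size' full copies of the object."""
    return object_mib * size, size              # (MiB written, OSD writes)

def ec_write(object_mib: float = 4.0, k: int = 4, m: int = 2):
    """A k+m erasure-coded pool writes k data chunks plus m coding chunks."""
    chunk = object_mib / k
    return chunk * (k + m), k + m               # less data, but more OSD writes

if __name__ == "__main__":
    for name, (mib, ops) in [("3x replicated", replicated_write()),
                             ("EC 4+2", ec_write())]:
        print(f"{name}: {mib:.1f} MiB written across {ops} OSD writes")
```

The point being: erasure coding reduces the raw data written (1.5x vs. 3x in this example) but touches more OSDs per write, which is exactly where the extra latency tends to come from.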

Ceph has had quite a few bugs which can cause high stress (human stress) when upgrading large clusters, so plenty of testing and planning is needed for high availability. There are also situations that can cause disruptions, such as network flapping if a switch is failing or ports go bad. Flapping can really cause confusion for any distributed system if it happens fast enough. We have had situations where Ceph got confused about which object was newer in a triple-replicated configuration when one of our switches decided it was going to die soon.

There are also cases where a single bad OSD (either software crashing or SSD hardware issues) can cause latency across the Ceph cluster. Finding and failing it takes a bit of work unless you have significant monitoring in place; without automation, it can become an issue pretty quickly.

In the end, Ceph can be great for a lot of things, but it's definitely not "free". It requires quite a bit of knowledge to manage properly and due to its distributed nature, you have to understand distributed systems and the various failure modes. It definitely isn't something you set up and forget.

Blockbridge upgrades are all managed and have 24x7 support. We've been running their systems for more than seven years and have never had any issues. Updates are transparent and non-disruptive, including the Proxmox plugin updates on Proxmox nodes.

If you are running a high-end enterprise Fibre Channel setup, you could add Blockbridge on top. We did this years ago with a multi-PB Fibre Channel SAN, migrating it slowly to iSCSI on Blockbridge. But it only makes sense if you're dealing with expensive storage, since you'll need to buy servers for Blockbridge anyway.

NFS is another option, but QCOW performance is somewhat poor and there are currently some limitations around snapshots. Finding a highly available, low-latency NFS solution is tough, though.

Note that we use redundant "everything": switches (with MLAG), NICs, servers, etc. are all redundant, for both the Blockbridge and Ceph systems. It is really the only practical approach to get the highest possible availability for network and storage connectivity. In "huge" environments, where an entire rack is a failure domain, I might not recommend this and would instead let a whole rack fail rather than deal with the complexity these redundancies add, but for most environments that isn't the reality.


> OK, what's a big deployment in your case? Also, how fast is the network connection you are using: 25 GbE, 100 GbE, or even faster?

Just to answer some of the questions you asked others: we use both 25 Gbps and 100 Gbps switch ports throughout our environment, with at least 2 active 100 Gbps connections to active Blockbridge nodes, along with multiple dedicated 100 Gbps connections between the physical servers that make up a single Blockbridge "node" (a pair of servers). Our servers have dual 25 Gbps, and some have both dual 25 Gbps and dual 100 Gbps for separation of workload and storage traffic. We have tested 200 Gbps connections too, which are commonly used for inter-Blockbridge server communication, where latency must be minimal but plenty of bandwidth is available for replication.


> I have lots of experience with Ceph, but only with a fraction of the capacity needed now, about 30-40 TB. With capacity needs that are 20 times higher, I'm fairly concerned about the CPU time needed, and also about performance and stability when scaling Ceph up to hundreds of TBs.

It depends on how dense your storage nodes are, as well as how much storage traffic is required. If you only have 5 nodes and each has 750 TiB of storage in it, you really have to start thinking about how long it will take to recover a node of that size, and you should be using dedicated nodes just for storage. Even 64-core nodes are going to use quite a bit of CPU capacity to manage that much storage. Plus, your monitors (MONs) should be plenty fast, with low-latency storage.

Think of it this way: with 768,000 GiB of available storage (750 TiB), 70% full, and 3 GiB/s of recovery throughput (a triple mirror would cause around 9 GiB/s of writes in most cases), you would be waiting nearly 50 hours for it to recover. It might be faster, but likely not, due to OSD software throughput limitations. Large storage nodes have this disadvantage: the time to recovery starts to become "scary".
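As a quick sanity check on that ~50 hour figure, here is the same arithmetic spelled out (using the assumed numbers from the paragraph above):

```python
# Back-of-the-envelope recovery time for one dense Ceph node,
# using the assumed figures quoted above.

capacity_gib = 768_000        # ~750 TiB expressed in GiB
fill_ratio = 0.70             # node is 70% full
recovery_gib_per_s = 3.0      # assumed sustained recovery throughput

data_to_recover = capacity_gib * fill_ratio        # 537,600 GiB
hours = data_to_recover / recovery_gib_per_s / 3600

print(f"{data_to_recover:,.0f} GiB to recover -> about {hours:.1f} hours")
# -> 537,600 GiB to recover -> about 49.8 hours
```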


> From @ness1602:
> Big deployments, in my case, are Ceph setups > 100 TB, I think. We usually start with 10G, but are now moving to 40G, and probably next year to 100G, because the equipment is finally affordable. For Corosync, 1 Gb is enough, and everything is redundant through bonding.
> Stability is okay. I don't usually go for more than 20 nodes; I tend to break clusters up, just in case.

Note that having multiple independent connections for Corosync is needed for production environments. Even with bonding (LACP), the time it takes a LAG to re-establish after a link failure may be too long for Corosync.

Plus, there are situations where two switches (stacked switches, for example) can take a while to fail over and may disrupt traffic, which can cause Corosync to consider that node failed. The same can happen if there is ever an issue with MLAG communication for whatever reason (switch software bugs, bad ports, bad cables, etc.).

I speak from experience: it is highly advisable to use both a 1 Gbps and a 10/25/100 Gbps connection on each server as two Corosync links.

One last situation that can arise is saturation of a port's or LAG's bandwidth. If there is enough packet loss because the port is saturated with traffic, Corosync packets won't be forwarded in time and the link will be considered failed. This can easily happen on a 1 Gbps port (or even a 2x 1 Gbps LAG). Some documentation suggests dedicating ports to Corosync, but that isn't always practical.

20 nodes is fine. The reasoning behind this is that Corosync is a single master that replicates to all secondaries, and this master is round-robin'd between all nodes in the cluster. The more nodes, the longer this takes, and the more potential delays can be incurred if some nodes have (relatively) long waits for tasks to finish while it is their turn to be master.


> - What about GlusterFS? Is it stable enough to serve large amounts of storage in an enterprise setup?

GlusterFS is essentially deprecated and has mostly been used as a file-server replacement, not as a VM storage platform. CephFS is Red Hat's answer, even though it is significantly more complex than GlusterFS since it requires Ceph.

Hope this helps!
 
> - Is there any well-known storage solution which supports the Proxmox ZFS over iSCSI storage backend?
The only commercially supported solution is TrueNAS, as far as I know. Roll-your-own can be done with comstar/iscsitgt, among others. See https://pve.proxmox.com/pve-docs/chapter-pvesm.html#storage_zfs for more detail.

> - Would it be feasible to use NFS over a high-speed network, like 100 GbE?
Probably. There's a NetApp guy on this forum who might chime in; I don't know many NAS providers that check all the boxes like ONTAP does.

> - What about GlusterFS? Is it stable enough to serve large amounts of storage in an enterprise setup?
No. Just no. ;) Even RHEL is sunsetting support for it in RHEV as of the end of the month. It has been superseded by Ceph for all intents and purposes.

> - Any other ideas?
As already touched on by many: "it depends(tm)".

1. Does the client desire/intend to keep/redeploy their existing storage infra? If yes, Blockbridge may do the trick.
2. If they are interested in scale-out (Ceph), do they have in-house or on-contract engineering available to support it? If not, is that desired/budgeted? You really don't want to go into any scale-out solution without expertise. Trust me, I have scars.
3. As mentioned above, NFS could be an option; support would be simplified.
 
> 20 nodes is fine. The reasoning behind this is that Corosync is a single master that replicates to all secondaries, and this master is round-robin'd between all nodes in the cluster. The more nodes, the longer this takes, and the more potential delays can be incurred if some nodes have (relatively) long waits for tasks to finish while it is their turn to be master.
Maybe I misunderstood you, but Corosync is not round-robin; it essentially still is one ring [0]. All messages have a sequence number so Corosync can identify dropped messages and request a resend. Each node will hand over the token (like in a relay race) to the next node, which will then process all received messages and then send its own messages.

That's why bigger clusters need some Corosync tuning to scale properly.

[0] https://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
[1] https://discourse.ubuntu.com/t/corosync-and-redundant-rings/11627 (old, corosync 2; just to give more context for corosync 3)
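To put the token-passing and tuning points above in concrete terms, here is a minimal Python sketch of how the effective token timeout grows with cluster size. The formula follows the corosync.conf(5) man page (runtime token = token + (nodes - 2) * token_coefficient for three or more nodes); the default values used below are assumptions taken from that man page and should be checked against your installed Corosync version and your actual totem settings:

```python
# Rough sketch: effective Corosync token timeout vs. cluster size.
# Assumed defaults (verify against corosync.conf(5) on your systems):
TOKEN_MS = 1000              # totem.token
TOKEN_COEFFICIENT_MS = 650   # totem.token_coefficient

def runtime_token_ms(nodes: int,
                     token: int = TOKEN_MS,
                     coefficient: int = TOKEN_COEFFICIENT_MS) -> int:
    """Effective token timeout once the nodelist contains 3 or more nodes."""
    if nodes < 3:
        return token
    return token + (nodes - 2) * coefficient

if __name__ == "__main__":
    for n in (3, 10, 20, 32):
        print(f"{n:>2} nodes -> runtime token timeout ~{runtime_token_ms(n)} ms")
```

Which is one way to see why a 20+ node cluster reacts noticeably more slowly to membership changes than a small one, and why people tend to split clusters rather than grow them indefinitely.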
 
> Maybe I misunderstood you, but Corosync is not round-robin; it essentially still is one ring [0]. All messages have a sequence number so Corosync can identify dropped messages and request a resend. Each node will hand over the token (like in a relay race) to the next node, which will then process all received messages and then send its own messages.
>
> That's why bigger clusters need some Corosync tuning to scale properly.
>
> [0] https://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
> [1] https://discourse.ubuntu.com/t/corosync-and-redundant-rings/11627 (old, corosync 2; just to give more context for corosync 3)

You are absolutely correct. My description was over-simplified. Hopefully, Datacenter Manager will make dealing with multiple clusters easier and can manage many smaller clusters as one large cluster.
 
What I would really like to see is a "Proxmox Storage Server", something like PBS but with the ability to export iSCSI shares for "ZFS over iSCSI" consumption by PVE!
That's a great idea, but it probably won't come before a release of Datacenter Manager, if at all.
 
