Glusterfs is still maintained. Please don't drop support!

Johannes S · Jun 2, 2026

alpha754293 said:
Who cares if it's old, if it works (vs. it not working)?

I do. Because running software which doesn't get security updates belongs in a museum, not in production, even in homelabs.

alpha754293 said:
2) The goal here is to demonstrate with data, from within a VM, that gluster is still viable from a technology standpoint.

This won't change anything for Proxmox VE though. GlusterFS support is deprecated and will be removed from QEMU, and thus also from Proxmox VE, due to the stalled development by Red Hat. Benchmarks (no matter on which version) won't change this; only convincing the QEMU developers that GlusterFS support should be kept could. Ideally that also means volunteering to maintain GlusterFS and its support in QEMU.

alpha754293 · Jun 2, 2026

guruevi said:
Seemed to have been a time-specific bug, because I currently do have at least one server with a MegaRAID SAS which is conveniently still in the Linux kernel: https://github.com/torvalds/linux/blob/master/drivers/scsi/megaraid/megaraid_sas.h - it works on modern Ubuntu kernels, you seem to be pointing to a very time-specific bug in the kernel around 6.8 which seems to have been resolved regarding JBOD mode? I actually have multiple servers running with Proxmox that have some form of MegaRAID controller in them (not my choice, conversion from VMware garbage).

:shrug: dunno if it is time specific.

(I mean, one would only be able to say this in retrospect. At the time, when it happened, there was little means to know whether it was going to be a permanent issue or one that would've been resolved as a function of time. Either way, the point still remains: at the time, I couldn't update on account of it.)

And now, whilst I could upgrade to PVE8, you'll invariably get others who'll ask the natural question "why not just upgrade to PVE9 anyways?" (and the answer to that question is that PVE9 brings with it other issues (cf. e1000 NIC issue)).

Thus, if said e1000 and 9361-8i work in PVE7, why break it?

guruevi said:
NVIDIA Ethernet is cheaper than NVIDIA InfiniBand fabric and NVIDIA is pretty much the most expensive solution out there today. Arista is cheaper and they're still not a 'cheap' option whilst Arista has even lower latency options. Talking datacenter networks here. We just purchased ~300 usable ports worth of NVIDIA 400G IB switches with optics - that's a $600k investment and we don't even have the annual management software license or the NIC-side (ConnectX 8) and NIC-side optics or cabling, all-in all, I'm estimating $1.2M over a 5 year period. There is no 400G Ethernet fabric that costs $4k/link, it's about half to a quarter of that cost depending on your switch gear. I think it's a waste of money, but the religion of IB is strong amongst some people.

Depends on the ethernet adapter.

Right now, you can buy the MCX515A-CCAT off of eBay for $125.92. Conversely, you can buy the IB version (MCX555-ECAT) off of eBay for $119.95.

Nvidia (read: Mellanox - and yes, I still call it Mellanox) is expensive because Mellanox has pretty much always been expensive. And it's only recently that other vendors are starting to come out with their own line of products, but in many cases, Mellanox still takes the crown. Myrinet tried. OmniPath tried. Mellanox/IB won.

In terms of what your company purchased - again, it depends.

You can pick up an Edgecore DCS510 AS9716-32D 32-Port 400GbE Bare Metal Switch with ONIE - Part ID: 9716-32D-O-AC-F-US from Colfax Direct, for example, for $16560 which would work out to $517.50 per port. Conversely, you could pick up a Mellanox Quantum-2 MQM9790 64-port Non-blocking Unmanaged NDR 400Gb/s InfiniBand Switch - Part ID: MQM9790-NS2F also from Colfax Direct for $31125 which works out to $486.33 per port.

Cabling varies depending on how far your runs are going to be and whether the ends are QSFP-DD and/or QSFP112.

In either case, as the data that I have presented shows, you can get IB stuff cheaper than ethernet. And that was still very much the case back in like 2019 when I bought my switch, because I think it was an 18-port 100 GbE switch that cost almost as much as my 36-port Mellanox 100 Gbps IB switch. I looked at it because the ConnectX-4 cards were VPI cards, which means I could set the port LINK_TYPE to either ETH or IB using mstconfig. So I could've gone either way, and IB was cheaper than ETH. (At least now, the ETH premium over IB isn't as outrageous as it used to be. It's only a 6.4% premium now. It used to be anywhere from 15-40% more for 100 GbE vs. 100 Gbps IB.)
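The per-port figures and the 6.4% premium quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the prices and port counts are the ones given in the post, the dictionary keys are shortened labels chosen here for illustration, and optics, cabling and licences are deliberately left out.

```python
# Prices and port counts exactly as quoted in the post above.
SWITCHES = {
    "Edgecore AS9716-32D (400GbE)": {"price_usd": 16560.00, "ports": 32},
    "Mellanox MQM9790 (NDR 400Gb/s IB)": {"price_usd": 31125.00, "ports": 64},
}


def cost_per_port(price_usd: float, ports: int) -> float:
    """Switch price divided evenly across its ports (no optics or cabling)."""
    return price_usd / ports


eth_per_port = cost_per_port(**SWITCHES["Edgecore AS9716-32D (400GbE)"])
ib_per_port = cost_per_port(**SWITCHES["Mellanox MQM9790 (NDR 400Gb/s IB)"])

# Premium of the Ethernet port price over the InfiniBand port price, in percent.
eth_premium_pct = (eth_per_port / ib_per_port - 1.0) * 100.0

print(f"400GbE: ${eth_per_port:.2f} per port")
print(f"NDR IB: ${ib_per_port:.2f} per port")
print(f"ETH premium over IB: {eth_premium_pct:.1f}%")
```

Swapping in other list prices only requires editing the dictionary; the premium is always computed relative to the IB port price.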

IB is great, if you know how to take advantage of it.

(I didn't buy 100 Gbps IB for HDD based storage. I bought it because the HPC apps that I was running at the time were able to regularly hit 80-90 Gbps out of the 100 Gbps possible for RDMA/MPI.)

The ability for me to offload storage traffic onto said 100 Gbps IB was really just a bonus at that point.

guruevi said:
My point was you can't push 100Gbps from a single spinning disk that gives at best 1-10Mbps of throughput (if not reading from cache).

Two things:
1) I agree that you can't push 100 Gbps from a single spinning disk. You might be able to hit 1 Gbps for sequential writes, where the HDD cache might have limited use/benefit. But that, again, wasn't the point of having 100 Gbps IB either. It was a fringe/leftover benefit from running HPC applications that use IB/RDMA/MPI.

2) This has very little to do with the performance difference between ceph (~5% of a drive's capability) vs. gluster (~22% of a drive's capability).

alpha754293 · Jun 2, 2026

Johannes S said:
I do. Because running software which doesn't get security updates belongs in a museum, not in production, even in homelabs.

Yes, and as CVE-2026-43134, CVE-2026-43284, and CVE-2026-43500 show, updates are great. [/s]

(i.e. if you didn't update your kernels, then you wouldn't have given yourself these LPE exploits that you otherwise, previously, didn't have.)

Same thing with CVE-2024-3094, where, again, if you didn't update, then you wouldn't have given yourself this backdoor.

Your statement argues that updated software is more secure and yet, these are just four of the more recent CVEs where the CVSS is 7.8, 7.8, 7.8, and 10.0 respectively.

If you didn't update, then you might not have "invited" these issues into your production systems, homelab or otherwise.

Who knows how many more others there are where it was an update that gave or exposed the system to issues, where, if you didn't update, your system would've been fine.

Johannes S said:
Ideally also volunteering for maintaining glusterfs and its support in qemu

I don't program; therefore, any programming that I would do to try and help/contribute would be done entirely via vibe coding. And we've already seen what that's done to Nvidia's own GPU drivers.

If anything, my vibe-coding to "help" maintain gluster and/or qemu is a sure-fire way to kill off any remnants of gluster and/or qemu, in the same way that Nvidia's vibe-coding of their own GPU drivers is a sure-fire way to kill off their own drivers.

Perhaps this is the real reason why you suggested it (so that I would be sure to kill it off, for good, by vibe-coding it, quite literally, to its own death).

(So far, no one has been able to answer my question "if a program is stable, then why does it matter that there aren't as many commits happening?".)

If that's the metric that qemu is using as their rationale for dropping support for gluster, then by that logic, bad code that constantly needs to get fixed would win this battle/race, because the number of commits per month for crappy code would be astronomical: you're always trying to fix something that's fundamentally and critically broken.

In terms of commits per month, it'd be a winner according to that metric, and if that's what qemu devs use to determine what will be supported and what won't, then bad code, by this commits per month metric, would get more adoption than good code that doesn't need perpetual fixes just to get it to work properly in the first place.

And if that is the logic/metric that they're using, maybe I should vibe code glusterfs back to being supported by qemu, because I can commit each piece of garbage output that AI generates and thus inflate the number of commits per month with AI/vibe coded slop, just to send the number of commits per month through the roof.

Something tells me that I can probably automate that with n8n.

(Sidebar: responses, but still no technical discussion about the fact that gluster is 4.4x faster than ceph (in terms of % of drive capabilities used). Interesting.)

guruevi · Jun 2, 2026

There is a gluster-like replacement and there are others that regularly pop up that run entirely in userspace. Gluster being unstable as per my prior comment is just personal experience. If you do hammer it in real production, it is very likely to eat your data. Almost everything about the disperse volumes is highly unstable code and will bite you in the long run, and 3-way replicated volumes are no better than Ceph when it comes to data usage.

Once you get to thousands of file objects, every file/directory operation triggers network round-trips to traverse bricks for DHT lookups; negative lookups in particular (e.g. new files) become a bottleneck.

It was once a great solution, I've used it about 15 years ago in production, but Ceph was in every way better. If you want a similar-to-Gluster system: LUSTRE is still around. MooseFS is a thing (it only has a 32-bit pointer, so not very useful if you have PBs worth of lots of small files)

Johannes S · Jun 2, 2026

alpha754293 said:
If you didn't update, then you might not have "invited" these issues into your production systems, homelab or otherwise.

This is a logical fallacy. Just because the security issues you linked are not present in an older kernel doesn't mean that your old kernel doesn't have any security issues. In fact it definitely will have known (and unknown) security issues which are already fixed in newer kernels. This will never change, since it's quite hard, if not outright impossible, to prove that a given piece of software doesn't have any security issue. (For folks who don't mind math: Google for the Turing halting problem and computability theory. This is the same reason why antivirus tools are rather limited in their use cases and the kinds of threats they are able to detect.)

Regarding your "do you want me to vibecode the gluster support": You are missing my point. Since the development funded by Red Hat isn't being done any more, somebody else needs to do or fund it, since the QEMU developers deprecated support for it. No amount of benchmarking or bickering here will change this.

guruevi · Jun 3, 2026

If you want a technical discussion, you need to do it on technical terms. Your benchmarks (like most benchmarks) are invalid because you're not benchmarking the 'same thing'. Moreover, you're benchmarking one thing (Windows?) and, if you know anything about storage, Windows NTFS always tells the subsystem it is doing an async write (yeah, NTFS is bad for your data), whereas with Ceph + QEMU any write is sync.

You can find some benchmarks from 2013, when Ceph was really young, that already showed Ceph winning various major points against Gluster: IOPS and latency, but not necessarily throughput. And Ceph has improved remarkably in 13 years while Gluster kind of stagnated; by 2020 Ceph was a consistent winner in most benchmarks. But those are synthetic benchmarks, often run in real-world environments, not in a homelab. On ancient hardware, there are all kinds of reasons modern code won't do well. On consumer hardware it's even worse. Consumer hardware has a tendency to lie about data consistency as well and (especially SSDs) will cache writes 'by default'; Ceph (like ZFS) tends to avoid those problems even on consumer hardware.

So this is a consistent trope on this forum: that ZFS and Ceph are 'slower' on consumer hardware than, let's call it, "naive storage", because "naive" benchmarks show a big (often unrealistic) gap. You see the opposite on real server hardware though. But Gluster is really built for hardware RAID with BBU, as an open source alternative to GPFS and other proprietary SAN systems, whereas Ceph (and ZFS) were literally invented to manage disks directly on cheap hardware, without a mediator, because around the early 2000s professionals started noticing from real life disaster stories that proprietary/hardware/software RAID, even with BBU, cannot be trusted, does not provide the data guarantees and does not scale. I can tell you, Gluster won't scale past ~8 nodes and it will choke during rebuilds at today's scales (several TB). Your data won't be consistent and available at all times in real world scenarios when things go wrong, and being offline while a brick gets rebuilt is not great for business.

And again, nothing wrong with trying stuff out and experimenting. But here is the big thing: if you don't care about your data, then it doesn't matter what you use. If you care slightly about your data, but you don't need consistency of your data at all times between multiple nodes, then yeah, a replicated Gluster can do that, but most likely ZFS replicated once every minute or even every hour to another node will probably give you better guarantees about your data than Gluster.

alpha754293 · Jun 3, 2026

Win11 VM running on Ceph finished the updates sometime between last night and this morning.

I got home from work and rebooted the VM so that said Win11 updates can go into effect.

You can see, from the screenshot below, when I started the reboot and how, almost an hour later, it's only 23% done through said Win11 updates (first round, post fresh Win11 VM install that was kicked off last night).

Yes, this is how slow ceph is.

guruevi · Jun 3, 2026

You’re running Windows 11 on a CPU from the Windows 7 era. That proves absolutely nothing other than that you can get hardware to do insane things. I can show you a Ceph cluster that does Windows 11 updates in less than 5 minutes on just 10G backbone with SAS drives.

Look at your storage pressure stat, what does it say?

alexskysilk · Jun 3, 2026

I read through the chain, and I don't really understand what everyone is arguing about.

alpha754293 said:
Yes, this is how slow ceph is.

Yes. ceph requires a minimum configuration (network topology, OSD count, etc.) before it's performant.

Who cares if it's old, if it works (vs. it not working)?

This is the wrong question to ask. Whether someone else cares or not isn't particularly relevant, despite certain forum members forcefully expressing their opinion. It isn't that they're wrong; it's whether you are comfortable with operating a legacy environment without security or bug fixes. It's your headache to deal with. There was/is nothing inherently wrong with PVE7 or gluster; their devs simply moved on due to various drivers.

and onto the main event:

It is, for this reason, why a lot of companies don't deploy the bleeding edge technology for mission critical, production systems, because newer can break stuff and cause stuff to stop working which would be a huge problem, for said business.

I don't really know what you're referencing in terms of information or statistics. COMPANIES that deploy SOLUTIONS for their specific production use cases deploy supportable solutions. They have uptime requirements and security liabilities. As long as the solution meets MAC (minimum acceptance criteria) most CTOs are not going to care that much about the technology, although there are those who do. The flip side to your argument is that a company will not deploy unsupportable hardware or software as it will not meet their business insurance criteria.

With regards to IB: it's a fantastic low latency transport, and is used in many use cases where latency is king. Unfortunately it's also missing a TON of functionality used by more typical use cases (no layer 2 functionality to speak of), which limits its utility for PVE, for example. Nevertheless, it can be used as long as you're comfortable operating SMs in production and don't attempt to use them for vmbrs. I decommissioned all IB from my deployments when 100GbE became nearly as performant but massively cheaper, and with switches that can actually do stuff. I get that you already own the hardware, so the price isn't a factor for you; the real question you should be asking is: what are you actually after?

If you're insistent on using this for whatever purpose, use it. No one will stop you. If you intend to operate it as an enterprise, you've been warned.

alpha754293 · Jun 3, 2026

guruevi said:
There is a gluster-like replacement and there are others that regularly pop up that run entirely in userspace. Gluster being unstable as per my prior comment is just personal experience. If you do hammer it in real production, it is very likely to eat your data. Almost everything about the disperse volumes is highly unstable code and will bite you in the long run, and 3-way replicated volumes are no better than Ceph when it comes to data usage.

Once you get to thousands of file objects, every file/directory operation triggers network round-trips to traverse bricks for DHT lookups; negative lookups in particular (e.g. new files) become a bottleneck.

It was once a great solution, I've used it about 15 years ago in production

I haven't been able to find a list of glusterfs version history with its corresponding release date, but the idea of trying to find that was to see what version of glusterfs you were running, 15 years ago.

I'm probably going to spend more time playing with it, because my current ceph cluster running on the mini PC will eventually kill said 2242 M.2 NVMe SSDs. (It's already at 33% wearout, so it's only a question of when rather than a question of if.)

There's only one way to find out how well (or how poorly) it performs - try it.

(But again, the current Win11 VM that's still rebooting, this is just one running on a brand new cluster that I set up last night.)

guruevi said:
Once you get to thousands of file objects, every file/directory operation triggers network round-trips to traverse bricks for DHT lookups; negative lookups in particular (e.g. new files) become a bottleneck.

I would have to imagine that's probably no better than erasure coded ceph, where it has to read all of the blocks in, especially for a read-modify-write. (The video from 45Drives talks about other scenarios where performance is a problem with erasure coded ceph and, given the results that I am sharing live (because it takes ceph soooo long to run), the erasure coded ceph performance issues are very real; enterprise grade U.2 NVMe and/or E1.S EDSFF NVMe SSDs are only masking the fundamental fact that ceph has some very real erasure coded performance issues.)

I'll have to throw several thousand files at my crappy gluster cluster and see what happens.

Like if you had a U.2 NVMe or E1.S EDSFF NVMe 5.0 x4 SSD that's capable of 12 GB/s sequential reads, if you were able to get 22% of that drive's performance capability, then instead of getting ~600 MB/s from each drive, you'd get 2.64 GB/s.
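The arithmetic behind those two numbers is simple enough to script. A minimal sketch, assuming the ~5% (ceph EC) and ~22% (gluster) utilisation figures claimed earlier in this thread and the 12 GB/s rated sequential read from the sentence above; the function and constant names are made up for illustration.

```python
def throughput_at_utilisation(rated_mb_s: float, utilisation: float) -> float:
    """Throughput actually delivered if only `utilisation` (0..1) of the
    drive's rated throughput is reached."""
    return rated_mb_s * utilisation


RATED_NVME_MB_S = 12_000      # 12 GB/s sequential read, as stated above
CEPH_EC_UTILISATION = 0.05    # ~5%, figure claimed in this thread
GLUSTER_UTILISATION = 0.22    # ~22%, figure claimed in this thread

ceph_mb_s = throughput_at_utilisation(RATED_NVME_MB_S, CEPH_EC_UTILISATION)
gluster_mb_s = throughput_at_utilisation(RATED_NVME_MB_S, GLUSTER_UTILISATION)
ratio = gluster_mb_s / ceph_mb_s

print(f"ceph EC at  5%: {ceph_mb_s:.0f} MB/s")
print(f"gluster at 22%: {gluster_mb_s / 1000:.2f} GB/s")
print(f"ratio: {ratio:.1f}x")
```

The ratio does not depend on the drive: plugging in a 150 MB/s HDD instead of the NVMe drive changes the absolute numbers but leaves the 4.4x unchanged.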

guruevi said:
If you want a similar-to-Gluster system: LUSTRE is still around. MooseFS is a thing (it only has a 32-bit pointer, so not very useful if you have PBs worth of lots of small files)

Yeah, I looked briefly at Lustre, but I haven't figured out how to deploy it yet. (But AI can help with that.)

alpha754293 · Jun 3, 2026

Johannes S said:
This is a logical fallacy. Just because the security issues you linked are not present in an older kernel doesn't mean that your old kernel doesn't have any security issues. In fact it definitely will have known (and unknown) security issues which are already fixed in newer kernels.

No. You can literally run the PoC scripts for the aforementioned CVEs to test whether those exploits/vulnerabilities exist pre-update.

I mean, you can literally test this with a fresh install of PVE7, 8, and 9, and test it.

It's not that hard.

Again, if this was really an issue, banks, governments, and the military would be the first targets to be hit. (Remember, until 2019, the US SACCS was still using 8 inch floppy disks.)

They can't risk their stuff/systems breaking as a result of an update and, therefore, if what you are saying is true, then they'd be the most vulnerable systems in the world, pursuant to your claim.

And yet, clearly, it isn't like they're having said LPE exploits taken advantage of regularly and normally.

(Or like any/all of the rest of the systems in the world that run Linux - if what you're saying is true, it'd be open season on Linux systems/servers wherever they're found, throughout the world, with all of the high CVSS CVEs.)

Or alternatively, if Linux updates don't break stuff, then you can just apply your updates to a PROD system without ever testing them.

Cuz you know - Linux updates never break stuff. [/s]

(There's also a reason why those very same companies and institutions don't run the bleeding edge stuff and wait until it's stable before deploying it. Stable.)

Johannes S said:
Regarding your "do you want me to vibecode the gluster support": You are missing my point. Since the development funded by Red Hat isn't being done any more, somebody else needs to do or fund it, since the QEMU developers deprecated support for it. No amount of benchmarking or bickering here will change this.

No, I didn't miss the point.

In fact, I talked about both points:

1) I can vibe code it to destroy it (due to the quality of the code that vibe coding often generates) and

2) I can use vibe coding to boost the number of commits per month, on the assumption that's the metric that the qemu devs are using to determine whether a project is viable for them to support (anymore).

Covered both angles.

t.lamprecht · Jun 3, 2026

IMO also a bit of a strange discussion, so only two points from my side:

1. Glusterfs really is not KISS. It might be relatively simple to set up for some use cases, especially testing ones, but that's where the simple side of it stops. If you find it being EOL and out of security support a feature, then just run it, no need to convince anybody here. You can also run simple LVM-Thin or the like and do frequent PBS backups; that will be more performant and less hassle when fixing anything, without the illusion of live redundancy.
2. Almost all kernel LPEs we fixed affected all supported kernels, from 6.8 to 7.0, and we released fixes for all of them. And most affected way older kernels too, we just didn't support any of those. But sure, with that argumentation I'd recommend going back to Linux 0.1 or so; with the dramatically reduced feature surface it might actually be more secure.

Again, no need to convince me, and at least some of those arguments have rather the opposite effect.

Bonus third point: qemu devs don't care at all about commit counts; that's not the actual metric, just a correlation. And with them, vibe coding definitely will have the opposite effect of what you want to achieve.

guruevi · Jun 3, 2026

The problem with Gluster is that a negative lookup has to (worst case scenario) traverse every brick in the cluster and do a DHT search. Gluster metadata operations, particularly for negative lookups and in rebalancing scenarios, can become increasingly expensive as the number of bricks grows because clients may need additional DHT lookups and validation operations. Basically your traffic grows exponentially as the number of bricks in the system grows linearly (not exactly, there are some mitigating factors in the design such as extensive read caches)
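To make the negative-lookup cost concrete, here is a toy model in Python. It is not Gluster's DHT code: it only assumes, as described above, that a name hashes to one "home" brick and that a miss there can, in the worst case, force a query to every other brick. The hash choice, the function names and the example file names are all hypothetical, and caches and other mitigations are ignored.

```python
import hashlib


def hashed_brick(name: str, n_bricks: int) -> int:
    """Map a file name to one brick index with a stable hash (toy stand-in)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_bricks


def lookup_cost(name: str, bricks: list) -> tuple:
    """Return (found, bricks_queried) for a single lookup of `name`."""
    home = hashed_brick(name, len(bricks))
    if name in bricks[home]:
        return True, 1              # positive lookup: one round-trip
    queried = 1
    for index, brick in enumerate(bricks):
        if index == home:
            continue                # already asked the home brick
        queried += 1
        if name in brick:
            return True, queried    # file lives on an unexpected brick
    return False, queried           # negative lookup: every brick was asked


bricks = [set() for _ in range(8)]
existing = "vm-100-disk-0.qcow2"    # hypothetical file name
bricks[hashed_brick(existing, len(bricks))].add(existing)

print(lookup_cost(existing, bricks))          # hit on the home brick
print(lookup_cost("brand-new-file", bricks))  # worst-case miss
```

In this model a hit costs one query regardless of cluster size, while a worst-case miss costs one query per brick, which is why creating many new files gets more expensive as bricks are added.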

Ceph's erasure-coded pools typically incur higher write amplification and network overhead because data and parity are distributed across multiple OSDs. A 6+2 profile has 8 participating nodes vs. 3 for a 3-way replicated pool, increasing latency and reducing small-write performance. However, for the larger objects that you would see in modern workloads, this difference is minimal. Hence we store boot disks on 3-way replicated pools and large data disks on 4+2 erasure-coded pools. By default, Ceph will rely less on things like filesystem caches and wants the client to handle that.
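The trade-off between the profiles mentioned above comes down to two numbers per pool type: how many OSDs hold a piece of each object, and how much raw capacity one unit of usable data costs. A small sketch of that arithmetic follows; the function names and dictionary keys are made up for illustration and are not Ceph API names.

```python
def erasure_coded(k: int, m: int) -> dict:
    """k data chunks plus m coding chunks, each stored on a different OSD."""
    return {"osds_per_object": k + m, "raw_per_usable": (k + m) / k}


def replicated(copies: int) -> dict:
    """Every object is stored in full on `copies` different OSDs."""
    return {"osds_per_object": copies, "raw_per_usable": float(copies)}


profiles = {
    "EC 6+2": erasure_coded(6, 2),
    "EC 4+2": erasure_coded(4, 2),
    "3-way replica": replicated(3),
}

for name, p in profiles.items():
    print(f"{name}: {p['osds_per_object']} OSDs per object, "
          f"{p['raw_per_usable']:.2f}x raw capacity per usable byte")
```

This is where the space argument for EC comes from: 6+2 needs about 1.33x raw capacity and 4+2 needs 1.5x, against 3x for 3-way replication, at the price of touching more OSDs per object.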

Gluster ultimately depends on the underlying filesystem and hardware stack for durability and caching guarantees. The result is that Gluster can appear simpler and more performant under the conditions it was originally built for (gigabit networks, spinning disks), but Ceph provides strong guarantees around data integrity, failure recovery, and behavior at large scale.

Ceph is faster than Gluster today because it avoids many of the metadata and filesystem overheads inherent in Gluster's design. Modern Ceph uses BlueStore and has modern CPU optimizations, and it can scale because it uses an algorithm to determine object placement without expensive metadata lookups. Recovery is object-based and deterministic, while Gluster relies heavily on background healing and filesystem operations.

On modern NVMe and 25G+ networks, the CPU cost of checksums and erasure coding is negligible, and Ceph today will outperform Gluster even in relatively small clusters while providing stronger integrity and recovery guarantees. Gluster wasn't built to perform well on these systems; it probably isn't even compiled or written to use modern CPU optimizations, because it hasn't seen major bug fixes in about a decade.

Comparing GlusterFS and Ceph on 15yo hardware is like running a VW Beetle, then transferring that engine into a modern car and comparing how fast they go. Sure, the Beetle will beat the new car in a speed race, but it will kill its occupants in a crash and be overall much less safe on the road today, and the Beetle still can't compare to the performance of your average modern car.

alpha754293 · Jun 3, 2026

guruevi said:
If you want a technical discussion, you need to do it on technical terms.

I have.

In fact, the data started flowing with my second comment on this thread. You're only just getting around to this now.

guruevi said:
Moreover, you're benchmarking one thing (Windows?) and, if you know anything about storage, Windows NTFS always tells the subsystem it is doing an async write (yeah, NTFS is bad for your data), whereas with Ceph + QEMU any write is sync.

I'd be able to run more tests if ceph was faster. (Heck, as of this writing, it finally just finished rebooting from the first round of Win11 post-install updates.)

The drives themselves are capable of 100 MB/s each. Clearly, ceph isn't using anything close to this capability.

guruevi said:
You can find some benchmarks from 2013, when Ceph was really young, that already showed Ceph winning various major points against Gluster: IOPS and latency, but not necessarily throughput.

Are you talking about ceph replication or erasure coding? (Again, when people talk about ceph, most of them talk about ceph in the context of replication, and not in the context of erasure coding.)

guruevi said:
On ancient hardware, there are all kinds of reasons modern code won't do well. On consumer hardware it's even worse. Consumer hardware has a tendency to lie about data consistency as well and (especially SSDs) will cache writes 'by default'; Ceph (like ZFS) tends to avoid those problems even on consumer hardware.

Again, as I've said, you can literally run your own calculations as to how much of the drive you're actually using for any system/drives that you have access to where you're able to run your own benchmarks.

You don't have to take my word for it. Deploy your own EC(k,m) ceph pool and then you can test it yourself and again, this isn't rocket science. You can literally run the calculation yourself and see for yourself.

I'd love to see your results from your own tests on your own hardware on your own EC(k,m) ceph pool.

Again, a lot of words expended, but still nothing that actually talks to nor speaks about the fact that a ceph EC pool only uses about 5% of a drive's capabilities.

Again, if you watch the video from 45Drives, the best that they're able to get is about 8% of the drive's capability and they have access to newer hardware and even then, it can still only muster just 8% of the drive's performance capability.

Again, run your own tests and then calculate how much of the drive's capability your own ceph EC pool uses.

It's super easy.

guruevi said:
So this is a consistent trope on this forum: that ZFS and Ceph are 'slower' on consumer hardware than, let's call it, "naive storage", because "naive" benchmarks show a big (often unrealistic) gap. You see the opposite on real server hardware though. But Gluster is really built for hardware RAID with BBU, as an open source alternative to GPFS and other proprietary SAN systems, whereas Ceph (and ZFS) were literally invented to manage disks directly on cheap hardware, without a mediator, because around the early 2000s professionals started noticing from real life disaster stories that proprietary/hardware/software RAID, even with BBU, cannot be trusted, does not provide the data guarantees and does not scale. I can tell you, Gluster won't scale past ~8 nodes and it will choke during rebuilds at today's scales (several TB). Your data won't be consistent and available at all times in real world scenarios when things go wrong, and being offline while a brick gets rebuilt is not great for business.

Yes and no.

1) You're basing your opinion on your experience from, as you said, the 2000s-ish timeframe, right? (If it was 15 years ago, then it would be ca. 2011.) So that raises the question: have you ever tried it since then, or is your opinion still based on 15-year-old data?

(That'd be like if you were to base your opinion of ceph on 15-year-old ceph. But that's not how you're getting your opinions about ceph. In other words, you're comparing ceph now vs. gluster 15 years ago, right?)

2) a) ceph EC has a synchronisation overhead which is no different than MPI overhead as the number of processes increases. You can literally look up any HPC MPI scalability plot as a function of the number of CPU cores and you will find that going from 4096 cores to 8192 cores won't double your performance or cut your total wall clock run time in half.

b) Thus, one way that you can "mitigate" this synchronisation overhead when you have a lot of nodes and/or OSDs is to limit the number of nodes/OSDs across which you're trying to synchronise/maintain concurrency. Hybrid OpenMP/MPI solved this (for LS-DYNA) at least a decade ago.

This is no different. Therefore, if you have a ceph OSD that's comprised of a ZFS pool, then it's the same thing that gluster recommends for a deployment, except that gluster figured this out whenever gluster first published gluster on ZFS.

3) You state that gluster is having performance issues ca. 15 years ago. ceph, especially EC ceph pool, is having performance issues now.

5% of a HDD that's capable of 150 MB/s sequential read = 7.5 MB/s.

5% of a U.2 or E1.S EDSFF NVMe 5.0 x4 SSD that's capable of 12 GB/s is only 600 MB/s. And whilst yes, 600 MB/s is faster than 7.5 MB/s, in both cases, you're still only using 5% of what the respective drive is capable of.

Buying a U.2 or E1.S EDSFF NVMe 5.0 x4 SSD is just throwing money to mask the fact that an EC ceph pool is only using 5% of the drive and the order of magnitude that you're paying more for said U.2 and/or E1.S EDSFF NVMe 5.0 x4 SSD would only barely move the needle from ~5% utilisation to ~8% utilisation (but it costs more than an order of magnitude more).

It's still only 5-8%.

(to be continued...time for me to go put the kids to bed)

guruevi · Jun 3, 2026

There is not a single spinning hard drive in the world that can sustain a random pattern uncached 100MB/s. That alone tells me something is wrong with your benchmark. 8-10MB/s for uncached data 4K read/writes on spinning rust sounds more accurate, true random read/write 4K would top out somewhere around 0.5MB/s. You need to do something like 1-4MB chunks sequential to get to 100MB/s on 7200RPM.
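A quick way to sanity-check those numbers is to convert throughput into the number of I/O operations per second it implies at a given I/O size. The sketch below does that, plus the average rotational latency of a 7200 RPM platter (half a revolution). Seek time is ignored, MB is taken as 10^6 bytes and KiB/MiB as powers of two; the function names are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def iops_needed(throughput_mb_s: float, io_size_bytes: int) -> float:
    """I/O operations per second required to sustain the given throughput."""
    return throughput_mb_s * 1_000_000 / io_size_bytes


def avg_rotational_latency_ms(rpm: int) -> float:
    """Average wait for the sector to come around: half a revolution."""
    return 60_000.0 / rpm / 2.0


for label, size in (("4 KiB", 4 * KIB), ("1 MiB", MIB), ("4 MiB", 4 * MIB)):
    print(f"100 MB/s at {label}: {iops_needed(100, size):,.0f} IOPS")

print(f"0.5 MB/s at 4 KiB: {iops_needed(0.5, 4 * KIB):.0f} IOPS")
print(f"7200 RPM average rotational latency: "
      f"{avg_rotational_latency_ms(7200):.2f} ms")
```

At 4 KiB, 100 MB/s would need roughly 24,000 random operations per second, while an average rotational latency of about 4.17 ms alone caps a 7200 RPM disk at a few hundred random operations per second before seek time is even counted. At 1-4 MiB per operation the same 100 MB/s needs fewer than 100 operations per second.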

You can just set the virtual disk to write back (unsafe) and get your Ceph to probably be much faster and get you similar data reliability guarantees.

EC pools are not about throughput, they’re about space optimization. There is always a trade off. NVMe is much better than spinning disks for IOPS, but I don’t get where you get 5% usage. There are various tradeoffs but if your network fabric is 100GbE, you will never be able to push more than 100Gb whereas a single NVMe today can push multiples of that, the benefit of buying a better NVMe is lower latency, potentially better write endurance. But this is regardless of the system you end up using, there is no system that can break physics, Ceph is probably close to 80-90% of theoretical performance in real world scenarios. I have synthetic benchmarks for my EC pool to 1.15Tbps cluster-wide which is near 98% of line speed for this particular cluster. Sure in theory my SSD can provide close to 120Tbps in aggregate, but I don’t have the network backbone for that, that doesn’t mean the SSD were a waste, spinning disks are slower, consume more power, take up more space. Cheaper SSD are also slower, consume more power etc, the difference between 100ns latency and 2 millisecond latency is profound. The overhead of SAS/SATA vs NVMe is noticeable. Being able to sustain 50k IOPS under load vs a consumer SSD crashing down to 500 IOPS after the cache is full, those are all issues I think about.

Ceph EC has no exponentially increasing synchronization overhead like MPI, the CRUSH algorithm will only select n+k nodes for each block. It doesn’t have to synchronize to all nodes because the client knows exactly where the data is located based on its object id and the map.
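To illustrate the idea of computing placement from the object id instead of asking every node, here is a toy sketch. It is not CRUSH: it swaps in plain rendezvous (highest-random-weight) hashing, ignores failure domains and weights, and the node names, object id and parameter names are all hypothetical.

```python
import hashlib

def place_object(object_id, nodes, data_chunks, coding_chunks):
    """Deterministically pick data_chunks + coding_chunks distinct nodes for an object.

    Toy rendezvous hashing, NOT Ceph's CRUSH: every client that knows the same
    node list computes the same answer without contacting any node.
    """
    width = data_chunks + coding_chunks
    if width > len(nodes):
        raise ValueError("not enough nodes for the requested EC profile")

    def score(node):
        digest = hashlib.sha256(f"{object_id}:{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Highest scores win; only these nodes are involved in reads/writes for this object.
    return sorted(nodes, key=score, reverse=True)[:width]

nodes = [f"node{i:02d}" for i in range(25)]  # hypothetical 25-node cluster
placement = place_object("example-object-0001", nodes, data_chunks=4, coding_chunks=2)
print(placement)
```

However many nodes are in the list, each object only ever touches the 6 selected ones, which is the point being made about not having to synchronise across the whole cluster.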

As to whether I should test Gluster again, I ask why. We know the limitations, it won’t scale to the 25+ nodes and hundreds of bricks I have. Red Hat dropped it now 5 years? ago after it had already been on life support for several years at that point, the last proper Gluster development was the RHEL7 era. Nobody supports it, there are known bugs that will eat your data (see GitHub), it doesn’t perform well on modern hardware and is outperformed by Ceph.

alpha754293 · Jun 3, 2026

The Win11 VM finally finished both its first and second rounds of post-install updates at around 4:27 AM this morning. Here are the final results with CrystalDiskMark running inside said Win11 VM:

The Win11 VM running on ceph is on the left and the Win11 VM running on gluster is on the right.

The results speak for themselves.

People, invariably, are going to complain about the hardware, blah blah blah. But these tests were performed on the same system so the only real difference is ceph vs gluster.

Note that the same people who are complaining about said hardware never provide numbers/data of their own, based on whatever methodology that they're also, likewise, complaining about.

alpha754293 · Jun 3, 2026

guruevi said:
If you want a technical discussion, you need to do it on technical terms.

Where's your data from the tests that you, yourself, have run recently?

It's really interesting to read someone talking about having a technical discussion and then providing no actual data as they lecture others on how to have a technical discussion.

guruevi said:
Your benchmarks (like most benchmarks) are invalid because you're not benchmarking the 'same thing'. Moreover, you're benchmarking 1 thing (Windows?) which, if you know anything about storage - Windows NTFS always tells the subsystem it is doing an async write (yeah, NTFS is bad for your data), whereas with Ceph + QEMU any write is sync.

I don't think that CrystalDiskMark cares what NTFS is doing.

I mean, even if NTFS is lying to you, this is like-for-like: I'm running the same CrystalDiskMark test in the Win11 VM that's running on gluster as I am in the Win11 VM that's running on ceph. ceph is literally reporting a sequential write speed of 9.21 MB/s max, so even if NTFS was lying to me, that's still ~1/9th of the sequential write speed that CrystalDiskMark reports where I'm using gluster.

Or, alternatively, what you're actually saying is that ceph is even slower than that, in reality (because NTFS lies).

That's what you're really saying/admitting to.

(Again, it is interesting that you complain about the benchmarking/testing methodology, but then don't provide any test data from the way that you would run the same test between ceph and gluster, with a Win11 VM client.)

Or TL;DR:
complain complain complain. no data.

This message brought to you from the same person who states, at the beginning of their response:

guruevi said:
If you want a technical discussion, you need to do it on technical terms.

alpha754293 · Jun 3, 2026

guruevi said:
And again, nothing wrong with trying stuff out and experimenting. But here is the big thing: if you don't care about your data, then it doesn't matter what you use. If you care slightly about your data, but you don't need consistency of your data at all times between multiple nodes, then yeah, a replicated Gluster can do that, but most likely ZFS replicated once every minute or even every hour to another node will probably give you better guarantees about your data than Gluster.

TL;DR:
complain complain complain. no data.

guruevi said:
You’re running Windows 11 on a CPU from the Windows 7 era. That proves absolutely nothing other than that you can get hardware to do insane things. I can show you a Ceph cluster that does Windows 11 updates in less than 5 minutes on just 10G backbone with SAS drives.

Look at your storage pressure stat, what does it say?

You're proving this point of mine for me;
"Therefore; given that people were going to complain about the hardware anyways, it didn't really matter what I tested with, because regardless of what I used, people were going to complain about it anyways. So I used what made it easy for me to run these tests."

Again, this message brought to you from the very same person who started one of your replies with:
"If you want a technical discussion, you need to do it on technical terms."

lol....lmao....

ARE you going to provide me with newer hardware so that I can re-run the tests? No? Then you're just complaining about it for the sake of complaining about it rather than complaining about it and then actually doing something about it.

Heck, even if you don't put up the hardware, then you can run these tests yourself, the way how you think that you should run it, so long as you document exactly what and how you did it, so that other people can repeat it, which is precisely what I've done here.

I just literally installed it, configured it, and then installed a Win11 VM on top of that. No other non-default settings were used, and I ran with it.

Anybody, including yourself, can run and repeat this test.

And if you don't like how the test was conducted (since you complained about the methodology as well), then fine, you run it how you see fit and let's see your data.

Between the two of us, I'm the only one here that's providing any data of any sort.

(says the person who stated: "If you want a technical discussion, you need to do it on technical terms.", but then doesn't provide any data for said discussion.)

Ceph storage pressure stats (when running CrystalDiskMark);

Gluster storage pressure stats, when running CrystalDiskMark: (I stopped the run in ceph before running it again in gluster)

Again, the results clearly speak for themselves.

Also again, you asking this question is something that you can very easily test yourself.

(Like if you're going to complain about what I'm running my tests on my 10+ year old hardware, then you're free and more than welcome to run this yourself, on your presumably, newer hardware, (newer CPU, newer/faster RAM, newer/faster SSDs, etc.).)

Are you just going to complain or are you actually going to do something about it (and then provide your data)?

Cuz so far, the only person who's providing any data (beyond what @kayson originally provided) is me.

alpha754293 · Jun 3, 2026

alexskysilk said:
Yes. ceph requires a minimum configuration (network topology, OSD count, etc) before its performant.

I'd argue the opposite:

If ceph is performant with old, cheap, crappy hardware, then throwing new hardware at it would just make it run better, faster.

(I mean, this is why I am running an entirely virtualised cluster: that way, my GbE won't be the limiting factor for ceph to work the best way that it can. You'd think that using the virtio-nic should mean that, as far as networking is concerned, it will be able to run pretty much as fast as said virtio-nic will allow and won't have the PHY/hardware/physical limitations that a PHY link would have.)

And even with this, it still can't muster up more than 9.21 MB/s in sequential write on the same underlying hardware where I'm running my gluster tests.

CPU is the same between both tests, as is the RAM, motherboard, same Proxmox host boot SSD, same HGST 1 TB SATA 3 Gbps HDDs across the board, same everything basically. And the only difference is one virtualised 3-node cluster is running ceph and the other virtualised 3-node cluster is running gluster. And the host, and the virtualised clusters are all running Proxmox 7.4-20, so even down to the software versions, it's all the same.

Again, the video from 45Drives shows that even with vastly better hardware, they're still only able to achieve about 8% of the drive's performance capability (vs. the ~22% that I'm able to currently achieve with this old, crappy hardware).

So now, you take that 22% and apply to new, modern hardware, and so 22% of a U.2 or E1.S EDSFF NVMe 5.0 x4 SSD that's capable of 12 GB/s sequential read/writes will yield 2.64 GB/s vs even the best of ceph at 8% would only yield 0.96 GB/s.

No amount of networking is going to make up for the fact that EC ceph appears to max out at around 8-ish% of a drive's capabilities whilst a distributed dispersed gvol, even with my old crappy hardware, maxes out at around 22%.

Again, watch the 45Drive video that I've linked above.

When you're only getting around 0.96 GB/s, what people actually do in real ceph deployments, if they want to get closer to the 2.64 GB/s that you might be able to get with gluster, is buy three of the U.2 or E1.S EDSFF NVMe 5.0 x4 SSDs to make up for the fact that ceph doesn't/can't use more than ~8% of the drive's capabilities.

That's literally just throwing money to mask the fact that ceph is only using 8% of the drive. Three times as much money.
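A quick sketch of where "three times" comes from. The 22% and 8% figures are my own estimates from earlier in this post (they are not vendor specs), the 12 GB/s is the rated speed used above, and the function name is just for illustration.

```python
import math

def per_drive_gb_s(rated_gb_s, utilisation_pct):
    """Effective per-drive throughput at a given utilisation percentage."""
    return rated_gb_s * utilisation_pct / 100

RATED_GB_S = 12  # rated sequential speed used in this post

gluster_gb_s = per_drive_gb_s(RATED_GB_S, 22)  # ~22%: my estimate for the dispersed gvol
ceph_gb_s = per_drive_gb_s(RATED_GB_S, 8)      # ~8%: my estimate for EC ceph

# Whole drives at the ceph rate needed to reach what one drive gives at the gluster rate.
drives_needed = math.ceil(gluster_gb_s / ceph_gb_s)

print(f"gluster: {gluster_gb_s:.2f} GB/s, ceph: {ceph_gb_s:.2f} GB/s, drives needed: {drives_needed}")
```

The raw ratio is 2.75, and since you can't buy 0.75 of a drive, it rounds up to three.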

vs. if you bought the one drive, and then ran gluster, gluster is faster, in both sequential and random workloads.

Out of all of the participants in this discussion, no one has provided any real, concrete, hard data that shows/demonstrates/proves otherwise.

People can complain about my methodology. As I predicted, people were going to complain about the hardware that I'm using, and the people who are complaining about my hardware of choice aren't actually going to supply new(er) hardware for me to test with (i.e. they're not going to actually do anything about it). So if they're going to complain about it, then they can run the very same tests, however they see fit, with whatever hardware they have (or have access to), and try to show me that ceph can do better than using just 8% of a drive's capability.

No one else has provided any data. They just complain because people who complain won't be people who complain, unless they're always complaining about something (and then do absolutely nothing about it). It's complaining for the sake of complaining.

alexskysilk said:
its whether you are comfortable with operating a legacy environment without security or bugfixes. Its your headache to deal with. There was/is nothing inherently wrong with PVE7, or gluster- their devs simply moved on due to various drivers.

I think that it really depends on what's your security risk profile and how tolerant you are to/of it.

Like if you're a homelabber and you haven't opened any of your services so that they can be accessed from outside your network, then how big of a security risk/attack vector/profile do you really have and how much of that are you able to tolerate?

Think about all of the people who have homelabs.

You can find YouTube videos of people who deliberately exposed a new WinXP VM to the internet to become an instant magnet for problems.

But you can also run dockur/windows, run WinXP, have it still be able to connect to the internet without any issues. (Ask me how I know this. It was a bit of a pain to try and get a WinXP SP3 CD key.)

So how big of a security/attack vector is it really?

Think about how many point of sale systems that are still running WinXP Embedded.

alexskysilk said:
I dont really know what you're referencing in terms of information or statistics

My dad used to work at one of the banks as a "computer operations associate".

Banks do not deploy bleeding edge technologies. It is one of the reasons why your bank account most likely (still) sits on an IBM mainframe somewhere. There's a very good reason why they haven't migrated your bank account from AIX/POWER (or z/OS/System z) over to Linux on arm or x86_64. If you've ever noticed that you're more and more often unable to access your bank account online, it's probably because they've migrated -- something that didn't used to happen nearly as much when it was still running on the IBM mainframes. Or the other way to ask yourself the same question is "who still uses (IBM) mainframes and why?". (Oracle has pretty much functionally and effectively killed off Sun Microsystems hardware. Yes, there are still new servers, but when you go to oracle.com, their hardware is not featured prominently on their homepage, which tells you everything you need to know about Oracle's perspective on their own hardware solutions.)

Where I work, our global manufacturing system, based on what my senior and supervising engineers have told me, still runs on COBOL.

I've also been told that there have been attempts by companies/developers/IT firms to modernise it, and they have all failed to do so properly. If said global manufacturing system stops working, all of our manufacturing plants shut down as a result.

So pretty much any company that isn't a startup from the last 30-40 years (i.e. any company that existed long before the dot com boom of the late 90s) is probably running legacy systems that it either can't modernise or has failed to modernise (and in some cases, not for lack of trying).

Or to put it to you in a slightly different way: when I am interviewing potential developer candidates, I will ask them questions about scale. I had one interviewee who physically flinched when I asked this question (about scale), because that is the scale of our operations.

We've migrated from an on-prem deployment of github to github-in-the-cloud and I quite literally get emails about some problem with said github-in-the-cloud service daily. Again, scale.

Think of any company that existed throughout the 80s and 90s and even earlier. And ask yourself what kind of systems do you think they (still) run?

alexskysilk said:
With regards to IB- its a fantastic low latency transport, and is used in many use cases where latency is king. Unfortunately its also missing a TON of functionality used by more typical usecases (No layer 2 functionality to speak of) which limits it's utility for PVE, for example. Nevertheless, it can be used as long as your comfortable operating SMs in production and dont attempt to use them for vmbrs. I decomissioned all IB from my deployments when 100gbe became nearly as performant but massively cheaper, and with switches that can actually do stuff. I get you already own the hardware so the price isnt a feature for you; the real question you should be asking is what are you actually after?

Three things:
1) As I mentioned, for me, 100 Gbps IB was for HPC that I was running in my basement. As I've also said, said HPC applications that use MPI/RDMA etc. can regularly and routinely hit 80-90 Gbps out of a possible 100 Gbps, as the software is leveraging the hardware system interconnect.

Also as I've said, storage, at that point, was just a fringe benefit (and also as I've said, I didn't buy said 100 Gbps IB for storage). I bought it for HPC. The fact that I can use it for storage as well was just an added bonus.

2) Remember that bandwidth can be represented by bit width / latency. Lower latency can relate to higher clock speed (but not always). Higher clock speed * bit width loosely gives you bandwidth. Therefore; if you want to speed things up, you can either increase the bit width, increase the clock speed, and/or lower the latency.

(That's more or less how the IB roadmap is developed, where they tackle one or more of these variables, pretty much in terms of signal and/or electrical and electronics engineering.)

3) As I've said, the Linux bridge's inability to bridge IB interfaces (which I've come to learn about) isn't just a Proxmox thing; it's a linux thing.

Therefore; given that I can't create linux bridges, the next best thing for shared IB is SR-IOV and IB VFs, which, again, as I've stated, the Debian version of opensm doesn't support.

Which is also why I already wrote about how if I want virtualisation support, I'd either need to replace my switch from MSB-7890 to MSB-7800 (which I've had before, but had problems with my previous unit, so I returned it), and/or I'd need to have a dedicated system, whose sole purpose would be to run the IB opensm, most likely either in CentOS or Rocky Linux, so that I can enable virtualisation support. (which I've also written about already as well).

Then I'd be able to pass the IB VFs to LXCs and VMs and that'll get about as close to a linux bridge as it'll get on the IB side of things.

And the idea with that would be "use it since I already have it" (instead of deploying a new, physical 10G layer). Intranode, it can run 10G internally. That's what the virtio-nic shows up as in Win10+, Linux, and MacOS. Internodal communications can run over 100 Gbps IB (since I already have it).
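As a worked example of point 2 above (bandwidth as width times per-lane rate), here is a small sketch. It assumes the 100 Gbps link is a 4x EDR link with 25.78125 Gbaud per lane and 64b/66b encoding, which is my reading of "100 Gbps IB" and is not stated explicitly above; the function name is made up.

```python
def link_bandwidth_gbps(lanes, lane_rate_gbaud, payload_bits, coded_bits):
    """Usable link bandwidth: lanes x per-lane signalling rate x encoding efficiency."""
    return lanes * lane_rate_gbaud * payload_bits / coded_bits

# Assumed 4x EDR parameters: 4 lanes, 25.78125 Gbaud each, 64b/66b line coding.
edr_4x_gbps = link_bandwidth_gbps(4, 25.78125, 64, 66)
print(f"4x EDR: {edr_4x_gbps} Gbps")

# The 80-90 Gbps that the MPI/RDMA applications reach, as a share of the link.
for observed_gbps in (80, 90):
    print(f"{observed_gbps} Gbps is {observed_gbps / edr_4x_gbps:.0%} of the link")
```

Doubling either the lane count or the per-lane rate doubles the result, which is the "increase the bit width or increase the clock speed" lever described in point 2.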

alexskysilk said:
If you're insistent on using this for whatever purpose- use it. no one will stop you. If you intend to operate it as an enterprise- you've been warned.

Conversely, if an enterprise has already deployed gluster and it's working for them (because it's faster than EC ceph), then it is working for them.

alpha754293 · Jun 3, 2026

t.lamprecht said:
And most affected way older kernels, we just didn't support any there.

That's fine.

But for example, with CVE-2026-43134, I checked to see if the affected kernel modules (which the exploit uses, or which are at least a part of the exploit) were loaded, and they weren't.

And before the patched kernel was rolled out, the ICA was to disable those kernel modules.

So even for older kernels that haven't been updated, there are still mitigating actions that can be taken.
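A minimal sketch of that kind of check, assuming the standard /proc/modules layout (module name in the first column). The module names and the sample text below are placeholders, not the actual modules named in the CVE.

```python
def loaded_modules(proc_modules_text):
    """Set of module names from /proc/modules-style text (name is the first column)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}

def still_loaded(affected, proc_modules_text):
    """Affected modules that are currently loaded, sorted for stable output."""
    return sorted(set(affected) & loaded_modules(proc_modules_text))

# Hypothetical module names and sample text; on a live host you would instead
# read the real list with: open("/proc/modules").read()
AFFECTED = ["example_mod_a", "example_mod_b"]
SAMPLE = """\
example_mod_a 16384 0 - Live 0x0000000000000000
virtio_net 61440 0 - Live 0x0000000000000000
"""

print(still_loaded(AFFECTED, SAMPLE))
```

An empty list means none of the affected modules are loaded; anything it does print is a candidate for unloading or blocking until a patched kernel is available.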

Like I said, I'm glad that I was able to use this very system that I am currently using for this ceph vs. gluster testing to test the pve7to8 migration before rolling it out to my PROD server at home, only to find out that when it was initially released (and for some time thereafter), the LSI SAS RAID HBA has issues with the very specific kernel version that shipped with pve8, out of the box, by default.

If I didn't test it and just rolled it out to PROD without testing, I would've been screwed. (Hence why I have LTO-8 backups.)

t.lamprecht said:
bonus third point: qemu devs don't care at all about commit counts, that's not the actual metric, just correlation. And with them, vibe coding definitively will have the opposite effects of what you want to achieve.

Oh I know.

But since people here are talking about "gluster is dead" in conjunction with the monthly commit count (or lack thereof), I'm not the one who's saying it. It's the other people in this discussion thread who are saying that.

So if you want to (artificially) boost the monthly commit count, agentic vibe coding would be the easiest way to do that, given that's what people here are talking about (monthly commit count).

cf.

ProKn1fe said:
Their github not even close to be "active maintained". Some bugfixes and nothing more.
