Glusterfs is still maintained. Please don't drop support!

EllerholdAG · Jun 3, 2026

I thought the problem was the QEMU support for GlusterFS? You'd need to raise this with the QEMU devs and not Proxmox, afaik.

alpha754293 · Jun 3, 2026

guruevi said:
Ceph's erasure-coded pools typically incur higher write amplification and network overhead because data and parity are distributed across multiple OSDs. A 6+2 profile has 8 participating nodes vs a 3-way replicated pool, increasing latency and reducing small-write performance. However, for larger objects that you would see in modern workloads, this difference is minimal. Hence we store boot disks on 3-way and large data disks on a 4+2 erasure pools. By default, it will rely less on things like filesystem caches and wants the client to handle that.

Again, I will refer you to the problems with EC ceph that the guy from 45Drives highlights, in their video.

guruevi said:
but Ceph provides strong guarantees around data integrity, failure recovery, and behavior at large scale.

...at the expense of speed. (again, cf. the dev's page about performance issues with EC ceph) Unless, of course, you want to argue that the ceph devs are wrong about their own documentation about EC ceph's performance issues.

guruevi said:
Ceph is faster than Gluster today

ceph on the left, gluster on the right.

But according to @guruevi, ceph is faster. I think only in guruevi's world would 9.21 MB/s be faster than 82.33 MB/s. (etc. etc. etc.)
(Where's your data?)

guruevi said:
Modern Ceph uses BlueStore and has modern CPU optimizations, and it can scale because it uses an algorithm to determine object placement without expensive metadata lookups.

The ceph devs literally tell you, about the current read implementation: "Current code reads all data chunks."

Why do you make it so easy for me to disprove your statements by quoting the ceph devs themselves? Or are you saying/arguing that what the ceph devs wrote here is wrong? You really think/want to argue that what the ceph devs wrote here is wrong?

Between the ceph devs and guruevi, I wonder if guruevi thinks that he understands how ceph works better than the ceph devs.

(The ceph devs literally tell him that the current read implementation "...reads all data chunks.")

guruevi said:
Recovery is object-based and deterministic, while Gluster relies heavily on background healing and filesystem operations.

On modern NVMe and 25G+ networks, the CPU cost of checksums and erasure coding is negligible, Ceph will today outperform Gluster even in relatively small clusters while providing stronger integrity and recovery guarantees.

Again, as the screenshot above patently shows, this is demonstrably false.

(Again, where's your data?)

Still no data.

guruevi said:
Comparing GlusterFS and Ceph on 15yo hardware is like running a VW Beetle, then transferring that engine into a modern car and comparing how fast they go. Sure, the Beetle will beat the new car in a speed race, but it will kill its occupants in a crash and be overall much less safe on the road today, and the Beetle still can't compare to the performance of your average modern car.

Still no data.

You've written a little bit, but still no data.

guruevi said:
If you want a technical discussion, you need to do it on technical terms.

Still no data.

(Again, are you complaining about how I've predicted that complainers will always find something to complain about, whilst doing absolutely nothing about it, or are you going to run the tests the way that you think they should have been performed?)

Are you going to provide data or are you just going to complain about it?

guruevi said:
There is not a single spinning hard drive in the world that can sustain a random pattern uncached 100MB/s.

Your original statement/quote said nothing about random throughput.

Here is what you actually, originally said:

"My point was you can't push 100Gbps from a single spinning disk that gives at best 1-10Mbps of throughput (if not reading from cache)."

Note that nowhere in that statement did you mention the word random.

Furthermore, most people talk about random I/O in terms of I/O/s rather than in terms of MB/s.

guruevi said:
That alone tells me something is wrong with your benchmark.

Show me where, in your original statement:

"My point was you can't push 100Gbps from a single spinning disk that gives at best 1-10Mbps of throughput (if not reading from cache)."

you wrote the word random anywhere.

The problem isn't with the benchmark. The problem is that your statement doesn't have the word random in it anywhere.

That's on you.

guruevi said:
8-10MB/s for uncached data 4K read/writes on spinning rust sounds more accurate, true random read/write 4K would top out somewhere around 0.5MB/s. You need to do something like 1-4MB chunks sequential to get to 100MB/s on 7200RPM.

Again, the CrystalDiskMark data shows that gluster random I/O performance is 1.69 MB/s, which is well in line with your expectations.

The two statements can't be true simultaneously. The benchmark for random I/O performance can't be both true and false, at the same time. We're not talking about the Heisenberg Uncertainty Principle here (as it relates to EC ceph and distributed dispersed gluster performance).

Oh I would love to see the math as to how you figured that one out.

Again. Data.

Where is your data?

(If you're going to complain about someone else's data and/or say that someone else's data is wrong, then the least that you can do is to show up with your own data that clearly proves that said someone else's data is wrong. I've yet to see any of your data.)

alpha754293 · Jun 3, 2026

guruevi said:
EC pools are not about throughput, they’re about space optimization.

I never said that EC pools are about throughput.

EC pools give you data redundancy at higher levels of storage efficiency, whilst giving you some/most of the speed of a replicated pool.

It won't be quite as fast as a replicated pool for reads, because you can't parallelise your reads across multiple copies of the same data. With the chunks distributed across OSDs, you would still get some level of OSD read parallelism, if it weren't for the fact that the current EC read implementation (per the ceph devs) reads all of the chunks and then discards the unneeded ones.

(Again, cf. here.)
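
As a rough illustration of the storage-efficiency side of that trade-off, here is a minimal Python sketch comparing the usable fraction of raw capacity for replicated and erasure-coded layouts. The function names are illustrative only (they are not part of ceph or gluster), and the profiles are the ones mentioned in this thread (3-way replication, EC 4+2, EC 6+2). It says nothing about speed, only about capacity.

```python
def replica_efficiency(copies: int) -> float:
    """Usable fraction of raw capacity when every object is stored `copies` times."""
    return 1 / copies


def ec_efficiency(k: int, m: int) -> float:
    """Usable fraction of raw capacity for an erasure-coded profile
    with k data chunks and m coding (parity) chunks."""
    return k / (k + m)


# Profiles mentioned in the thread; labels are just for printing.
layouts = {
    "3-way replica": replica_efficiency(3),
    "EC 4+2": ec_efficiency(4, 2),
    "EC 6+2": ec_efficiency(6, 2),
}

for name, fraction in layouts.items():
    print(f"{name}: {fraction:.1%} of raw capacity is usable")
```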

guruevi said:
There is always a trade off. NVMe is much better than spinning disks for IOPS, but I don’t get where you get 5% usage.

I don't know how many more times I can explain this: if your drive is capable of 12 GB/s reads and you're only getting 600 MB/s, then 600 MB/s divided by 12000 MB/s = 0.05.

That's 5% of what your drive is capable of, in terms of sequential read transfer rate.

If you can't understand 600 MB/s divided by 12000 MB/s = 0.05 - I'm not sure that there's much that I can do to help you with this basic arithmetic.

Maybe Prodigy Math?

Or in another example, if my HGST 1 TB SATA 3 Gbps is capable of 133.75 MB/s in terms of sequential transfer rates, then 5% of that would be

133.75 MB/s * 0.05 = 6.6875 MB/s.

But as the gluster results show, I'm getting 104.58 MB/s. But that's across three drives. 104.58 MB/s / 3 = 34.86 MB/s

34.86 MB/s / 133.75 MB/s = 0.260635514
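
The same division, written out as a minimal Python sketch for anyone who wants to check it. The function and variable names are illustrative; the figures are the ones quoted in this post (12 GB/s rated vs 600 MB/s achieved for the NVMe example, and 104.58 MB/s across three 133.75 MB/s HDDs for the gluster result).

```python
def utilisation(achieved_mb_s: float, rated_mb_s: float) -> float:
    """Fraction of a drive's rated sequential throughput that is actually achieved."""
    return achieved_mb_s / rated_mb_s


# NVMe example: 600 MB/s achieved on a drive rated for 12 GB/s (12000 MB/s).
nvme_fraction = utilisation(600, 12_000)

# HDD example: 104.58 MB/s aggregate across three drives rated at 133.75 MB/s each.
aggregate_mb_s = 104.58
drive_count = 3
per_drive_mb_s = aggregate_mb_s / drive_count
hdd_fraction = utilisation(per_drive_mb_s, 133.75)

print(f"NVMe utilisation: {nvme_fraction:.2%}")
print(f"HDD per-drive throughput: {per_drive_mb_s:.2f} MB/s")
print(f"HDD utilisation: {hdd_fraction:.2%}")
```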

Again, if you need further help with the arithmetic, you can check out Prodigy Math for more assistance.

There aren't very many more ways that I would be able to explain this arithmetic to you in a way that would help you understand this division problem.

alpha754293 · Jun 3, 2026

guruevi said:
There are various tradeoffs but if your network fabric is 100GbE

As it has already been stated, my network is 100 Gbps IB.

guruevi said:
you will never be able to push more than 100Gb whereas a single NVMe today can push multiples of that, the benefit of buying a better NVMe is lower latency, potentially better write endurance.

Not really.

A single PCIe 5.0 x4 NVMe SSD only gets around 12 GB/s. 12 GB/s * 8 bits/byte = 96 Gbps. So, if you're talking about multiples of 100 Gbps, then yes, in the sense that 0.96 is a multiple of 100 Gbps. A fractional multiple.
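
The unit conversion as a minimal Python sketch. The names are illustrative, and the 12 GB/s figure is the approximate sequential rate quoted above, not a measured value.

```python
BITS_PER_BYTE = 8


def gbytes_to_gbits(gb_per_s: float) -> float:
    """Convert a throughput in GB/s to Gbps."""
    return gb_per_s * BITS_PER_BYTE


ssd_gbps = gbytes_to_gbits(12)            # approximate rate of one SSD, in Gbps
link_gbps = 100                           # the 100 Gbps fabric under discussion
multiple_of_link = ssd_gbps / link_gbps   # how many "multiples" of the link one SSD fills

print(f"One SSD: {ssd_gbps} Gbps, or {multiple_of_link:.2f}x a {link_gbps} Gbps link")
```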

Yes, NVMe has lower latency (that's like saying that water is wet).

And no, NVMe SSDs do not have better write endurance (than HDDs).

(What's the wearout percentage on your HDDs? What's the wearout on your SSDs?)

guruevi said:
Ceph is probably close to 80-90% of theoretical performance in real world scenarios. I have synthetic benchmarks for my EC pool to 1.15Tbps cluster-wide which is near 98% of line speed for this particular cluster. Sure in theory my SSD can provide close to 120Tbps in aggregate,

1.15 Tbps / 120 Tbps = 0.0096.

There you have it.

Your own data proves my point - EC ceph sucks for performance, because your own data shows that you're using LESS THAN 1% of what your drives are actually capable of.

Or to put it in another way (that might help you understand it better):

If your EC pool (gluster) was capable of using 26% of your drive's capability, you'd need ~1/4 as many drives as you currently have now.

So if you have 40 drives, which gets you your 1.15 Tbps out of a possible 120 Tbps, you'd be able to buy only 10 drives and get the same throughput as you did with 40 drives.

And if each drive is say $2000 USD.

40 * $2000 USD = $80000 USD

10 * $2000 USD = $20000 USD.

$80000 USD - $20000 USD = $60000 USD.

Why buy 40 drives for what you can do with 10 drives?

Your own data SHOWS that you're currently using less than 1% of your drive's capability.

Which means that if you bought 1000 drives, you could've just bought 10, and gotten the same throughput.

So, yes, you literally wasted money on drives where you don't make use of the remaining 99% of their performance capabilities.

(Whereas if you were using a system that used 26% of the drive's performance capabilities, you could've gotten away with ~75% fewer drives.)

If you're using 40 drives, and the drives were $2000 USD each, you literally wasted $60000 USD on drives whose performance capabilities you'll never use. (And that's with my 26% drive capability utilisation.)

With your LESS THAN 1%.

If you bought 1000 drives at $2000 USD each, that's $2M USD.

But you could've gotten away with 10 drives (1% of 1000 = 10). 10 drives * $2000 USD each = $20000. $2M - $20k = $1.98M.
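
The cost arithmetic above as a minimal Python sketch. It follows the same simplification used in this post (that the number of drives you need scales with the fraction of drive capability you actually use), and the $2000 per drive is the made-up price from the post, not a real quote. The function name is illustrative.

```python
PRICE_PER_DRIVE_USD = 2000  # hypothetical price used in the post


def wasted_spend(drives_bought: int, drives_needed: int,
                 price_usd: int = PRICE_PER_DRIVE_USD) -> int:
    """Money spent on drives beyond the number that would deliver the same throughput."""
    return (drives_bought - drives_needed) * price_usd


# Utilisation implied by the quoted figures: 1.15 Tbps achieved out of 120 Tbps possible.
cluster_utilisation = 1.15 / 120

print(f"Cluster utilisation: {cluster_utilisation:.2%}")
print(f"40 drives bought, 10 needed:   ${wasted_spend(40, 10):,} wasted")
print(f"1000 drives bought, 10 needed: ${wasted_spend(1000, 10):,} wasted")
```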

That's literally how much money you wasted on drives whose capability you paid for but will never be able to make full use of.

You're proving my point: EC ceph pools mask the underlying performance problem by throwing money at it.

Your bottom line is 0.96%. That's how much of the drive's performance capabilities you're using.

That's how many more drives you guys bought, when, if you had studied the performance numbers better/more closely, in detail, you would've figured out that 0.96% drive performance capability utilisation is very, very, very low.

And that you could've gotten away with needing to buy fewer drives, if you used something that used more of your drive's performance capabilities.

You threw $2M at a problem that you could've solved with $20k.

This is what I mean when I say that production deployments of EC ceph just throw money at it to mask the fact that EC ceph pools have very poor drive performance utilisation.

guruevi · Jun 3, 2026

But your fabric is still 100Gb. You don't seem to understand that you cannot, regardless of method, go beyond that metric. Gluster can't replicate your data faster than your fabric, neither can Ceph and that is REGARDLESS of whether you use EC or replicated pools, it doesn't matter, you can't have a client write faster than the link to its storage. And with modern CPU, EC is about as fast as a replicated pool if properly architected.

If Gluster is pushing 100MB/s on a spinning drive, it's physically impossible it is committing that to 3 or 5 other disks across a gigabit network. You show benchmarks indicating a full gigabit network, are you comparing Gluster on your IB with a gigabit replicated Ceph and surprised they don't go equally fast?

What would you have done, purchased spinning disks with 10ms latencies vs NVMe with 100ns latencies? You clearly do not understand how things work. And even then, it wouldn't have cost you $20k, Enterprise SAS 2.5" 10k drives, which top out at ~1TB were roughly the same cost per GB than Enterprise NVMe SSD at the time when they were still available and consumed 10x more energy. Today you get 256TB on a single EDSFF SSD, there is not even a 3.5" drive with that density, but drives don't give you 600MB/s or even 100MB/s, they average "real commit speeds" at ~0.5-2.5MB/s.

But the latency is what matters, it is what drives your IOPS, it is what makes your Windows 11 update faster. Windows 11 isn't even supported on non-SSD systems, and there is a reason for that.

Anyway, I hope you never architect these things in a real situation, keep playing, I hope you learn something.

alpha754293 · Jun 4, 2026

guruevi said:
spinning disks are slower, consume more power, take up more space. Cheaper SSD are also slower, consume more power etc, the difference between 100ns latency and 2 millisecond latency is profound. The overhead of SAS/SATA vs NVMe is noticeable. Being able to sustain 50k IOPS under load vs a consumer SSD crashing down to 500 IOPS after the cache is full, those are all issues I think about.

The most telling part about all of this is the answer to the question "what do you run, in your own personal homelab where you (often) don't have a multi-million dollar IT budget?"

I'm going to go out on a limb here and surmise that what your company uses isn't going to be the same as what you're running at home, because you (often) don't have a multi-million dollar budget (nor even hundreds or tens of thousands of dollars) for your homelab.

So whilst you might talk a big talk in terms of what your company has spent multi-millions on, the telling part of all this is "what do you spend your money on?"

If say, you have like an aggregate total of 300 TB of raw storage capacity at home, I can pretty much guarantee you that you're not going to be spending $1000 on P4800X's, 1.5 TB at a time. (300 TB / 1.5 TB = 200. 200 * $1000 = $200,000).

I can guarantee you that you're not spending $200k on P4800X U.2 NVMe SSDs, let alone the additional costs of the systems/hardware to house said 200 U.2 NVMe SSDs, and then the networking infrastructure to tie all of those systems together.

And then because SSDs will die (finite life. SSDs are the brake pads of the computer world - the faster they are, the more you're going to want to use them, and therefore; the faster they'll wear out), you then have to budget extra to replace said dead SSDs when they inevitably die.

The last time that I bought HGST 12 TB SAS 12 Gbps HDDs, I was able to buy them for $68.89 each. Extrapolating that out for my 300 TB raw storage --> 300 TB / 12 TB = 25. 25 * $68.89 = $1722.25.

So whilst your all NVMe solution would've been about $200k, mine, with all HDD, is less than 1% of that. (0.86% to be exact).
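
A minimal Python sketch of that comparison, using only the prices and capacities quoted in this post ($1000 per 1.5 TB P4800X, $68.89 per 12 TB HGST SAS HDD, 300 TB raw). The function name is illustrative, and it counts drives only: no chassis, networking, or replacement costs.

```python
import math


def drives_and_cost(raw_tb: float, drive_tb: float, price_usd: float):
    """Number of drives needed to reach `raw_tb` of raw capacity, and their total price."""
    count = math.ceil(raw_tb / drive_tb)
    return count, count * price_usd


nvme_count, nvme_cost = drives_and_cost(300, 1.5, 1000)
hdd_count, hdd_cost = drives_and_cost(300, 12, 68.89)
cost_ratio = hdd_cost / nvme_cost

print(f"NVMe: {nvme_count} drives, ${nvme_cost:,.2f}")
print(f"HDD:  {hdd_count} drives, ${hdd_cost:,.2f}")
print(f"HDD cost as a share of NVMe cost: {cost_ratio:.2%}")
```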

In other words, it's fine when you're spending someone else's money. Let's see what solution you come up with when you have capacity (and performance) targets that you're trying to accomplish when you're spending your own money.

guruevi said:
Ceph EC has no exponentially increasing synchronization overhead like MPI, the CRUSH algorithm will only select n+k nodes for each block. It doesn’t have to synchronize to all nodes because the client knows exactly where the data is located based on its object id and the map.

Read through the ceph EC performance enhancements docs authored by the ceph devs more closely. Again, they explicitly tell you that in the current read implementation, it reads all of the chunks from all OSDs.

If what you wrote was correct (and accurate) re: "It doesn't have to synchronize to all nodes because the client knows exactly where the data is located based on its object id and the map", then again, you're arguing that the ceph devs are wrong when they say that the current read implementation for erasure coding reads all of the chunks from all OSDs to be able to read the file/data.

So, you're saying that a) you know more than the ceph devs and b) the ceph devs are wrong.

That's what you're saying.

(Again, the ceph devs literally and explicitly tell you that they read all data chunks from all OSDs. It's not conjecture. The ceph devs literally tell you that, as shown in the screenshot from their docs above.)

But you think that you know more than the ceph devs, and thus are now arguing that the ceph devs are wrong in their statement that it reads all data chunks from all OSDs, because "the client knows exactly where the data is located based on its object id and the map."

Good luck with that one.

(Think about how you would keep 300+ OSDs synchronised, as a node can have more than one OSD. You can run two of these servers, where each node houses 6 HDDs and each server is thus responsible for 24 HDDs in total; if you're running two of them, you've got 48 HDDs spread across 8 nodes. You can then run said 48 drives where each drive is an OSD, while also running an EC(6,2) profile. And then scale it up to 300+ OSDs. It's a synchronisation issue, no different than MPI synchronisation for HPC applications. This is why you don't get linear scalability with MPI once you exceed somewhere between 512 and 2048 CPU cores (due to MPI synchronisation overhead). This is no different than that.)

In fact, gluster, ever since they said that you would probably want to make a ZFS pool a gluster brick, has basically figured out the same thing (more or less), akin to the hybrid OpenMP/MPI that LS-DYNA can use. Instead of needing to synchronise all of the bricks, they recommend that you use a ZFS pool as a brick. That way, the EC (raidz(#)) can be handled internally by the brick itself (because it's a ZFS pool), and all gluster needs to be responsible for is internodal synchronisation, which significantly reduces the synchronisation overhead because you're not dealing with disks=bricks anymore.

And this is very similar to the idea that I have proposed for ceph, where you run a ZFS pool, create, for example, an iSCSI target on it, and then export that to ceph as the OSD. As far as ceph is concerned, it's just a massive OSD, but the OSD itself is backed by ZFS, where you can enjoy the advantages of ARC, and it can also handle its own internal EC by way of raidz(#), since cache tiering in ceph has been deprecated since at least reef.

gluster has already figured this out.

ceph still isn't quite (officially) there yet.

(Furthermore, gluster, from what I've read, supports RAM caching, whereas ceph has deprecated it.)

guruevi said:
As to whether I should test Gluster again, I ask why.

Cuz you were complaining about the results and methodology.

If you think that others are doing it wrong, then you can show us exactly how to do it "right" and "properly" according to you.

That's why.

And if you're not going to do that, then again, you're just complaining (word that starts with a 'b') for the sake of complaining, whilst never actually going to do anything about it.

You're just going to complain about it because people who complain always need something to complain about. And that's fine, so long as there's conscious recognition that that is what you are doing (complaining for the sake of complaining rather than actually running the tests/benchmark the way you think it should've been performed, so that you have your own set of data for us to have a technical discussion about. But you're not going to do that. And that's fine. But at least we know which camp you reside in.)

guruevi said:
it doesn’t perform well on modern hardware and is outperformed by Ceph.

no data.

Domino · Jun 4, 2026

At the end of the day, only you can know what is good for you based on what you have tested, the way you tested it, with the hardware you own and if it made you happy in the sense it performs to your expectations.

The problem is the variables, hardware, software, user experience, the list goes on and on.

Never ever fall into the trap of trying to convince others, it is as hard as explaining how strategically systematically engineered the current decline of the global economy is. (lol)

The primary issue is that the QEMU team has publicly announced the deprecation of its GlusterFS support.

An obvious observation would be the fact the Proxmox team have to keep their end simple and somewhat aligned with underlying technologies (ie Qemu) to minimize ongoing workload as long as it doesn't impact their paying clientbase. Branching a codebase as large as Qemu is an enormous task which grows even larger the more you have to maintain outside the scope of the master-base.

My two pence, just do it yourself

I have modified a number of NIC drivers, even a 5G modem driver, and fixed bugs which I just couldn't be bothered to submit upstream. I managed to add full RSS support and multiple TX/RX channels matching larger core counts, which took my networking stack to a whole new level: under maximum load, core 0 sits at 0.

One could potentially even write a GlusterFS plugin for proxmox like that NVME fabric guy has done and provides a commercial service in that regard.

PS. I'm in a bad mood, I got my cooling script wrong and lost a 15 TB NVMe due to heat overload. It was the last drive in a 2U server's front vertical 2.5" slots and just didn't get the ideal airflow during a recent heat wave. Now I'm angry at myself for being so complacent, because 15 TB NVMes aren't exactly cheap. The good news is that it's all backed up!

alpha754293 · Jun 4, 2026

guruevi said:
But your fabric is still 100Gb. You don't seem to understand that you cannot, regardless of method, go beyond that metric. Gluster can't replicate your data faster than your fabric, neither can Ceph and that is REGARDLESS of whether you use EC or replicated pools, it doesn't matter, you can't have a client write faster than the link to its storage. And with modern CPU, EC is about as fast as a replicated pool if properly architected.

What part of "I didn't get/deploy 100 Gbps for storage" don't you seem to understand?

Again, I stated it at the outset - I am going to test this in a way that makes it easy for me to test, because complainers always have to have something to complain about.

You're proving my point for me.

If I tested this with GbE, you'll complain about that.

If I test with 100 Gbps IB, you'll complain about that.

People who complain about stuff always have to complain about something.

(And oh BTW, if you've actually read what I've written, which it's quite evident you didn't/haven't, I also wrote that the tests were conducted entirely within a single system, and therefore; the 100 Gbps IB doesn't even factor into said tests. But you clearly didn't read that, and thus, are still complaining about something that has no bearing on the tests that were performed. It was a trap to see if you were paying attention and you clearly weren't.)

(I do have 100 Gbps IB. But it's not a part of these current tests, and therefore; largely irrelevant that the ConnectX-4 dual port 100 Gbps IB NIC is physically installed in the system.)

guruevi said:
If Gluster is pushing 100MB/s on a spinning drive, it's physically impossible it is committing that to 3 or 5 other disks across a gigabit network. You show benchmarks indicating a full gigabit network, are you comparing Gluster on your IB with a gigabit replicated Ceph and surprised they don't go equally fast?

This has already been answered.

Read what I've written again (since it's quite evident that you didn't read it properly the first time through).

Read the system configuration and test setup section again. The answer to this has already been written and it's all in there.

"What would you have done, purchased spinning disks with 10ms latencies vs NVMe with 100ns latencies?"
I don't have to replace HDDs in the same way that I would need to replace NVMe SSDs, so it's HDDs for me.

(I've already burned through 7 SSDs using them as swap drives. It's NOT that hard. A single monte carlo simulation consumes about 2% wearout on an E1.S EDSFF NVMe SSD. Run 50 of said monte carlo simulations and the SSD is dead.)

"You clearly do not understand how things work."
BWAHAHAHA....

1.15 Tbps / 120 Tbps.

That's your data/number(s).

You clearly haven't thought about what a waste your system is: 99.04% of its performance capability is wasted/unused. At a 0.96% performance capability utilisation rate, you probably bought drives that were capable of 12 GB/s sequential reads, got 115.2 MB/s, and then thought proudly to yourself that you were doing pretty good, when, as my data shows, you could've gotten 104.58 MB/s of the 115.2 MB/s that you're actually getting/using with just three HGST 1 TB SATA 3 Gbps HDDs from 11 years ago.

And that's PER U.2 and/or E1.S EDSFF PCIe 5.0 x4 NVMe SSD that you bought.

You spent all this money on a solution when you could've spent a fraction of the money and gotten very similar results, if your other solution actually had a higher drive utilisation rate than the solution that you actually implemented.

Clearly you weren't running the math which would've shown you what a terrible $/performance ratio this was.

Again, you spent how much money on this, for what you could've accomplished with a fraction of said money, had you bothered to calculate this metric and run the math/numbers, which would've shown you just how cost inefficient this solution really is.

But you never did your homework/math. And that's why you went with this.

1.15 Tbps out of 120 Tbps possible. Or 0.96% drive performance capability utilisation.

Even with my crappy cluster with a 13 year old CPU, a 13 year old motherboard, RAM of unknown age (but you can probably safely assume that it's likewise 13 years old), and 11 year old HDDs, I can get 26% drive performance utilisation out of 11-13 year old hardware, while you can't get even 1% drive performance utilisation using the latest and greatest.

But sure, you know how things work alright. [/s]

26% drive performance utilisation with 11-13 year old stuff vs. 0.96% drive performance utilisation with your latest and greatest.

"And even then, it wouldn't have cost you $20k, Enterprise SAS 2.5" 10k drives, which top out at ~1TB were roughly the same cost per GB than Enterprise NVMe SSD at the time when they were still available and consumed 10x more energy."
You spent how much on the drives again, just to get 0.96% drive performance utilisation out of them?

I just pulled $2000 per drive out of thin air, as the actual price isn't the point here. The point is that you could've gotten the same level of performance for a lot less, if you had actually bothered to do your math homework (which you clearly didn't), because if you did, you would've realised what a terrible business proposition this was. 0.96% of the drive's performance utilisation. That's what all that money that you were given bought you. And the point was that you could've achieved the same level of performance with something that could've cost you 1% of the budget that you were given, and still gotten the same actual throughput. But for a lot less money. And you would've known that, if you did your math homework to compute the drive performance utilisation metrics, which again, you clearly didn't.

You had all this money in the world to buy the latest and greatest and the best that you could muster up, with all that money in the world, was 0.96% drive performance utilisation. You were able to buy the latest and greatest of everything, and you could still only achieve 0.96% drive performance utilisation.

That's it. After all that money, that is all that you were able to accomplish.

With all that money in the world, you couldn't even crack the 1% drive performance utilisation mark.

But sure, my 26% drive performance utilisation = I don't know how things work.

If I don't know how things work, then how is it that I'm able to get 27x the drive performance utilisation that you were able to get with your latest and greatest??? (0.26/0.0096 = 27.083333333)

"Today you get 256TB on a single EDSFF SSD"
How much is that going to cost you?

"but drives don't give you 600MB/s or even 100MB/s, they average "real commit speeds" at ~0.5-2.5MB/s."
So what you're really saying is that the 9.21 MB/s sequential write in the Win11 VM that was running on ceph was actually, vastly slower than that, in actual reality?

I'm glad that you're willing to admit this, even if only implicitly.

(i.e. NTFS could be lying about the sequential transfer rates, but however it was lying about the 82.33 MB/s sequential write speed with gluster, it would also be lying about the 9.21 MB/s sequential write speed with ceph. Therefore; gluster could be slower in actual reality, but if the real, actual commit speed is say, 1/10th of that, then gluster would still be at 8.233 MB/s whilst ceph would only be 0.921 MB/s. Either way, ceph is still slower, even if NTFS is lying to us, as it would lie to us in the exact same manner regardless of what I am using behind it (gluster or ceph). If the NTFS lie is a factor of 10, then it would be a factor of 10 for both gluster and ceph. Thus, 82.33 MB/s would become 8.233 MB/s on gluster, but 9.21 MB/s would become 0.921 MB/s on ceph. Either way, ceph is slower, as the data shows. NTFS wouldn't change its lying factor just because you're using a different storage technology on the backend, as the Win11 VM/NTFS would have no knowledge/awareness of what your storage backend is. It just uses it.)
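
A minimal Python sketch of why a common "lying factor" cancels out. The factor values are hypothetical, and the sketch assumes (as argued above) that the same factor applies to both backends; the two throughput figures are the sequential write results quoted in this thread.

```python
GLUSTER_SEQ_WRITE_MB_S = 82.33  # reported inside the Win11 VM on gluster
CEPH_SEQ_WRITE_MB_S = 9.21      # reported inside the Win11 VM on ceph


def speedup(lying_factor: float) -> float:
    """Gluster-to-ceph throughput ratio after dividing both reported
    numbers by the same hypothetical over-reporting factor."""
    gluster_real = GLUSTER_SEQ_WRITE_MB_S / lying_factor
    ceph_real = CEPH_SEQ_WRITE_MB_S / lying_factor
    return gluster_real / ceph_real


for factor in (1, 10, 100):
    print(f"lying factor {factor:>3}: gluster is {speedup(factor):.2f}x ceph")
```

Whatever the factor is, the ratio stays at roughly 8.94x, because the factor appears in both the numerator and the denominator.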

"But the latency is what matters, it is what drives your IOPS, it is what makes your Windows 11 update faster."
And yes, on the same CPU, same motherboard, same RAM, same HGST 1 TB SATA 3 Gbps HDDs, gluster updated significantly faster than ceph.

If latency is what matters, the drives are the same between both gluster and ceph. And yet, ceph was the only Win11 VM where it had god awful performance (both in sequential workloads as well as random workloads) whereas gluster performed significantly faster.

You can argue that NTFS lies, latency is king, whatever.

But it's the SAME physical hardware. SAME physical HGST 1 TB SATA 3 Gbps HDDs. Same everything (basically).

And since it's the SAME HGST 1 TB SATA 3 Gbps HDDs that I'm using for both (and I bought the drives, gosh, like 8 years ago, and never used them until now, so they were wiped by the eBay seller, and I haven't used them since, so as far as the system is concerned, they're "fresh" drives - ready to go), the latency of said HGST 1 TB SATA 3 Gbps HDDs would be the same across all 6 HDDs and not a function of whether I'm running gluster or ceph on them.

Therefore; given that it's literally six of the same drives, the latency would be the same across them, but gluster clearly ran the Win11 updates significantly faster than ceph, despite the fact that all six HDDs have the same latency. Or, put another way - the drives that were used for gluster aren't suddenly going to have 8.939196526x lower latency (for the sequential write test from CrystalDiskMark) and 2.745762712x lower latency for the random write test between running said Win11 VM on gluster vs. running said Win11 VM on ceph.

Latency is a property of the physical disk, not what storage technology is being used on them.

And yet, gluster clearly ran said Win11 updates faster, by a HUGE margin, over ceph.

If latency was king, then ceph should've been able to complete the Win11 updates in the same time that it took gluster to complete the same post-install updates.

If gluster could get 82.33 MB/s sequential write and 1.62 MB/s random write, then there is no reason why ceph couldn't get the same performance. I mean, gluster demonstrated that the physical hardware was capable of it, and yet, ceph failed to get anywhere close to gluster's level of performance.

(This is why I ran both tests on the same physical box.)

"Anyway, I hope you never architect these things in a real situation, keep playing, I hope you learn something."
I hope I'll never have to architect these things for real either. It'll put your job at risk, because I'll do my math homework (as shown here) whereas you, clearly, don't/didn't.

You'd present your proposal. And I'd present a counter proposal that would be able to achieve your levels of performance, at a fraction of the cost. (Which I've already demonstrated here. And that's with made-up costs for the drives, etc. Put in the real costs for the hardware and the difference can be even bigger.)

26% drive performance capability utilisation vs. your 0.96% drive performance utilisation, or a 27x difference between your proposal and mine.

If your proposal had a total cost of $1M, my proposal would've only cost around $37,037.04.
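For anyone who wants to check that arithmetic, here is a short Python sketch. It assumes both utilisation figures are percentages and that cost scales inversely with drive utilisation, which is the simplification used above, not a real cost model.

```python
# Utilisation figures from the post, both treated as percentages (assumption).
util_mine_pct = 26.0
util_theirs_pct = 0.96

ratio = util_mine_pct / util_theirs_pct   # about 27.08
rounded_ratio = round(ratio)              # the "27x" used in the post

# Hypothetical budget for the other proposal, in dollars.
their_cost = 1_000_000
my_cost = their_cost / rounded_ratio

print(f"utilisation ratio: {ratio:.2f}x (rounded to {rounded_ratio}x)")
print(f"scaled cost: ${my_cost:,.2f}")
```

Using the unrounded ratio instead of 27 gives a slightly lower figure, so the rounded number is the more conservative of the two.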

So yeah - I hope that I would never have to architect something like this for real as well, because it would put you out of a job.

Even if they doubled my budget, it'd still be a fraction of the cost of your proposal.

Because I did my math homework.

And yes, I learned that ceph sucks. You can make up all the excuses in the world, but the facts still remain that even if NTFS has 10x lying factor, it's going to be the same 10x lying factor whether it's gluster or ceph.

Therefore; gluster sequential writes are almost 9x faster than ceph, and random writes are 2.7x faster with gluster than ceph.

This is what the data shows. This is what I've learned.

And you don't have any concrete numbers that prove or demonstrate otherwise, using whatever methodology you want to employ instead of just install and run.

alpha754293 · Jun 4, 2026

Domino said:
The problem is the variables, hardware, software, user experience, the list goes on and on.

Right. Again, this is why I tested both, on the same physical box and with six HDDs rather than just three.

That way, three can be assigned/dedicated to ceph and the other three can be assigned/dedicated to gluster so they don't "criss-cross" each other, and thus, can't interfere with each other either.

By purposely running it within the same physical "box", it makes it so that as many of the variables that could be controlled, are controlled. I let the Linux scheduler handle process/thread allocation as well (because in a real deployment, the I/O threads aren't going to have the process affinity mask set and it'll be up to the Linux scheduler to figure out which I/O thread/process goes to which core, etc.)

And since ceph was so slow, I basically ended up running the gluster benchmark whilst the Win11 VM was still installing on ceph (because ceph was so slow). (And I actually started the ceph install first.)

Again, out of all of the people who have commented, @kayson and I are the only ones who have provided any actual, concrete data. Other people just complained about it, but then didn't bother to actually run their own tests, the way they think the tests should be executed/performed, and then publish their results data.

And that's fine. At the end of the day, the data speaks for itself. If people aren't happy that the data doesn't support their worldview, then they are more than welcome to run their own tests, in the manner of their own choosing, and present their data, but they haven't, and most likely won't.

alpha754293 · Jun 4, 2026

Domino said:
PS. I'm in a bad mood, I got my cooling script wrong and lost a 15tb nvme due to heat overload, it was the last drive in 2u server front vertical 2.5 slots and just didnt get the ideal airflow during a recent heat wave. Now I'm angry at myself for being so complacent because 15tb nvmes aren't exactly cheap. The good news is that its all backed up!

Sorry to hear about that man.

That really sucks!

(If you ever need me to run the CFD studies for you, just hit me up. I've already studied the optimal placements for thermal pads (on NVMe SSDs), using CFD. It's better, in terms of thermal conduction, if you place individual thermal pads on the individual NAND flash chips than if you place one long, continuous thermal pad across all of the chips simultaneously. By cutting the thermal pads and placing them on the individual chips, you limit how much planar heat conduction happens within the thermal pad and instead promote orthogonal heat conduction, and thus more of the heat goes into your heatsink rather than being "spread across" (planar conduction) said one long, continuous thermal pad.)

Domino · Jun 4, 2026

alpha754293 said:
Sorry to hear about that man.

That really sucks!

(If you ever need me to run the CFD studies for you, just hit me up. I've already studied the optimal placements for thermal pads (on NVMe SSDs), using CFD. It's better, in terms of thermal conduction, if you place individual thermal pads on the individual NAND flash chips than if you place one long, continuous thermal pad across all of the chips simultaneously. By cutting the thermal pads and placing them on the individual chips, you limit how much planar heat conduction happens within the thermal pad and instead promote orthogonal heat conduction, and thus more of the heat goes into your heatsink rather than being "spread across" (planar conduction) said one long, continuous thermal pad.)

That's good to know! Thanks! I've always just run a strip of thermal pad across the lot.

Although in this case it is a U.3 drive, so not really something you think about opening and re-padding, plus with it being a Samsung PM1733, they do run crazy hot as it is.

My Kioxia CM6's run 10 degrees+ less! because of the so obvious design of having loads of vent-holes all around the drive! No idea why all the other manufacturers effectively seal their drives, and some only having a tiny orifice at the front and at the port end. I think going forward I will stick to buying Kioxia.

alpha754293 · Jun 4, 2026

Domino said:
That's good to know! Thanks! I've always just run a strip of thermal pad across the lot.

No problem.

You're welcome.

(I got curious and figured that I could just run it in CFD to figure it out with some science behind it. And then when I was looking at the results, I could see the planar thermal conduction through the thermal pad, which meant that it was keeping/"trapping" the heat inside said long, continuous thermal pad, whereas cutting it creates a physical stop for said thermal conduction and therefore promotes orthogonal thermal conduction rather than planar thermal conduction. And that makes sense, because that's what you want it to do - have the heat travel into the heatsink as much and as fast as possible.)

Domino said:
Although in this case it is a U.3 drive, so not really something you think about opening and re-padding, plus with it being a Samsung PM1733, they do run crazy hot as it is.

No, but again, mass manufacturing means that they probably can't take the time to cut the thermal pads and place them on the individual chips, as that'll just take too long for them to do.

If your drives are still in warranty, I'd just leave them. That way, if there's a problem, they can't use that as an excuse to deny your warranty claim.

But once your drives are out of warranty, you can cut the thermal pad yourself, and the drive will run ever so slightly cooler (depending on how much forced convection you have on/up against the heatsink).

Bottom line - there are things that you can do to help with thermal management and CFD can help test different solution proposals prior to implementation. Why guess when you can just run the CFD and get some answers?

Domino said:
No idea why all the other manufacturers effectively seal their drives, and some only having a tiny orifice at the front and at the port end.

Short answer?

Bean counters.

Engineering probably told them that it would be better for their product.

Manufacturing probably told them how much it would cost.

Bean counters picked the solution that cost them the fewest beans.

And as long as the product fails the day after the warranty expires, then that's your problem. And they make more beans from you, since you might buy from them again.

There are only a tiny handful of companies where they will just charge the customer the manufacturing costs for a better engineered product.

Most other people buy what's cheapest, even if it means buying a product that might be inferior, engineering-wise.

It's what businesses and customers-alike do.

(My first "take your kid to work day" in 9th grade - I didn't want to go to where my parents worked (mom was a speech-language pathologist and dad ran the computer systems at a bank), so I was placed by my high school co-op office at one of the local car dealerships, in the service department, with the mechanics. I distinctly remember asking the mechanic back then what was something that car companies did that was stupid. He was working on a GMC van (think GMC Safari, but from the late 80s/early 90s) and told me that the steering column has a plastic cover on it that faces the driver. That cover used to be a two-piece injection-moulded plastic cover, so that if the mechanic needed to service the upper half of the steering column, he would be able to take out the two screws, take the top half of the cover off, and do whatever he needed to. As a cost-save, GM cut out those two screws and made the cover one single, long injection-moulded piece, saving fractions of a penny. Now, to service the top part of the steering column, it would take the mechanic 4x as long, because the screws that hold the cover in are now mounted at the floor, and so he'd have to half-rip the floor out to get to the two bottom screws, take those out, and then take off the cover. But GM would still only pay him for the time the job would've taken when it was still a two-piece cover. (So he'd only be paid for, let's say, 15 minutes, for a job that now takes an hour.) This is what companies do, and it's not limited to just GM. This was just my first exposure to this sort of stupidity from companies, and it was a lesson in how companies work that I still remember to this day. And customers buy what's cheap(est) most of the time. It's why dollar stores and Walmart exist and are a multi-billion dollar industry/company.)

kayson · Jun 5, 2026

Another data point (not performance, but gluster-isnt-dead) - Debian is shipping glusterfs 11.2 in forky: https://packages.debian.org/en/forky/glusterfs-server

spirit · Jun 5, 2026

@alpha754293

- How many PGs do you have on your ceph cluster? (A new pool on empty storage gets a low number of PGs by default when a target ratio is not configured on the pool, and this can give very low write speed.)

- Do you have writeback enabled on your VM? (For ceph, it really helps with small sequential writes.)

- Also, for reads, try enabling:

"rbd config pool set POOLNAME rbd_read_from_replica_policy localize"

(With this, in a hyperconverged setup, the VM will read from the local node by default, like gluster.)
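As a rough checklist, the three points above could be collected as commands like this. It is only a sketch: it builds the command lines without running them, `POOLNAME` is the placeholder from the command above, the VM ID `100` is hypothetical, and using `qm config` to look for a `cache=writeback` disk option assumes a Proxmox VE host.

```python
def ceph_check_commands(pool="POOLNAME", vmid=100):
    """Build (not run) the commands for the three checks listed above."""
    return {
        # PG count: autoscaler overview plus the pool's current pg_num.
        "pg_autoscale": ["ceph", "osd", "pool", "autoscale-status"],
        "pg_num": ["ceph", "osd", "pool", "get", pool, "pg_num"],
        # Writeback: on Proxmox VE the disk line shows cache=... when it is set.
        "vm_disk_cache": ["qm", "config", str(vmid)],
        # Local reads: the command quoted in the post.
        "read_local": ["rbd", "config", "pool", "set", pool,
                       "rbd_read_from_replica_policy", "localize"],
    }


commands = ceph_check_commands()
for label, argv in commands.items():
    print(f"{label}: {' '.join(argv)}")
```

Each list can be passed to `subprocess.run` on a node that has the ceph and Proxmox tools installed.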

guruevi · Jun 5, 2026

Here is actual data from the home lab, constrained to a 400Mbps network (more on that later) and tested on 1 system. All systems use the exact same boot image, running Ubuntu provisioned through Vagrant, with 1 dedicated drive per node for the servers and 1 client. The software is 'out of the box', no tuning; all tests are direct I/O, no cache, exact same config, caches flushed between tests, etc.

I tested GlusterFS on XFS (the fastest way to do Gluster), CephFS (because that's a file system too) and then mounted a file system of an EXT4 partition on Ceph RBD (to emulate what a VM would do). Ceph RBD was mounted via FUSE, which is the worst way of doing Ceph; you get ~2-5x better performance using Ceph via VFS, but it doesn't really matter here because I'm constrained to a 50MB/s pipe and I didn't want to invest the time.
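For reference, the 400Mbps cap and the 50MB/s figure are the same limit, and it also puts a hard ceiling on the IOPS any client can reach at a given block size. A minimal Python sketch, assuming decimal megabits, 4 kB = 4096 bytes, 4 MB = 4 MiB, and ignoring protocol overhead and replication traffic:

```python
def mbps_to_mb_per_s(megabits_per_s):
    """Convert a link speed in megabits/s to megabytes/s (8 bits per byte)."""
    return megabits_per_s / 8


def iops_ceiling(link_mb_per_s, block_size_bytes):
    """Upper bound on IOPS when every I/O must cross the link once."""
    return link_mb_per_s * 1_000_000 / block_size_bytes


link_mb_s = mbps_to_mb_per_s(400)
ceiling_4k = iops_ceiling(link_mb_s, 4096)
ceiling_4m = iops_ceiling(link_mb_s, 4 * 1024 * 1024)

print(f"link: {link_mb_s:.0f} MB/s")
print(f"4 kB ceiling: {ceiling_4k:.0f} IOPS, 4 MB ceiling: {ceiling_4m:.1f} IOPS")
```

Any result above these ceilings would point at caching somewhere in the path rather than at the storage backend.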

All the settings for fio are identical, queue depth of 32, 4 threads reading 128 files (512 total files being created per run), 4MB for sequential, 4kB for the others, 5s warmup time (that will avoid spikes due to things like TCP windowing adjustments).
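For anyone wanting to reproduce something similar, here is a hedged sketch of what such a job file might look like, generated from Python. Only the queue depth, thread count, file count, block sizes, warmup time and direct I/O come from the description above; the `libaio` engine, the 60s runtime, the 1 GiB size per job, the job names and the `/mnt/under-test` directory are assumptions for illustration, not the actual settings used.

```python
def fio_job(name, rw, block_size, directory="/mnt/under-test"):
    """Return the text of a single-section fio job file."""
    options = {
        "ioengine": "libaio",   # assumption: an async engine so iodepth > 1 takes effect
        "direct": 1,            # direct I/O, bypassing the page cache
        "iodepth": 32,          # queue depth of 32
        "numjobs": 4,           # 4 threads
        "nrfiles": 128,         # 128 files per thread (512 in total)
        "ramp_time": 5,         # 5s warmup, excluded from the results
        "runtime": 60,          # assumption: length of the measured window
        "time_based": 1,
        "group_reporting": 1,
        "size": "1g",           # assumption: total data per job, spread over its files
        "directory": directory, # hypothetical mount point of the storage under test
        "rw": rw,
        "bs": block_size,
    }
    lines = [f"[{name}]"] + [f"{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


jobs = {
    "seq-write": fio_job("seq-write", "write", "4m"),
    "rand-write": fio_job("rand-write", "randwrite", "4k"),
}
print(jobs["rand-write"])
```

Saving either string to a file and running `fio` on it against the same mount point for each backend keeps the workload definition identical across runs.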

As a result I have virtually no variation between min and max. I tested random data writes to avoid ANY (accusation of) read caches.

First result:

Hmm, what is happening here? The test should take 65s. Oh, setup time. Yes, fio pre-writes its test files. Gluster has a problem scaling: when you write a file, all nodes have to be made aware, and during that time each node was spending 50% of CPU just keeping data in sync for the folder. Because Gluster wasn't built around SSDs, the glusterfsd has a hard time keeping up with parallelism. The setup time takes almost twice as long on Gluster. Hence why I had to limit the network to 400Mbps - https://github.com/gluster/glusterfs/issues/4028 and https://github.com/gluster/glusterfs/issues/4085 - under heavy load, a cluster split-brains, and there are reports going back years, nobody has fixed that, not even Red Hat did, that's likely why they abandoned it around the time SSD became commonplace. Basically you have to throttle the ingress of data, once the gluster daemon has a backlog of things to process (and it's single threaded apparently), it seems the cluster thinks the brick has gone offline.

This is IOPS. Gluster again slower by ~30%.

Latency:

Throughput, which also demonstrates, a client can't go faster than its network pipe:

Now, there are ways of making Gluster (and Ceph too) go faster than its network pipe, which is enabling write-back cache. But that is a ticket to data corruption.

alpha754293 · Jun 5, 2026

spirit said:
- How many PGs do you have on your ceph cluster? (A new pool on empty storage gets a low number of PGs by default when a target ratio is not configured on the pool, and this can give very low write speed.)

PG autoscale=on

spirit said:
- Do you have writeback enabled on your VM? (For ceph, it really helps with small sequential writes.)

No idea.

Whatever the defaults were.

(The idea with testing it this way, without setting any non-default settings, is to test the default behaviour, "out of the box". If the default is wrong then perhaps, the default should change. Or at least the default should have a little bit more intelligence baked in, depending on what storage backend you're using, and then set the defaults in accordance with that.)

spirit said:
- Also, for reads, try enabling:

"rbd config pool set POOLNAME rbd_read_from_replica_policy localize"

(With this, in a hyperconverged setup, the VM will read from the local node by default, like gluster.)

1) Does this apply to EC pools as well?

2) How does it work when the metadata for EC pools requires a replicated pool, while the data-pool option within PVE storage (read: /etc/pve/storage.cfg) lets you tell it that the data pool is a pool other than your replicated pool?

3) If it needs this, again, why isn't this enabled by default?

alpha754293 · Jun 5, 2026

guruevi said:
All systems use the exact same boot image

What are the specs of the host system? Which Ubuntu version are you using/running? Why aren't you testing with a Win11 VM as the client, if you want to mimic an enterprise Win11 VM workload? (Unless you don't want to mimic an enterprise Win11 VM workload :shrug: )

guruevi said:
running Ubuntu provisioned through Vagrant

Why did you provision the Ubuntu VM through Vagrant?

(There are no details about what your host system is running, so why wouldn't you be provisioning an Ubuntu VM using Proxmox? (cuz you know...given that this is a Proxmox forum and all...))

guruevi said:
1 dedicated drive per node for the servers and 1 client.

No mention of how many nodes you're running in total and whether they're physical nodes or virtualised nodes. No mention of the hardware specs for each of the nodes.

guruevi said:
The software is 'out of the box'

Vagrant is installed in Proxmox "out of the box"? (If it isn't, then it really isn't truly "out of the box" now, is it?)

guruevi said:
and then mounted a file system of an EXT4 partition on Ceph RBD (to emulate what a VM would do)

So...I'm confused. Are you testing it with an Ubuntu VM client, or are you testing the host?

Cuz if you're testing the host, I can't wait for @jlauro to comment about how you're testing the host, rather than inside a VM.

And if you're testing the Ubuntu VM, then why would you need to mount an ext4 filesystem/partition on the Ceph RBD to quote "emulate what a VM would do"?

Wait...how did you get Win11 to install with ext4?

I think SLES15SP6 now defaults to btrfs, IIRC.

Also IIRC, TrueNAS Scale defaults to ZFS on root, when you run their installer.

In other words, "emulating" what a VM would do would highly depend on what VM you're emulating. Of course, this also begs the question of "why emulate it if you can just install it?" Like it's not that hard (nor that time consuming). You can pretty much auto-provision the VM as a background task.

guruevi said:
All the settings for fio are identical, queue depth of 32, 4 threads reading 128 files (512 total files being created per run), 4MB for sequential, 4kB for the others, 5s warmup time (that will avoid spikes due to things like TCP windowing adjustments).

I'm confused.

If you're running quote "1 dedicated drive per node" and you're running this on "1 system", so....are the nodes virtualised then???

From what you've written, I can't tell if you're running 1 client VM (of type: unknown/undisclosed) or if you're running multiple VM clients.

I say this because I didn't think it would be possible for a single VM client (of type unknown) to be able to generate an actual queue depth of 32, with 4 threads reading 128 files each.

You say that your test is to emulate what a VM client does, but I can't think of a VM client that would generate a queue depth of 32, reading 4 threads of 128 files each, can you?

I mean, let's suppose that you're testing an Ubuntu VM (again, you don't say what VM you're running or what VM you're trying to emulate, just what FS the (Linux) client VM would be using), and let's suppose that you don't download (and install) the updates when Ubuntu is getting installed (and you're going to use the updates to test how the storage backend reacts), I can't think of/imagine that running said Ubuntu updates, post-install, is going to generate a queue depth of 32, with 4 threads reading 128 files each.

What VM does that?

(Since you already assume that there's more that I need to learn, then please, educate me. What VM, whilst it is doing its own VM things, will result in a queue depth of 32, with 4 threads reading 128 files each? I'd like to see how you would even get a VM into this state where it generates a queue depth of 32, with 4 threads reading 128 files each. Can you share/tell me how in the world you got your VM into this state so that I can replicate your test please?)

guruevi said:
I tested random data writes to avoid ANY (accusation of) read caches.

If you had provisioned a VM, and it was running the post-install updates, wouldn't that kinda take care of this pretty much automatically for you so that you don't need to worry about read caching?

For your sequential write results, how big of a file are you writing?

(you state that it's queue depth=32, numjobs=4, and I don't know what the fio flag is for 128 files, but didn't state what the block size is nor the total count of the number of blocks (or anything that would indicate the total size of the sequential file being written).)

guruevi said:
Yes, fio pre-writes its test files

CrystalDiskMark does the same thing.

guruevi said:
Gluster has a problem scaling: when you write a file, all nodes have to be made aware

I have no idea how many nodes you're running your tests with.

guruevi said:
during that time each node was spending 50% of CPU just keeping data in sync for the folder

Nor the configuration specs for each node (How many CPU cores? CPU model? Is it using kvm64 as the node CPU model/type? How much RAM does each node have? etc.)

Also no details about whether you're running a replicated Ceph pool or an EC Ceph pool. No configuration details for your gluster test either. Did you set up distributed volume(s)? Replicated? Dispersed? Distributed replicated? Distributed dispersed? Again, no configuration details for which/how you've set up your gluster (nor ceph, for that matter) for your tests.

guruevi said:
Gluster wasn't built around SSDs

No idea what kind of SSDs you're using here, for your tests.

guruevi said:
The setup time takes almost twice as long on Gluster. Hence why I had to limit the network to 400Mbps - https://github.com/gluster/glusterfs/issues/4028 and https://github.com/gluster/glusterfs/issues/4085 - under heavy load, a cluster split-brains, and there are reports going back years, nobody has fixed that, not even Red Hat did, that's likely why they abandoned it around the time SSD became commonplace. Basically you have to throttle the ingress of data, once the gluster daemon has a backlog of things to process (and it's single threaded apparently), it seems the cluster thinks the brick has gone offline.

So...what I am gathering here is that I can try deploying multiple Win11 VMs in my test cluster, to see if I would be able to push gluster to its limits, on my system (because I don't have new/fancy toys like you do at home. The kids' college fund is more important than me buying toys. I sysadmin because I have to, to support my CAE habits, and not because I want to.)

But even then, again, you're generating a load with a queue depth of 32, and 4 threads reading 128 files each simultaneously.

Your homelab has that much I/O nominally???

That's actually kinda impressive that you're able to sustain that much I/O work, in what I can only assume to be running 24/7.

Again, I'm trying to think of when you would ever have VMs that would generate an iodepth=32, numjobs=4, nrfiles=128 workload concurrently/in parallel.

I can't think of a scenario, in a VM, that would do that. Can you?

guruevi said:
Hence why I had to limit the network to 400Mbps - https://github.com/gluster/glusterfs/issues/4028 and https://github.com/gluster/glusterfs/issues/4085 - under heavy load, a cluster split-brains, and there are reports going back years, nobody has fixed that, not even Red Hat did, that's likely why they abandoned it around the time SSD became commonplace. Basically you have to throttle the ingress of data, once the gluster daemon has a backlog of things to process (and it's single threaded apparently), it seems the cluster thinks the brick has gone offline.

Are you using a separate network for the storage I/O traffic or is it all just through one network?

(My test system, having only 64 GB of RAM, is RAM starved, so I can't spin up more than four Win11 VMs (and the corresponding virtualised nodes) simultaneously. If the system had more RAM, then I'd be able to spin up more (nodes/Win11 VMs, simultaneously), but right now, I am practically limited to 3 nodes, which means I can spin up only 3 Win11 VMs simultaneously. But again, I can't think of how you'd even get a VM to generate an iodepth=32, numjobs=4, nrfiles=128 workload, although to be fair, if your numjobs=4, then it would be like if you only had 4 VMs running anyways.)

I'm trying to think of what state a VM would have to be in where the queue depth would be 32. And across 128 files at that. That works out to 4096 I/O operations waiting in the queue (if the number of I/O operations waiting in the queue = queue depth x number of files). In other words, you'd have to be issuing I/O operations faster than the SSDs (and the host) can process them.
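As a quick sanity check on that number, here is the arithmetic in Python, under two different readings of the settings. The first is the reading used above (depth multiplied by files). The second assumes, as far as I understand fio, that iodepth is a limit per job rather than per file, so treat it as an alternative estimate rather than a correction.

```python
iodepth = 32    # queue depth
numjobs = 4     # threads
nrfiles = 128   # files per thread

# Reading used in the paragraph above: one queue of 32 per file.
per_file_estimate = iodepth * nrfiles

# Alternative reading (assumption): one queue of 32 per job/thread.
per_job_estimate = iodepth * numjobs

print(f"depth x files: {per_file_estimate} outstanding I/Os")
print(f"depth x jobs:  {per_job_estimate} outstanding I/Os")
```

Either way, the question of what a VM would have to be doing to sustain that many outstanding I/Os still stands.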

Again, what state would the VM have to be in, for that to be true and/or even possible???

What on Earth would you have to be doing in that poor VM, to get it into a state where even your SSDs, on your system can't process the I/O that quickly???

That doesn't make sense to me.

How in the world would you even get a VM into this kind of an I/O state?

guruevi said:
Now, there are ways of making Gluster (and Ceph too) go faster than its network pipe, which is enabling write-back cache. But that is a ticket to data corruption.

If you're creating an ext4 FS on ceph RBD, then why would CephFS be faster than ceph RBD?

I don't fully understand this.

(Sidebar: In my testing, on my test system, I gave each of the nodes a virtio-nic so that the network wouldn't be a bottleneck, and it was running just fine. Granted, I don't know/think that a Win11 VM, even with HDDs, would be able to generate much of a queue depth across i number of files, but I don't know/think that an Ubuntu VM would generate much of a queue depth across j number of files either, in any practical, nominal usage scenario.)

Like I said, what the heck do you have to do to a VM to get it to a point/state where even on your system, with your SSDs, that you'd end up with a queue depth of 32 across 128 files?

I don't understand how you'd get there, especially when numjobs=4 (e.g. you're only running 4 VMs).

Johannes S · Jun 5, 2026

kayson said:
Another data point (not performance, but gluster-isnt-dead) - Debian is shipping glusterfs 11.2 in forky:

This has nothing to do with the question of whether gluster is dead or not. Debian ships a lot of packages which haven't seen much or any upstream activity for years, as long as somebody volunteers to maintain them. They also don't ship some packages with high upstream activity because up to now nobody bothered to package and maintain them. This doesn't change QEMU's deprecation of gluster though, and that's the thing that matters. If you want continuing gluster support in QEMU and Proxmox VE: volunteer as maintainer or get funding for it, e.g. via crowdfunding. If you can't do this, you can still use glusterfs as directory-based storage, similar to ocfs2. I wouldn't do this in a business production environment though, since it's not covered by Proxmox support subscriptions.
But (I might be wrong though) my impression is that the whining about the loss of gluster mainly comes from homelabbers, so for them the directory option might still be "good enough".

alpha754293 · Jun 5, 2026

guruevi said:
Hence why I had to limit the network to 400Mbps - https://github.com/gluster/glusterfs/issues/4028 and https://github.com/gluster/glusterfs/issues/4085 - under heavy load, a cluster split-brains, and there are reports going back years

Both of the links that you've provided refer to/talk about GlusterFS 11.x.

To quote from your first link: https://github.com/gluster/glusterfs/issues/4028
"This issue did not occur with v10.3"

To quote from your second link: https://github.com/gluster/glusterfs/issues/4085
"I had cluster that was stable as a rock on 10.x"

guruevi · Jun 5, 2026

Most of the data you ask about is really irrelevant. The point was to drive the system to some kind of unified performance peak, what CPU and whether virtual or hardware and exact test design is rather irrelevant as long as they are the same, so you can make a fair comparison of performance without any tricks. You can set up the same thing in other systems and do the comparison. Most of the setup favors Gluster’s ideal setup and is not at all how you would do an intelligent Ceph setup, because as the OSD scales, so does Ceph performance whereas Gluster goes in the other direction. The synchronization issue you mentioned is real on Gluster, more bricks, more traffic, whereas Ceph uses CRUSH to avoid talking to unnecessary nodes.

Queue depths of 32 and having 128 files open across 4 threads are mild workloads for modern SSD and hypervisor workloads, even a single Windows 11 workspace can reach that, it also avoids various pitfalls people find in benchmarks such as CPU and interrupt saturation, which would otherwise just become a CPU/storage/NIC throughput benchmark. The tests were performed from the client to the server as indicated. I didn’t test Windows because it doesn’t matter, in Linux it is easier to control all the variables and automate the tests.

Similar split brain scenarios under heavy load (note that by today's standards this is no longer heavy; Ceph can hit 1M 64kB IOPS across 12 nodes) were reported before on mailing lists in 2019 with Gluster 4: https://lists.gluster.org/pipermail//gluster-users/2019-September/037083.html It is a very similar problem to what I started seeing 15 years ago: as our cluster grew, split brains started occurring. Yes, you can throw more CPU and hardware at it and do optimizations, but at some point it becomes untenable. A file system that throws up its hands with an I/O error is BAD.

The fact that you end up with lost data rather than a complete halting state is one of the many design issues with Gluster, hence the recommendation not to use it in production. The fact that it, even if it were a regression, hasn't been fixed in over 2 years just demonstrates it is a dead project.
