Glusterfs is still maintained. Please don't drop support!

The Gluster plugin WORKED in previous versions. To ensure that it CONTINUES to work is a non-trivial investment in manpower and development time.
Line 32 of debian/control just calls for the glusterfs-client.

PVE isn't responsible for maintaining said glusterfs-client package which is the core package that enables gluster to work in/with PVE.

Lines 674 and 753 of pvesm.pm call for GlusterfsPlugin.pm, which was deleted in the 16 Jun 2025 commit (commit ID 7669a99e97f3fd35cca95d1d1ab8a377f593dccb), so a little less than a year ago as of this writing. That commit deleted a plugin that a) already exists and b) already works.

Again, PVE isn't responsible for maintaining the glusterfs-client that this all rides on.

And since a lot of people have already said that there isn't much activity going on with glusterfs anyway (activity is slow, but not nothing), the chances that something breaks with an existing glusterfs-client package would generally be quite low, especially if it is stable and has been working for some time.

It'd be one thing if they were dropping the glusterfs-client package because it doesn't work, but that's not the case here.

What makes you think that PVE is responsible for making sure that glusterfs-client isn't broken?

If it was broken and PVE decides to drop it, at least that's understandable.

The PVE team dropping this for a part of QEMU that PVE doesn't even use - that's just silliness.

Is the glusterfs-client package broken? No?

Does PVE use QEMU's native glusterfs client? No?

Then the stated reason for dropping it doesn't apply.

IIRC, I think that someone else here has said that the user is free to use the glusterfs-client as a directory mount point, which is what the plugin does for you.

They made the decision to drop the commitment to supporting gluster in order to commit their resources to other aspects of the stack they deem is of more importance.
No, that's not the stated reason for dropping gluster.

This is what @Thomas Lamprecht wrote, for this commit (as the "official" stated reason for dropping gluster);

Code:
drop support for using GlusterFS directly

As the GlusterFS project is unmaintained since a while and other
projects like QEMU also drop support for using it natively.

One can still use the gluster tools to mount an instance manually and
then use it as directory storage; the better (long term) option will
be to replace the storage server with something maintained though, as
PVE 8 will be supported until the middle of 2026 users have some time
before they need to decide what way they will go.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>

The part about gluster not being maintained simply isn't true.

(Again, @alexskysilk, you wrote what you wrote on the assumption that it is up to the PVE devs to make sure that the glusterfs-client package still works, because that's what pvesm.pm and GlusterfsPlugin.pm depend on.)

If it still works, then there's no reason to remove it.

The reason that Thomas gave for removing it is partially false (re: that it isn't maintained). PVE doesn't use the QEMU native support, and therefore Thomas' second reason is irrelevant.

GlusterfsPlugin.pm already exists and as long as glusterfs-client package works, then there is little reason to think that said GlusterfsPlugin.pm won't work.

If it ain't broke, why break it?

(PVE doesn't use QEMU's native glusterfs client/support/backend, and gluster is still maintained, even if (very) slowly. You can look at the PRs that were merged. Again, if code is stable enough, wouldn't you expect the number of commits per month to decrease?)

So simple.
 
You asked for an answer, and you received one. Litigating your point of view is not relevant, interesting, or consequential. You don't have to ACCEPT the answer. Your choices remain the same as before the wall of text (which I didn't read, apologies).
 
You asked for an answer, and you received one.....You don't have to ACCEPT the answer.
The answer that was provided is riddled with inaccuracies and/or statements that are patently, and/or knowingly, fraudulent misrepresentations.

It'd be one thing if the answer was even remotely factually accurate, but alas, sadly, it is not (as shown above).

It'd be one thing to give people bad information because the person giving said advice doesn't know.

But to knowingly give out wrong information, I would have to imagine that's worse, right?

PVE isn't responsible for maintaining the glusterfs-client Debian package that GlusterfsPlugin.pm (and debian/control and pvesm.pm) depend on. They just have to roll back the commits which deleted those lines and, like I said, people who don't use it wouldn't even notice, and it brings back a feature/functionality that wasn't even broken.

Again, if/when the glusterfs-client breaks and PVE decides to remove it because it broke and is therefore no longer useful/valid, then sure. I agree with that (reasoning/rationale).

But to take it out because of a part of QEMU that PVE doesn't even use - that's just silly.

PVE devs can leave the GlusterfsPlugin.pm in its last known good state and don't even have to touch it. If/when it breaks, they can take it out in the future. But if it isn't broken now, then why take it out based on something that PVE doesn't even use from QEMU? That's just silly.
 
Again, the way that gluster is implemented in Proxmox, you add Gluster, effectively, via
pvesm add glusterfs my-gluster-storage --server 192.168.254.1 --volume gv0 --backup-server 192.168.254.2 --content images,rootdir
and that will mount it to /mnt/pve/my-gluster-storage on all of the nodes.
Got fuse mount that way ... and that's the point of support of qemu ...:
The Proxmox Native Custom Directory Trick (Recommended)
You can instruct QEMU to use a gluster:// network URI directly instead of pointing it to a local folder mount and use the QEMU libgfapi.
  1. Create your VM as usual in Proxmox.
  2. Edit the VM's hardware configuration file manually on the Proxmox host (/etc/pve/qemu-server/<VM_ID>.conf).
  3. Change the drive definition line to look like this:
    ...
    scsi0: gluster://<IP_OF_GLUSTER_NODE_1>/<VOLNAME>/images/<VM_ID>/vm-<VM_ID>-disk-0.raw,cache=writeback,discard=on,size=50G
    ...
By specifying gluster://, QEMU automatically hooks into its built-in libgfapi driver, entirely bypassing FUSE. [1, 2]

Otherwise the NFS-Ganesha alternative (If you need GUI simplicity)
If you require strict Proxmox GUI storage management and LXC container support (as LXC cannot use libgfapi), standard NFS won't cut it. Instead, deploy NFS-Ganesha on your Gluster servers.
  • NFS-Ganesha is a user-space NFS server that links directly into libgfapi underneath.
  • You configure Proxmox via the GUI to talk to it as a standard NFS share.
  • Performance: Slower than direct QEMU libgfapi, but vastly superior to standard Linux kernel NFS or the FUSE mounts you described.
PS: In any case, run something like "pvesm add glusterfs my-gluster-fuse ..." so the Gluster storage is also available through the GUI and usable for LXCs, if you are not going the NFS route as well.
 
From your writing, you clearly do not understand what PVE is, what qm does and to read through and disentangle it all seems like a waste of time. But qm is just a client to the PVE API (it simplifies QEMU only to a few relevant commands and builds an abstraction layer). PVE (the totality) is the management layer on top of QEMU, Ceph, ZFS, LVM and a few others. qm (the command) does not do any data reading or writing to disk for disk images, if you read the source code, where it may appear to do so, it just passes it through to a qemu-img tool. Almost everything you do in Proxmox can be done with QEMU tools if you find that there is an option missing or some massive complexity that is not (yet) built-in.

If QEMU doesn't support it, by definition PVE can't support it, but there is no obligation for Proxmox (the company) to implement every option that may be available in QEMU or extend support for things that QEMU supports. Like the Linux kernel, QEMU supports a TON of old stuff, including old hardware, hardware emulation etc. That's not the goal of PVE, the goal of PVE is a simpler, unified management layer for commercial use. A replacement to VMware more than a multi-tool for QEMU. If you want a multi-tool for QEMU, then use a multi-tool for QEMU like Quickemu, Virt-Manager etc.

QEMU dropped Gluster support (https://qemu-project.gitlab.io/qemu/about/removed-features.html), therefore it can't write to a block device on Gluster. As I already explained, a block device is not just a file share. You can set up Gluster servers with NFS Ganesha and share it out that way so you can write disk images, but to do that in Proxmox would require them to write a Gluster management layer, which IT'S A DEAD PROJECT with NO commercial significance whatsoever and using a file share to put qcow2 disk images on Gluster is not the best way, neither for performance or data safety. If QEMU were to return support for Gluster, then you can add that option to Proxmox, either by a storage plugin, or simply manually by editing the VM config files to add the correct QEMU incantations (https://www.gluster.org/qemu-glusterfs-native-integration/)

You can add Gluster support with a local disk backend to Proxmox if you want - here are the packages which HAVEN'T BEEN REBUILT IN 2 YEARS https://launchpad.net/~gluster/+archive/ubuntu/glusterfs-11 and thus is likely no longer functional on Ubuntu 25 or 26 LTS (which Ubuntu are the kernels Proxmox follows).

As per your Ceph comment, read again what it says you highlighted. It does not say it reads from all nodes in a cluster, as you previously indicated it works, it reads only the relevant DATA CHUNKS which is specifically the n OSD that your data is on per the CRUSH algorithm. In our previous example, it doesn't have to traverse all 300 of them, as in the documented case of Gluster requiring you to worst case scenario contact each brick daemon to query its DHT to find an object, Ceph only has to contact n+k where n+k is whatever data+parity distribution you have specified and the client knows "in advance" which nodes those are based on the cluster map. The example you copied is just an example, if you have 300 OSDs, it will still only read from those 6, if you have 1000 it will still only read from those 6, hence they do not need to illustrate the rest, since at that point of the documentation, it is already assumed you understand the CRUSH map. As a result, Ceph can and does scale to thousands of OSD daemons and has linear performance increases up until the point you hit a physical limit (CPU, RAM, networking).
 
We once had a Ceph EC PoC with up to 15 hosts (3x mon, 12x data), 7 HDDs per data host, all 10 Gbit, serving data over NFS to Linux workstations, getting up to 600 MB/s
while all Ceph CPUs (2x Xeon each, 8-12 cores) ran at 100% and all backend networks at 100%. Deleting data took overnight; that was the most horrible part.
Linear scaling would be nice, but you need so much hardware to get reasonable performance to clients, and with the whole 15-server setup you can only bet on reliability, not on performance.
We once had a 3-node production Gluster (a couple of HDDs each) for a production RHEV (~150 VMs, no containers), which ran for 6 years without any performance issues reported. Only once did a Gluster daemon hang; we rebooted the 3 nodes and everything kept running without problems.
So, from my perspective, the conclusion is that the Gluster installation was significantly more useful than the Ceph installation for around a year.
On the other hand, RH could not earn nearly as much money with Gluster support as with Ceph support, so in my eyes it was also clearly a business decision to push Ceph; sadly, a decision against the real customer world.
 
Got fuse mount that way ... and that's the point of support of qemu ...:
The Proxmox Native Custom Directory Trick (Recommended)
You can instruct QEMU to use a gluster:// network URI directly instead of pointing it to a local folder mount and use the QEMU libgfapi.
  1. Create your VM as usual in Proxmox.
  2. Edit the VM's hardware configuration file manually on the Proxmox host (/etc/pve/qemu-server/<VM_ID>.conf).
  3. Change the drive definition line to look like this:
    ...
    scsi0: gluster://<IP_OF_GLUSTER_NODE_1>/<VOLNAME>/images/<VM_ID>/vm-<VM_ID>-disk-0.raw,cache=writeback,discard=on,size=50G
    ...
By specifying gluster://, QEMU automatically hooks into its built-in libgfapi driver, entirely bypassing FUSE. [1, 2]
100% agree.

My point is that Proxmox dropping support for Gluster because QEMU is dropping libgfapi support, which treats Gluster as a block device rather than going through a FUSE-based client, just provides more data demonstrating the silliness of Proxmox dropping Gluster support.

People here think that the way that Proxmox mounts Gluster is the same way that QEMU mounts it, but as you have just stated, that isn't the case at all. Therefore, the reasoning that Proxmox should drop it because QEMU is dropping it is, again, silly.


Otherwise the NFS-Ganesha alternative (If you need GUI simplicity)
If you require strict Proxmox GUI storage management and LXC container support (as LXC cannot use libgfapi), standard NFS won't cut it. Instead, deploy NFS-Ganesha on your Gluster servers.
  • NFS-Ganesha is a user-space NFS server that links directly into libgfapi underneath.
  • You configure Proxmox via the GUI to talk to it as a standard NFS share.
  • Performance: Slower than direct QEMU libgfapi, but vastly superior to standard Linux kernel NFS or the FUSE mounts you described.
PS: In any case, run something like "pvesm add glusterfs my-gluster-fuse ..." so the Gluster storage is also available through the GUI and usable for LXCs, if you are not going the NFS route as well.
Yeah, that might work.

It depends on whether the storage node is the same as the node that's running said virtualisation workloads.

In resource constrained environments, a single system might be pulling multiple duties where the storage system and the system that's running said virtualisation workloads are the same, physical system.

And whilst you could use NFS-Ganesha, even for a local system; it would just be an extra step that the VM (and/or LXC) would have to go through, to be able to utilise gluster.

"as LXC cannot use libgfapi"
You can't pass that through/into said LXC?

I didn't know that.

(So far, I've only been testing with VMs, haven't tested with LXCs yet.)
 
From your writing, you clearly do not understand what PVE is
Oh....there are a lot of things that you clearly don't understand either, which I've already outlined previously.

I mean, you're literally arguing with the ceph devs - to which I say, "good luck with that one".
what qm does and to read through and disentangle it all seems like a waste of time. But qm is just a client to the PVE API (it simplifies QEMU only to a few relevant commands and builds an abstraction layer). PVE (the totality) is the management layer on top of QEMU, Ceph, ZFS, LVM and a few others
i.e. middleware

(Which you can observe when PVE is booting up.)
qm (the command) does not do any data reading or writing to disk for disk images, if you read the source code, where it may appear to do so, it just passes it through to a qemu-img tool
As I have already shown on commit id#: 7669a99e97f3fd35cca95d1d1ab8a377f593dccb, PVE uses glusterfs-client and not the GlusterFS block driver that QEMU uses.

Screenshot 2026-06-08 182440.png

Screenshot 2026-06-08 182547.png
(Source: https://www.gluster.org/qemu-glusterfs-native-integration/)

Therefore, by virtue of this alone, the two are different: glusterfs-client is a FUSE-based client, whereas QEMU's native GlusterFS client is a block driver that uses libgfapi. Block drivers and FUSE-based clients aren't the same thing.

Like I said, there are a lot of things that you don't appear to understand.


If QEMU doesn't support it, by definition PVE can't support it
Patently false.

PVE uses glusterfs-client. QEMU uses a GlusterFS block driver that uses libgfapi.

They are not the same thing.

In fact, the aforementioned commit, signed off by @Thomas Lamprecht, can be rolled back, and they would just have to restore the GlusterFSPlugin.pm to restore this capability/functionality even if QEMU no longer supports it via the GlusterFS block driver because PVE uses glusterfs-client and not the GlusterFS block driver that QEMU uses.

Like, how do you not get this difference???

It's literally in the commit.

Read the commit.

It's not that hard.

Like I said, there's a lot that you don't know about how things work because you aren't willing to read the commit. I can't help your unwillingness, or your lack of motivation, to read things that have already been sourced and cited. You can bring a horse to a lake, but you can't make him drink, even if it means he'll die of thirst.


but there is no obligation for Proxmox (the company) to implement every option that may be available in QEMU or extend support for things that QEMU supports.
Irrelevant.

PVE already uses (or used to use) glusterfs-client for the GlusterFSPlugin.pm.

See the commit.

Once again, you're still assuming that the mechanism that PVE uses to mount GlusterFS is the same as the mechanism that QEMU uses and it's not.

QEMU uses a block driver that uses libgfapi.

PVE uses a FUSE-based glusterfs-client. They're not the same thing. Not even close.
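A quick way to check which path a given storage actually takes is to look at the filesystem type of its mount point. The sketch below is only an illustration: the helper name `mount_fstype` is made up, the mount point is the one from the `pvesm add glusterfs` example earlier in the thread, and `fuse.glusterfs` is what I would expect a FUSE-mounted Gluster volume to report in /proc/mounts, so treat that string as an assumption to verify on your own node.

```python
from typing import Optional


def mount_fstype(mounts_text: str, mount_point: str) -> Optional[str]:
    """Return the filesystem type recorded for mount_point, or None if absent.

    mounts_text is expected in /proc/mounts format:
    <source> <mount point> <fstype> <options> <dump> <pass>
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            return fields[2]
    return None


# Hypothetical sample line (assumed format); on a real node, read /proc/mounts.
SAMPLE = (
    "192.168.254.1:gv0 /mnt/pve/my-gluster-storage fuse.glusterfs "
    "rw,relatime,user_id=0,group_id=0 0 0\n"
)

fstype = mount_fstype(SAMPLE, "/mnt/pve/my-gluster-storage")
print(fstype)
```

On an actual node you would pass `open("/proc/mounts").read()` instead of `SAMPLE`. A FUSE type there means the storage is going through the glusterfs-client mount, not through a QEMU block driver.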


A replacement to VMware more than a multi-tool for QEMU. If you want a multi-tool for QEMU, then use a multi-tool for QEMU like Quickemu, Virt-Manager etc.
This is a vastly more recent development due to Broadcom's pricing shenanigans with VMware licensing costs.

If PVE was a VMware competitor, then the Data Center Manager would have existed before said Broadcom/VMware shenanigans began, but that wasn't the case.

It's now being marketed/proposed as one of the options to displace VMware for those who don't want to be extorted by Broadcom.


therefore it can't write to a block device on Gluster. As I already explained, a block device is not just a file share. You can set up Gluster servers with NFS Ganesha and share it out that way so you can write disk images, but to do that in Proxmox would require them to write a Gluster management layer,
That's not how glusterfs-client, which PVE uses, works.

You're so hung up on the whole QEMU native GlusterFS block driver thing that you literally can't read the commit for yourself, where it explicitly tells us that it uses the FUSE-based glusterfs-client package to deal with/manage the GlusterFS storage backend (type).

It's amazing that I have to spell things out for you, explicitly.

Screenshot 2026-06-08 184202.png


which IT'S A DEAD PROJECT
Patently false.


using a file share to put qcow2 disk images on Gluster is not the best way, neither for performance or data safety.
What do you think the directory storage backend is???

Additionally, you can still mount a GlusterFS storage manually and add it as directory storage to Proxmox VE 9.
You can read the comment from Thomas Lamprecht himself.
 
Another data point (not performance, but gluster-isnt-dead) - Debian is shipping glusterfs 11.2 in forky: https://packages.debian.org/en/forky/glusterfs-server
FWIW, forky isn't released, and it's still about half a year until the freeze period starts, and until then lots of packages can still get dropped. Debian cares a bit less about the reasoning we had here, there is no support (and thus liability) sold from their side.
As I have already shown on commit id#: 7669a99e97f3fd35cca95d1d1ab8a377f593dccb, PVE uses glusterfs-client and not the GlusterFS block driver that QEMU uses.
That's just wrong; we used both: for CTs and file operations, the client was used, and for VMs, the QEMU library integration.
https://git.proxmox.com/?p=pve-qemu.git;a=commitdiff;h=4397bd351d8f86a087e6356f0fb1fdb684ca2dc4

This discussion (style) here is also not really fruitful and has rather lots of redundancy of moot arguments; let's not turn this into spam here.
 
You can add Gluster support with a local disk backend to Proxmox if you want - here are the packages which HAVEN'T BEEN REBUILT IN 2 YEARS https://launchpad.net/~gluster/+archive/ubuntu/glusterfs-11
@kayson just told you that Debian (which PVE is built on top of) is testing GlusterFS 11.2 three months ago.
and thus is likely no longer functional on Ubuntu 25 or 26 LTS (which Ubuntu are the kernels Proxmox follows).
Code:
Since Proxmox VE 8.4, the 6.14 kernel has been made available as an option. This kernel version is derived from Ubuntu 25.04.
(Source: https://pve.proxmox.com/wiki/Proxmox_VE_Kernel)

And yet, GlusterFS works perfectly fine in PVE 8.4.

Huh.

Welp, there goes your statement.

(You make this so easy.)

What were you saying about how I needed to learn and how I don't understand what PVE is again?

As per your Ceph comment, read again what it says you highlighted. It does not say it reads from all nodes in a cluster, as you previously indicated it works, it reads only the relevant DATA CHUNKS
Screenshot 2026-06-08 191013.png

I love how the screenshot, as developed by the ceph devs, literally tells you "Current code reads all data chunks" and then you changed that to read "Current code reads relevant data chunks", which isn't what the ceph devs wrote, at all.

"....relevant data chunks" isn't even remotely close to what the ceph devs wrote, at all.

Whyyy are you lying about what the ceph devs wrote?


Ceph devs: "Current code reads all data chunks..."

You: "Current code reads relevant data chunks..."

That's not what the ceph devs wrote, at all.

Why are you lying about what the ceph devs wrote???

Why are you lying by saying that the ceph devs wrote something that they literally did. not. write.?

It does not say it reads from all nodes in a cluster, as you previously indicated it works,
This is, once again, literally not what I said/wrote.

This is what I actually wrote:

and the picture shows you that it reads all of said data chunks from all OSDs.

Note, I wrote OSDs (which is what the picture shows) and not "all nodes" as you claim.

Why do you have to lie, in order for you to make your argument work for you?

I mean, the picture literally shows you that it is reading data, from the different OSDs.

Are you telling me that you can't understand a picture either?


which is specifically the n OSD that your data is on per the CRUSH algorithm.
Scroll down to the "Current overwrite implementation" section of the ceph erasure coding performance document that the ceph devs authored.

Screenshot 2026-06-08 192303.png

Again, you can certainly argue that you think that the ceph devs are wrong about this, but as to who would understand ceph better, you or the ceph devs, I'd trust the ceph devs to understand how ceph actually works any day of the week and twice on Sundays, rather than you, because they're literally the guys that are developing ceph.


The example you copied is just an example, if you have 300 OSDs, it will still only read from those 6, if you have 1000 it will still only read from those 6
So those 6 (n+k) OSDs would get hammered by CRUSH map lookup requests when you have 1000 OSDs then. That's what you're telling me.


As a result, Ceph can and does scale to thousands of OSD daemons and has linear performance increases up until the point you hit a physical limit (CPU, RAM, networking).
As you have proven, at 0.96% drive utilisation, you will, of course, need thousands of OSDs to make up for the poor drive utilisation from each OSD.

Your own data proves this.

You could've gotten 120 Tbps of total aggregate bandwidth that the drives themselves are able to provide, but in your ceph deployment, you're only able to get 1.15 Tbps of total actual bandwidth, or 0.96% of what the drives are actually capable of.

No wonder why you need thousands of OSDs to make up for this fact that even in your deployment, you have <1% drive utilisation (relative to what the drives themselves are capable of).

How much did your company pay per drive again?

(Notice how you still haven't disclosed anything about your hardware, software, configuration, commands/scripts you used, nor any other methodology details from your testing. I'd be embarrassed too after spending so much time complaining about the hardware that I was using.)

"...has linear performance increase...."
When you're only using 0.96% of the drive's capabilities (vs. 26% with gluster), of course you'd need to add more drives just to catch up.

If I assume 12 GB/s per NVMe 5.0 x4 drive (whether it's U.2, or E1.S EDSFF), 120000 Gbps (120 Tbps) / 96 Gbps = 1250 drives.

1150 Gbps (1.15 Tbps) / 96 Gbps = 11.97916667 (round this up to 12 drives).

If you were using the full capability of your drives, you could've gotten away with 12 drives instead of 1250 drives.

Given that 0.26 (gluster drive utilisation) / 0.0096 (ceph drive utilisation) = 27.0833333

Therefore, 1250 drives / 27.08333 = 46.15385 drives (round this up to 47 drives).

It's no wonder you need 1250 drives (running ceph) when you can get away with 47 drives (running gluster).

Heck, even if gluster sucked and you needed four times as many drives -- 47 x 4 = 188 drives. That's still only 15.04% of the 1250 drives that you deployed with the ceph cluster.
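For anyone who wants to re-run the numbers, here is the same arithmetic as a small script. It only reproduces the figures quoted in this post (12 GB/s per drive, 120 Tbps raw, 1.15 Tbps measured, 0.96% vs 26% utilisation); those inputs are taken as given, not independently verified, and the variable names are just for illustration.

```python
import math

# Inputs as quoted in the post (assumptions, not measurements of mine).
PER_DRIVE_GBPS = 12 * 8      # 12 GB/s per NVMe drive = 96 Gbps
RAW_TOTAL_GBPS = 120_000     # 120 Tbps aggregate raw drive bandwidth
MEASURED_GBPS = 1_150        # 1.15 Tbps actually delivered
CEPH_UTIL = 0.0096           # 0.96 % drive utilisation
GLUSTER_UTIL = 0.26          # 26 % drive utilisation

# Drives implied by the raw aggregate bandwidth.
deployed_drives = RAW_TOTAL_GBPS / PER_DRIVE_GBPS

# Drives needed if every drive ran at full speed.
ideal_drives = math.ceil(MEASURED_GBPS / PER_DRIVE_GBPS)

# How many times better the quoted gluster utilisation is.
util_ratio = GLUSTER_UTIL / CEPH_UTIL

# Drives needed at gluster's utilisation, and a 4x pessimistic case.
gluster_drives = math.ceil(deployed_drives / util_ratio)
pessimistic_drives = gluster_drives * 4
share = pessimistic_drives / deployed_drives

print(f"deployed drives:        {deployed_drives:.0f}")
print(f"ideal drives:           {ideal_drives}")
print(f"utilisation ratio:      {util_ratio:.2f}")
print(f"gluster drives:         {gluster_drives}")
print(f"4x pessimistic drives:  {pessimistic_drives}")
print(f"share of deployed:      {share:.2%}")
```

Changing any of the five inputs at the top changes every result below it, so the conclusion is only as good as those quoted figures.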

This is why you need ceph to be able to scale out, linearly with performance, because as your data shows, it has really poor drive utilisation.
 
You should develop your own storage plugin like I have done to understand the code you can find in the commit. It does not use glusterfs backend for the VM image, as the dev says. You can, if you want put the plugin back in place and see how well it works.

There is no CRUSH map lookup request, Ceph doesn’t have to query a node to see where the data is located, unlike Gluster, which does have to request the equivalent of every OSD for every request. There is a data block request with separate data blocks evenly distributed but individual blocks only across m+k OSD. I still can’t comprehend what you are saying if it reads from “all” OSD when you have an m+k layout, or how a SSD could exceed the limits of the network or how an SSD would even get to its own interface rate for your average random reads. Gluster is magic that breaks physics, we get it, in the real world, as someone else said on the first page, Gluster breaks at benchmark loads and is utterly abandoned.

Either way, I’m done with this discussion, you don’t comprehend.
 
You should develop your own storage plugin
That's a terrible idea and spoken like a true dev (who thinks that anybody can be a developer and thus take your job).

Non-developers shouldn't be developing anything because it will most certainly be bad. I wouldn't want to be a dev any more than I would want a dev to be the A-pillar/exterior mirror aeroacoustics engineer, because they're not in a position to understand the Navier-Stokes equations along with the various viscosity dissipation models that are needed to solve A-pillar/exterior mirror aeroacoustics problems.

There's a reason why developers exist. If anybody can be a developer, then it would just increase said supply of developers, and pursuant to the principles of supply and demand, would lower the value of individual developers, so no, that would be a terrible idea.
There is no CRUSH map lookup request,
Oh really???

"A CRUSH map also has a list of rules that determine how CRUSH stores and retrieves data."
(Source: https://www.ibm.com/docs/en/storage-ceph/9.9.0?topic=overview-crush-introduction)

For EC CRUSH map, you would need an EC CRUSH rule, and said EC CRUSH rule would be defined via the EC profile that the EC CRUSH rule will use for the EC CRUSH map.

Thus, when it is trying to find a file, it will use said EC CRUSH map, which also is stored on the primary OSD, to figure out which leaf (OSD) it needs to go, to grab the data, right?

Screenshot_2026-06-09_07-36-15.png
That's literally what the direct read I/O from the Erasure Coding Performance Enhancements page that the devs wrote, literally and explicitly states.

Again - can't help you if you aren't willing to read that EC Performance Enhancements document that said ceph devs wrote.

Either way, when you are looking for a file, there has to be some mechanism for it to know a) how to find it and b) where to look for it.

Think about how, when you are looking for a file from a ceph cluster, how it would accomplish both a) and b) if your statement is true that it doesn't use the CRUSH map (which is replicated on the primary OSD).
Ceph doesn’t have to query a node to see where the data is located
Huh.

See the highlighted sentence below.

Screenshot_2026-06-09_07-45-10.png
What were you saying about "Ceph doesn’t have to query a node to see where the data is located" again? (^see immediately above, where the ceph devs literally and explicitly tell you "....rather than directing all I/O requests to the Primary OSD."

What were you saying about "Ceph doesn’t have to query a node to see where the data is located" again?

(Why do you make this so easy?)


unlike Gluster, which does have to request the equivalent of every OSD for every request
PATENTLY FALSE.

"finding a file involves more than calculating its hashed location and looking there. That is in fact the first step, and works most of the time - i.e. the file is found where we expected it to be"
(Source: https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/dht/)

If you're looking for a file with gluster and it is able to identify the location via its hashed location, then that's it - it goes and grabs the file.

It will only hit all of the bricks, IF and ONLY IF the file cannot be found where it is supposed to be, based on its hashed location.
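To make the two cases concrete, here is a toy model of "look at the hashed location first, fall back to asking everyone only on a miss". It is a sketch of the idea described in the quoted docs, not Gluster's code: the SHA-256-modulo hash, the brick names, and the `lookup` helper are all made up for illustration (as far as I know the real DHT assigns hash ranges per directory rather than using a simple modulo).

```python
import hashlib
from typing import Dict, List, Optional, Set, Tuple


def hashed_brick(filename: str, bricks: List[str]) -> str:
    """Pick the brick a file name hashes to (toy hash, not Gluster's)."""
    digest = hashlib.sha256(filename.encode()).digest()
    return bricks[int.from_bytes(digest[:4], "big") % len(bricks)]


def lookup(
    filename: str, bricks: List[str], contents: Dict[str, Set[str]]
) -> Tuple[Optional[str], int]:
    """Return (brick holding the file or None, number of bricks contacted)."""
    first = hashed_brick(filename, bricks)
    contacted = 1
    if filename in contents[first]:
        return first, contacted          # common case: one brick contacted
    for brick in bricks:                 # miss: ask the remaining bricks
        if brick == first:
            continue
        contacted += 1
        if filename in contents[brick]:
            return brick, contacted
    return None, contacted


bricks = ["brick0", "brick1", "brick2", "brick3"]
name = "vm-100-disk-0.raw"
home = hashed_brick(name, bricks)

# Case 1: the file sits where its hash says it should.
contents = {b: set() for b in bricks}
contents[home].add(name)
hit = lookup(name, bricks, contents)

# Case 2: the file sits somewhere else, so the fallback kicks in.
other = next(b for b in bricks if b != home)
moved = {b: set() for b in bricks}
moved[other].add(name)
miss = lookup(name, bricks, moved)

print(hit, miss)
```

In the first case exactly one brick is contacted; only the second case walks further, which is the distinction being argued above.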

This is, of course, in contrast with ceph, where again, as the EC Performance Enhancements document explicitly shows, it is reading data from all OSDs, as described/explained to you by the ceph devs, as shown by the red arrows to highlight this point as authored by said ceph devs.

Screenshot 2026-06-08 191013.png
Screenshot 2026-06-08 192303.png
I mean -- the ceph devs can't make it any more obvious than what they've already done.

If you aren't willing to learn, there is nothing that I can do that'll help you with that.

If you aren't willing to learn from the ceph devs themselves, there's nothing that I can do that'll get you to learn, otherwise.


I still can’t comprehend what you are saying if it reads from “all” OSD when you have an m+k layout
If you still can't comprehend it, then I would recommend that you study some more.

As shown in the "Current Overwrite Implementation", ceph just reads from all k OSDs and will only need to read from m OSDs IF and ONLY IF it can't read the file properly, and then it has to reconstruct the file from the parity data by reading the m OSD(s).

But if the file doesn't require said reconstruction, then you only need to read the chunks of the file from k OSDs only, again, as diagramed by said ceph devs.



in the real world, as someone else said on the first page, Gluster breaks at benchmark loads and is utterly abandoned.
And yet, my results showed that Gluster was what? 9 TIMES faster than ceph on the sequential write workload. Something like that.

And as a function of how much of the drive's performance it is able to use, it's 26% vs your 0.96%. And I tested it with an actual Win11 VM workload.


Either way, I’m done with this discussion
Thank goodness!!!


you don’t comprehend.
Says the person who doesn't/can't/won't read what the ceph devs wrote, where you make it super easy for me to disprove your statements by citing said ceph devs.

Also says the person who had to lie to make your argument work for you.

Ceph devs: "Current code reads all data chunks..."

You: "Current code reads relevant data chunks..."

That's not what the ceph devs wrote, at all.
 
Know little about Gluster, so can't compare features/performance, but wanted to clarify some points regarding Ceph (again):

Thus, when it is trying to find a file, it will use said EC CRUSH map, which also is stored on the primary OSD, to figure out which leaf (OSD) it needs to go, to grab the data, right?
Among other tasks, the Ceph MON provides a copy of the CRUSH map to every component in the Ceph cluster, including clients, both when a client connects (e.g. QEMU starts a VM/LXC) and whenever anything related to OSDs changes (OSD in/out, host up/down, PG count change, etc.). The Ceph client then does a local lookup, within its own memory, to find the Primary OSD for the PG where an object is stored, and it will usually communicate with that Primary OSD to read/write data (this can be customized, e.g. rbd_read_from_replica_policy). No disk I/O is involved during such lookup. The copy of the CRUSH map in the OSDs is used for replicas, EC, recovery, etc. (OSDs are responsible for keeping replicas/EC k,m fragments), not for client I/O.


What were you saying about "Ceph doesn’t have to query a node to see where the data is located" again?
So no, the Ceph client has to query nothing but its own CRUSH map to look up where any given object is. The Ceph client already knows which OSD to speak with thanks to the deterministic nature of the CRUSH algorithm for a given cluster with a given number of hosts, OSDs, pools, CRUSH rules and PGs.
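A rough way to picture this kind of client-side lookup is the sketch below. It is explicitly not the real CRUSH algorithm (no hierarchy, weights or rules); it swaps in a simple hash plus highest-random-weight ranking, and every function name here is hypothetical. The only point is that the answer is computed from data the client already holds, with no network or disk query.

```python
import hashlib

def _h(*parts):
    """Deterministic 64-bit hash of the given parts (toy stand-in)."""
    data = "/".join(str(p) for p in parts).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def object_to_pg(object_name, pg_num):
    """Map an object name to a placement group number."""
    return _h("obj", object_name) % pg_num

def pg_to_osds(pg, osd_ids, size):
    """Rank all OSDs for this PG and keep the first `size` of them."""
    ranked = sorted(osd_ids, key=lambda osd: _h("pg", pg, osd), reverse=True)
    return ranked[:size]

def primary_osd(object_name, pg_num, osd_ids, size):
    """The first OSD in the PG's ordered set acts as the primary."""
    return pg_to_osds(object_to_pg(object_name, pg_num), osd_ids, size)[0]
```

Given the same inputs (the "map"), every client computes the same primary for the same object, which is the property the post above relies on.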


This is, of course, in contrast with ceph, where again, as the EC Performance Enhancements document explicitly shows, it is reading data from all OSDs, as described/explained to you by the ceph devs, and as highlighted by the red arrows drawn by said ceph devs.
Regarding EC reads: up to 19.2 Squid, Ceph has to read from all "k" OSDs holding data even if the read I/O size is smaller than the stripe width (which is typically 4KB*k, so 16KB with k=4, m=2). On 20.2 Tentacle, partial reads have been implemented via allow_ec_optimizations, which allows a read I/O smaller than the stripe width to read just from the OSD that holds that data, reducing I/O load on disks and latency.
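The stripe arithmetic above can be sketched as follows. This is only an illustration of the counting, assuming a 4 KiB chunk per data shard and a simple round-robin layout of chunks across the k data shards; the function names are made up and nothing here calls into Ceph.

```python
def shards_touched(offset, length, k, stripe_unit=4096):
    """Data shard indices (0..k-1) that hold the bytes [offset, offset+length)."""
    if length <= 0:
        return set()
    first_unit = offset // stripe_unit
    last_unit = (offset + length - 1) // stripe_unit
    if last_unit - first_unit + 1 >= k:
        return set(range(k))
    return {unit % k for unit in range(first_unit, last_unit + 1)}

def shards_read(offset, length, k, stripe_unit=4096, partial_reads=False):
    """Shards read for one I/O: all k without partial reads, else only those touched."""
    if length <= 0:
        return set()
    if partial_reads:
        return shards_touched(offset, length, k, stripe_unit)
    return set(range(k))

# With k=4 the stripe width is 4096 * 4 = 16384 bytes.
# A 4 KiB read at offset 0 touches one shard, but without partial reads all 4 are read.
```

So for a 4 KiB read on k=4, this model reads 4 shards in the old behaviour and 1 shard with partial reads.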

On any Ceph version, a given read I/O goes to the Primary OSD of the PG, and this OSD issues read I/Os to at least k-1 OSDs (k OSDs if the Primary holds an m shard) to gather the rest of the stripe. The exception is 20.2 Tentacle with allow_ec_optimizations, where the Primary OSD will return the data itself if it has the required shard, or issue an I/O to the one OSD that has it, thanks to partial reads.

Direct reads will allow clients to access the OSD holding the needed shard directly, without going through the Primary OSD. AFAIK this isn't implemented yet.

That said, I stopped reading most of this thread the moment you based your complaint on a performance comparison against an old PVE setup running Ceph 17.2 with an EC k=2,m=1 HDD pool. That was never a performant or recommended storage backend for VMs.

Also, HDDs with Ceph are slow, even for replicated pools. Everyone knows that RocksDB on HDDs is hard because HDDs have inherently poor random-seek performance. Even with SSDs, EC pools have traditionally been slow for VM workloads, although this has improved significantly with Ceph 20.2 Tentacle and its FastEC optimizations, including partial reads and partial writes. Previously, a write I/O smaller than the stripe size involved expensive read-modify-write behavior: Ceph had to read the relevant existing stripe data/parity, modify the data in memory, recalculate the coding/parity information, and write the updated chunks back.
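The read-modify-write sequence described above can be sketched like this. It is a toy model only: plain XOR parity stands in for the real erasure code (so it corresponds to m=1), and the function names are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def xor_parity(chunks):
    """Single parity chunk over all data chunks (toy stand-in for the EC code)."""
    out = bytes(len(chunks[0]))
    for chunk in chunks:
        out = xor_bytes(out, chunk)
    return out

def read_modify_write(stripe, offset, payload):
    """Apply a small write to a stripe of k equal-size data chunks.

    Returns (new_data_chunks, new_parity_chunk).
    """
    chunk_size = len(stripe[0])
    buf = bytearray(b"".join(stripe))            # 1. read the existing stripe data
    if offset < 0 or offset + len(payload) > len(buf):
        raise ValueError("write does not fit inside the stripe")
    buf[offset:offset + len(payload)] = payload  # 2. modify the data in memory
    new_chunks = [bytes(buf[i:i + chunk_size])
                  for i in range(0, len(buf), chunk_size)]
    new_parity = xor_parity(new_chunks)          # 3. recalculate the parity
    return new_chunks, new_parity                # 4. caller writes these back
```

Even a 2-byte write in this model pulls in the whole stripe and rewrites parity, which is the cost that partial writes are meant to reduce.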

I think that comparison is unfair and unnecessary. The stronger argument for keeping PVE support for Gluster is that Gluster still provides benefits for your specific use case, rather than framing it as Ceph being “better” or “worse”.


PS: I hope I don't regret having written this post, given the general tone of the discussion :)
 
No disk I/O is involved during such lookup.
So...the lookup happens in RAM?

Just out of curiosity - how much RAM does the CRUSH map usually consume?
The copy of CRUSH map in OSD is used for replicas, EC, recovery, etc (OSDs are responsible of keeping replicas/EC k,m fragments), not for client I/O.
Got it.

Then why did the ceph devs (in the Erasure Coding Enhancement document) quote:

"We want clients to submit small I/Os directly to the OSD that stores the data rather than directing all I/O requests to the Primary OSD and have it issue requests to the secondary OSDs."

?

(emphasis mine)

If the copy of the CRUSH map isn't for client I/O, then why did the ceph devs write/say "...rather than directing all I/O requests to the primary OSD"?

I don't understand why the ceph devs would write this, if it isn't true.


So no, the Ceph client has to query nothing but its own CRUSH map to look up where any given object is. The Ceph client already knows which OSD to speak with thanks to the deterministic nature of the CRUSH algorithm for a given cluster with a given number of hosts, OSDs, pools, CRUSH rules and PGs.
If the Ceph client only has to check its local copy of the CRUSH map (presumably held in the client's RAM), then I don't understand why the ceph devs would write that quote: "...rather than directing all I/O requests to the primary OSD".

That would mean that what the ceph devs wrote here is incorrect and said ceph devs should probably change it.


Regarding EC reads: up to 19.2 Squid, Ceph has to read from all "k" OSDs holding data even if the read I/O size is smaller than the stripe width (which is typically 4KB*k, so 16KB with k=4, m=2). On 20.2 Tentacle, partial reads have been implemented via allow_ec_optimizations, which allows a read I/O smaller than the stripe width to read just from the OSD that holds that data, reducing I/O load on disks and latency.
Yeah, I saw that.


Direct reads will allow clients to access the OSD holding the needed shard directly, without going through the Primary OSD. AFAIK this isn't implemented yet.
This is my understanding as well.


That said, I stopped reading most of this thread the moment you based your complaint on a performance comparison against an old PVE setup running Ceph 17.2 with an EC k=2,m=1 HDD pool. That was never a performant or recommended storage backend for VMs.
Whilst it is true that it is neither performant nor recommended, other data has been supplied from configurations that are supposed to follow the performance recommendations. That data shows that, out of the potential performance capability of the drives being used, less than 1% is actually achieved in their (his) actual (business/commercial) deployment.

Therefore; whilst yes, I can only test with what I have available, the data from production ceph deployments also shows that even with a "real" deployment, it doesn't really get significantly better than what I have observed/recorded in regards to performance utilisation (% of the drive's performance that I am able to achieve).

The best that I've seen (shared by other users) is about 8% of a drive's performance capability, on a deployment that should follow the deployment recommendations and guidelines to get the most (performance) out of a ceph cluster.

That's still only 3 percentage points better than what I'm able to get out of my EC(2,1) with 13 year old HDDs, which is a far cry from the 26% I'm able to get with the same HDDs, using Gluster.

And the people who have shared their large scale, business/commercial deployments (which again, presumably have been configured properly, per the recommendations, so that they will be performant): many of those who have shared their data/results can't explain why they're only getting a fraction of what their drives should be capable of, in terms of performance.

And whilst you might be able to get 180 GB/s with 300 drives, that still means that each drive is only contributing a max of 600 MB/s. If the drive is capable of 12000 MB/s (for an NVMe 5.0 x4 SSD), you're still getting/using only 5% of the performance capability of said drive, and no one who has deployed ceph at such (large) scale has been able to educate me on why they're only getting 5% of the drive's rated performance.

That's the part that I don't understand: why they're only getting a fraction of what the drive is capable of. And given the lack of responses from the people who either helped with or were responsible for overseeing said deployments, they don't really seem to know why either.


Also, HDDs with Ceph are slow
There's no debate about that.

But this is also why I am talking about ceph performance as a % of what the drive should be able to do. (Because if a HDD can only max out at like 150 MB/s, then at 5% of that, I'm not going to expect more than 7.5 MB/s each.)

Similarly, if I have a NVMe 5.0 x4 SSD that's supposed to be capable of 12 GB/s sequential read/writes, and I'm only getting 600 MB/s, yes, 600 MB/s is faster than 30 MB/s but 600 MB/s out of 12000 MB/s is still only 5%.
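For anyone who wants to check the percentages, here is the arithmetic as a small sketch. The figures are the ones quoted in this thread (180 GB/s over 300 drives, 12000 MB/s rated for the NVMe drive, 150 MB/s for the HDD); the function names are just made up for illustration, and rated sequential throughput is assumed to be the right baseline.

```python
def per_drive_mb_s(aggregate_mb_s, drives):
    """Average throughput each drive contributes to the cluster total."""
    return aggregate_mb_s / drives

def utilisation_pct(achieved_mb_s, rated_mb_s):
    """Achieved throughput as a percentage of the drive's rated throughput."""
    return 100.0 * achieved_mb_s / rated_mb_s

nvme_each = per_drive_mb_s(180_000, 300)      # 600.0 MB/s per drive
nvme_pct = utilisation_pct(nvme_each, 12000)  # 5.0 %
hdd_pct = utilisation_pct(7.5, 150)           # 5.0 %
```

Both device classes land on the same 5% figure in this calculation, which is the point being made about the percentage not changing with the device class.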

In other words, as a percentage of the drive's capability, the device class doesn't really seem to "magically" utilise 50% of a NVMe 5.0 x4's performance capability by switching from the HDD device class to the NVMe device class. It stays relatively low (max people have reported is about 8%).


Everyone knows that RocksDB on HDDs is hard because HDDs have inherently poor random-seek performance.
I think that even for random I/O performance, it's still <10% of what a U.2 or E1.S EDSFF NVMe 5.0 x4 SSD is supposed to be capable of.


Previously, a write I/O smaller than the stripe size involved expensive read-modify-write behavior: Ceph had to read the relevant existing stripe data/parity, modify the data in memory, recalculate the coding/parity information, and write the updated chunks back.
Agreed. The EC Enhancements doc talks about this.


The stronger argument for keeping PVE support for Gluster is that Gluster still provides benefits for your specific use case, rather than framing it as Ceph being “better” or “worse”.
Well, I'm looking at it from a drive performance capability usage/percentage POV.

If I can use 26% of a HDD, I go from 7.5 MB/s (each) to 39 MB/s (each). And thus, with k=2, I would, in theory, be able to go from 15 MB/s to 78 MB/s.

Now if I apply that to a U.2 or E1.S EDSFF NVMe SSD, if you are only getting 600 MB/s (5%) and you go up to 26%, then you're hitting 3.12 GB/s, which would be a HUGE performance benefit for whatever workload you're running.

And who wouldn't want their storage subsystem to have higher sequential bandwidth and/or be able to handle/serve more random I/O requests? Especially if the same or very similar level of performance that currently takes you 1250 drives could be achieved with just 250 drives (because the % of the drive's performance utilisation increased by 5x).
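The projection above can be written out as a quick sketch. It assumes aggregate throughput scales linearly with drive count and with per-drive utilisation, which is a simplification, and it uses the rounded 5x factor (5% to 25%) for the drive count. The function names are hypothetical.

```python
import math

def projected_mb_s(rated_mb_s, utilisation_pct):
    """Per-drive throughput at a given utilisation percentage."""
    return rated_mb_s * utilisation_pct / 100.0

def drives_needed(current_drives, current_pct, target_pct):
    """Drives needed for the same aggregate throughput at a higher utilisation."""
    return math.ceil(current_drives * current_pct / target_pct)

hdd_each = projected_mb_s(150, 26)        # 39.0 MB/s per HDD
hdd_k2 = 2 * hdd_each                     # 78.0 MB/s across k=2
nvme_each = projected_mb_s(12000, 26)     # 3120.0 MB/s, i.e. 3.12 GB/s
fewer = drives_needed(1250, 5, 25)        # 250 drives
saving_pct = 100.0 * (1 - fewer / 1250)   # 80.0 %
```

Using 26% instead of the rounded 25% gives 241 drives under the same assumptions, so the 250 figure is slightly conservative.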

You'd save your company millions.

Who wouldn't want that?

And if this is about Proxmox being a business, to serve the needs of businesses: just imagine what Proxmox's marketing department could do by telling their prospective (and existing) customers "hey, I can cut your storage costs down by 80% by using this technology that we've integrated into our Proxmox Virtual Environment product".

It would be a huge marketing win for Proxmox.

(heck, even if you don't get 80% savings, and you get 40-50% savings, it'd still be better than NO savings.)
 