I thought the problem was the QEMU support for GlusterFS? You'd need to raise this with the QEMU devs and not Proxmox afaik, I think...
Again, I will refer you to the problems with EC ceph that the guy from 45Drives highlights in their video.
Ceph's erasure-coded pools typically incur higher write amplification and network overhead because data and parity are distributed across multiple OSDs. A 6+2 profile has 8 participating nodes vs 3 for a 3-way replicated pool, increasing latency and reducing small-write performance. However, for the larger objects that you would see in modern workloads, this difference is minimal. Hence we store boot disks on 3-way replicated pools and large data disks on 4+2 erasure-coded pools. By default, it will rely less on things like filesystem caches and wants the client to handle that.
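For what it's worth, the 6+2 vs 3-way comparison above can be put into numbers with a few lines of Python. This is only a rough sketch: the function names are made up for illustration, and it counts shards and raw capacity only, not latency or small-write behaviour.

```python
def ec_profile(k: int, m: int) -> dict:
    """Summarise a k+m erasure-coded profile (k data shards, m parity shards)."""
    return {
        "shards_per_write": k + m,        # devices touched by one full-stripe write
        "raw_per_usable": (k + m) / k,    # raw bytes stored per usable byte
        "max_lost_shards": m,             # shards that can be lost without data loss
    }


def replicated_profile(copies: int) -> dict:
    """Summarise an N-way replicated pool in the same terms."""
    return {
        "shards_per_write": copies,
        "raw_per_usable": float(copies),
        "max_lost_shards": copies - 1,
    }


if __name__ == "__main__":
    for name, profile in [
        ("EC 6+2", ec_profile(6, 2)),
        ("EC 4+2", ec_profile(4, 2)),
        ("3-way replica", replicated_profile(3)),
    ]:
        print(name, profile)
```

The trade-off is visible directly: 6+2 touches 8 devices per write but stores about 1.33 raw bytes per usable byte, while 3-way replication touches 3 devices and stores 3.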
...at the expense of speed. (Again, cf. the dev's page about performance issues with EC ceph.) Unless, of course, you want to argue that the ceph devs are wrong about their own documentation about EC ceph's performance issues.
but Ceph provides strong guarantees around data integrity, failure recovery, and behavior at large scale.
Ceph is faster than Gluster today
ceph on the left, gluster on the right.
Modern Ceph uses BlueStore and has modern CPU optimizations, and it can scale because it uses an algorithm to determine object placement without expensive metadata lookups.

Again, as the screenshot above patently shows - this is quite literally and easily proven to be demonstrably false.
Recovery is object-based and deterministic, while Gluster relies heavily on background healing and filesystem operations.
On modern NVMe and 25G+ networks, the CPU cost of checksums and erasure coding is negligible. Today Ceph will outperform Gluster even in relatively small clusters while providing stronger integrity and recovery guarantees.
Still no data.
Comparing GlusterFS and Ceph on 15-year-old hardware is like running a VW Beetle, then transferring that engine into a modern car and comparing how fast they go. Sure, the Beetle will beat the new car in a speed race, but it will kill its occupants in a crash and be overall much less safe on the road today, and the Beetle still can't compare to the performance of your average modern car.
Still no data.
If you want a technical discussion, you need to do it on technical terms.
Your original statement/quote said nothing about random throughput.
There is not a single spinning hard drive in the world that can sustain 100MB/s on an uncached random pattern.
Show me, where, in your original statement;
That alone tells me something is wrong with your benchmark.
Again, the data from CrystalDiskMark shows you that gluster random I/O performance is 1.69 MB/s, which is well in line with your expectations.
8-10MB/s for uncached 4K reads/writes on spinning rust sounds more accurate; true random 4K read/write would top out somewhere around 0.5MB/s. You need to do something like 1-4MB sequential chunks to get to 100MB/s on 7200RPM.
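As a back-of-the-envelope check on those spinning-disk numbers, here is a minimal sketch of the usual seek + rotational latency + transfer model. The 8.5 ms average seek and 200 MB/s sustained media rate are assumptions picked for illustration, not measurements from anyone's hardware in this thread.

```python
def hdd_random_throughput_mb_s(
    io_size_bytes: int,
    rpm: int = 7200,
    avg_seek_ms: float = 8.5,        # assumed average seek time
    sustained_mb_s: float = 200.0,   # assumed sequential media rate
) -> float:
    """Estimate uncached random-access throughput of a single HDD in MB/s."""
    # On average the platter has to turn half a revolution before the data arrives.
    rotational_latency_ms = 0.5 * 60_000 / rpm
    # Time spent actually moving the bytes once the head is in position.
    transfer_ms = io_size_bytes / (sustained_mb_s * 1_000_000) * 1000
    service_time_ms = avg_seek_ms + rotational_latency_ms + transfer_ms
    iops = 1000 / service_time_ms
    return iops * io_size_bytes / 1_000_000


if __name__ == "__main__":
    for label, size in [("4 KiB", 4096), ("1 MiB", 1 << 20), ("4 MiB", 4 << 20)]:
        print(f"{label}: {hdd_random_throughput_mb_s(size):.2f} MB/s")
```

With these assumed figures, 4 KiB random I/O lands well under 1 MB/s, and it takes chunks in the MiB range before a single 7200RPM drive gets anywhere near 100 MB/s.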
I never said that EC pools are about throughput.
EC pools are not about throughput, they’re about space optimization.
I don't know how many times I can explain that if your drive is capable of 12 GB/s reads and you're only getting 600 MB/s, then 600 MB/s divided by 12000 MB/s = 0.05.
There is always a trade-off. NVMe is much better than spinning disks for IOPS, but I don’t get where you get 5% usage.
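The 5% figure is just observed throughput divided by what the device could deliver. A tiny sketch of that ratio, using the drive numbers above and the cluster-wide 1.15 Tbps vs 120 Tbps figures quoted further down the thread (the helper name is made up for illustration):

```python
def utilization(observed: float, capacity: float) -> float:
    """Fraction of the theoretical capacity actually used (same units for both)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return observed / capacity


# Single drive: 600 MB/s observed out of 12,000 MB/s rated reads.
drive = utilization(600, 12_000)

# Cluster: 1.15 Tbps measured vs 120 Tbps aggregate SSD bandwidth.
cluster = utilization(1.15, 120)

if __name__ == "__main__":
    print(f"drive:   {drive:.2%}")
    print(f"cluster: {cluster:.2%}")
```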
As it has already been stated, my network is 100 Gbps IB.
There are various tradeoffs but if your network fabric is 100GbE
Not really.
you will never be able to push more than 100Gb, whereas a single NVMe today can push multiples of that. The benefit of buying a better NVMe is lower latency and potentially better write endurance.
1.15 Tbps / 120 Tbps = 0.0096.
Ceph is probably close to 80-90% of theoretical performance in real world scenarios. I have synthetic benchmarks for my EC pool to 1.15Tbps cluster-wide which is near 98% of line speed for this particular cluster. Sure in theory my SSD can provide close to 120Tbps in aggregate,
The most telling part about all of this is the answer to the question "what do you run, in your own personal homelab where you (often) don't have a multi-million dollar IT budget?"
spinning disks are slower, consume more power, take up more space. Cheaper SSDs are also slower, consume more power etc. The difference between 100ns latency and 2 millisecond latency is profound. The overhead of SAS/SATA vs NVMe is noticeable. Being able to sustain 50k IOPS under load vs a consumer SSD crashing down to 500 IOPS after the cache is full, those are all issues I think about.
Read through the ceph EC performance enhancements docs authored by the ceph devs more closely. Again, they explicitly tell you that in the current read implementation, it reads all of the chunks from all OSDs.
Ceph EC has no exponentially increasing synchronization overhead like MPI; the CRUSH algorithm will only select n+k nodes for each block. It doesn’t have to synchronize to all nodes because the client knows exactly where the data is located based on its object id and the map.
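To illustrate the "client computes the location from the object id and the map" idea, here is a small sketch using rendezvous (highest-random-weight) hashing. To be clear, this is NOT the CRUSH algorithm: real CRUSH goes through placement groups and is aware of device weights and the failure-domain hierarchy. The OSD names and object id are hypothetical. The only point is that placement is a pure function of the inputs, so no metadata server has to be asked.

```python
import hashlib


def place(object_id: str, osds: list[str], k: int, m: int) -> list[str]:
    """Pick k+m distinct OSDs for an object, deterministically, with no lookup table.

    Stand-in for CRUSH using rendezvous hashing: every OSD gets a score derived
    from (object_id, osd) and the top k+m scores win.
    """
    width = k + m
    if width > len(osds):
        raise ValueError("not enough OSDs for this profile")

    def score(osd: str) -> int:
        digest = hashlib.sha256(f"{object_id}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(osds, key=score, reverse=True)[:width]


if __name__ == "__main__":
    cluster_map = [f"osd.{i}" for i in range(12)]  # hypothetical 12-OSD map
    print(place("rbd_data.1", cluster_map, k=4, m=2))
```

Any client holding the same map gets the same 6 OSDs for the same object id, and only those 6 are involved, regardless of how many OSDs the cluster has.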
(Again, the ceph devs literally and explicitly tell you that they read all data chunks from all OSDs. It's not conjecture. The ceph devs literally tell you that, as shown in the screenshot from their docs above.)
Cuz you were complaining about the results and methodology.
As to whether I should test Gluster again, I ask why.
no data.
it doesn’t perform well on modern hardware and is outperformed by Ceph.
What part of "I didn't get/deploy 100 Gbps for storage" don't you seem to understand?
But your fabric is still 100Gb. You don't seem to understand that you cannot, regardless of method, go beyond that metric. Gluster can't replicate your data faster than your fabric, and neither can Ceph, REGARDLESS of whether you use EC or replicated pools. You can't have a client write faster than the link to its storage. And with modern CPUs, EC is about as fast as a replicated pool if properly architected.
This has already been answered.
If Gluster is pushing 100MB/s on a spinning drive, it's physically impossible that it is committing that to 3 or 5 other disks across a gigabit network. You show benchmarks indicating a full gigabit network. Are you comparing Gluster on your IB with a gigabit replicated Ceph and surprised they don't go equally fast?
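The gigabit argument is a simple bandwidth budget. A rough sketch, under the loudly stated assumption that every copy of the data has to cross the same link (whether that holds depends on where the replication fan-out happens in a given setup); the function name is made up for illustration:

```python
def link_budget(write_mb_s: float, copies: int, link_gbit_s: float):
    """Check whether a link can carry `copies` full streams of a given write rate.

    Assumes all copies traverse the same link; protocol overhead is ignored,
    so the real ceiling is somewhat lower than the number computed here.
    """
    link_mb_s = link_gbit_s * 1000 / 8   # Gbit/s -> MB/s (decimal units)
    needed_mb_s = write_mb_s * copies
    return needed_mb_s <= link_mb_s, needed_mb_s, link_mb_s


if __name__ == "__main__":
    for gbit in (1, 100):
        ok, needed, available = link_budget(100, copies=3, link_gbit_s=gbit)
        print(f"{gbit} Gbit/s link: need {needed:.0f} MB/s, have {available:.0f} MB/s, fits={ok}")
```

Under that assumption, 100 MB/s written three times needs 300 MB/s, which a 1 Gbit/s link (125 MB/s) cannot carry but a 100 Gbit/s link easily can.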
Right. Again, this is why I tested both, on the same physical box and with six HDDs rather than just three.
The problem is the variables, hardware, software, user experience, the list goes on and on.
Sorry to hear about that man.
PS. I'm in a bad mood, I got my cooling script wrong and lost a 15TB NVMe due to heat overload. It was the last drive in the front vertical 2.5" slots of a 2U server and just didn't get ideal airflow during a recent heat wave. Now I'm angry at myself for being so complacent because 15TB NVMes aren't exactly cheap. The good news is that it's all backed up!
That really sucks!
(If you ever need me to run the CFD studies for you, just hit me up. I've already studied the optimal placements for thermal pads on NVMe SSDs, using CFD. It's better, in terms of thermal conduction, if you place individual thermal pads on the individual NAND flash chips than if you place one long, continuous thermal pad across all of the chips simultaneously. By cutting the thermal pads and placing them on the individual chips, you limit how much planar heat conduction happens within the thermal pad and instead promote orthogonal heat conduction, so more of the heat goes into your heatsink rather than being "spread across" (planar conduction) said one long, continuous thermal pad.)
No problem.
That's good to know! Thanks! I've always just run a strip of thermal pad across the lot.
No, but again, mass manufacturing means that they probably can't take the time to cut the thermal pads and place them on the individual chips, as that'll just take too long.
Although in this case it is a U.3 drive, so not really something you think about opening and re-padding. Plus, with it being a Samsung PM1733, they do run crazy hot as it is.
Short answer?
No idea why all the other manufacturers effectively seal their drives, with some only having a tiny orifice at the front and at the port end.