[SOLVED] Ceph (stretched cluster) performance troubleshooting

Coming back on this as a follow-up...
We were able to source Micron 7300 Pro 1.92TB NVMes.
Here are the results; the same fio command was used:

Code:
Jobs: 1 (f=1): [W(1)][100.0%][w=2230KiB/s][w=557 IOPS][eta 00m:00s] Samsung 970 EVO Plus 2TB
Jobs: 1 (f=1): [W(1)][100.0%][w=274MiB/s][w=70.0k IOPS][eta 00m:00s] Micron 7300 Pro
 
Hi mate, good that you solved the issue, we are in the same boat as you were before. :(
Is this 70k IOPS improvement just from changing the drives to Micron NVMe in the 2-DC architecture?
Or have you done any other configuration changes (network latency tuning, etc.)?
And what VM OS are you doing the fio testing on - Linux or Windows?

Regards.
 
Is this 70k IOPS improvement just from changing the drives to Micron NVMe in the 2-DC architecture?

This will be at least part of it, since he started (see first post) with consumer drives without power-loss protection.

I doubt, however, that you will get an answer from a four-year-old thread.

Please create a new thread and describe your problem with as much detail as possible (hardware used, networking setup, configurations).

Often enough, cheaping out on the network leads to a bottleneck for Ceph.
 
The results are from the following fio command executed directly on the NVMes; Ceph is not involved in this test...

Code:
fio --ioengine=libaio --filename=/dev/nvme... --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio

Ceph is pretty much default, there is no tuning.
fio is executed on the Proxmox host, so the OS is Debian.

If you are using consumer-grade NVMes, then stop and swap them for proper enterprise drives. There is no sense in continuing with consumer drives - that is my key takeaway.
 
First of all, thanks for replying to a 4-year-old thread. :)
So, my curiosity is: by changing the drives to NVMe, how much did your stretched cluster's IOPS actually improve? (If you can share an IOPS figure for your stretched cluster, it will be the baseline we work towards!)
Did you do any Ceph tuning for the stretched cluster?
Did you do any network tuning, in general or specifically for the stretched cluster?

Thanks.
 
Note that benchmarks rarely translate to other setups; there are loads of variables.

For a stretched cluster: make sure you have sufficient bandwidth and low latency between the two endpoints, and make sure you have a CRUSH map that accounts for the two sites. Accept that the fastest it will ever go is the time it takes the data to make a round trip to the other DC. So if your latency is ~1 ms per packet, you cannot expect >1000 IOPS in a single queue. At that point NVMe vs SAS doesn't matter all that much; as long as you use datacenter SSDs, performance will be similar, although these days SAS is only offered for legacy reasons and almost anything 'new' will be NVMe.
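A rough sanity check for that ceiling (the address below is a placeholder) is to measure the round trip between the sites and invert it:

Code:
# run from a node in DC1 against a node in DC2
ping -c 100 -q 10.0.2.11
# a single sync-write queue can't beat roughly 1 / RTT:
#   1.0 ms RTT -> ~1000 IOPS per queue
#   0.5 ms RTT -> ~2000 IOPS per queue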

The reason "eco" SSDs perform worse than "pro" SSDs, which in turn perform worse than actual datacenter SSDs, comes down purely to the design of the SSD itself. A 2TB datacenter SSD in 2.5" format will often still have 16 or 32 flash chips, so writes are striped across more chips, whereas an eco SSD will have 1 and a "pro" may have 2 or 4.

The benchmarks for eco/pro SSDs are done against a massive RAM cache on the drive, so 1M IOPS can be sustained for ~2GB before the drive has to stop and commit that to flash, which is when your system craters or even times out. A datacenter SSD is benchmarked with sync writes, so it will have lower numbers on paper, but it can sustain them 24/7 under full load. Datacenter SSDs also reserve more spare space and embed a more expensive CPU, thus allowing better/faster optimizations like compression and deduplication on "the same" TLC/QLC chips.
 
Excellent response above, nothing to add.
Just know our journey was from consumer to enterprise NVMes; we never used SATA/SAS SSDs for the Ceph cluster.

With regards to crush map, I have found the following in our docs:
Code:
ceph osd tree

ceph osd crush add-bucket org-name root
ceph osd crush add-bucket DC1 datacenter
ceph osd crush add-bucket DC2 datacenter

ceph osd crush move DC1 root=org-name
ceph osd crush move DC2 root=org-name

ceph osd crush move pve11 datacenter=DC1
ceph osd crush move pve12 datacenter=DC1
ceph osd crush move pve13 datacenter=DC1
ceph osd crush move pve21 datacenter=DC2
ceph osd crush move pve22 datacenter=DC2
ceph osd crush move pve23 datacenter=DC2

ceph osd tree
ceph osd crush remove default

And then the actual crush rule will be as follows:
Code:
# rules
rule multi_dc_rule {
    id 0
    type replicated
    step take org-name
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

From here you create a new pool using the rule above (multi_dc_rule).
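For example, the pool creation could look something like this (pool name, PG count and replica counts are placeholders to illustrate the idea; adjust them to your cluster):

Code:
# create a replicated pool that uses the multi_dc_rule CRUSH rule
ceph osd pool create ceph_pool 128 128 replicated multi_dc_rule
# 2 copies per DC = 4 copies total; keep serving I/O with 2 copies left
ceph osd pool set ceph_pool size 4
ceph osd pool set ceph_pool min_size 2
# tag the pool for RBD usage
ceph osd pool application enable ceph_pool rbd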

To check the rbd image distribution use the following:
Code:
ceph osd map ceph_pool image_id

For example:
Code:
root@pve11:~# ceph osd map ceph vm-100-disk-0
osdmap e49823 pool 'ceph' (2) object 'vm-100-disk-0' -> pg 2.720ca493 (2.93) -> up ([23,12,3,11], p23) acting ([23,12,3,11], p23)

This image is located on OSDs 23, 12, 3 and 11 -> 2 datacenters, 2 different hosts in each.
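If you want to map those OSD IDs back to their hosts and datacenters, something like this should do it (OSD id 23 is just taken from the example above):

Code:
# show the CRUSH location (host, datacenter) of a single OSD
ceph osd find 23
# or inspect the whole topology at once
ceph osd tree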

Good luck!
 
Easy: go with a single datacenter if your business case allows it.
The latency between the datacenters has a significant impact on Ceph performance, even at a 1 ms round trip.
Additionally, we keep 4 copies of each image (vs 3 by default), which also impacts performance since Ceph uses synchronous writes.
If your workloads are not I/O intensive, maybe a stretched cluster will work for you; it needs to be tested...
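If you want a rough baseline for what the extra copies and the inter-DC round trip cost on your own cluster, a simple check (the pool name is a placeholder) is a single-threaded small-block write benchmark against the pool:

Code:
# 60 seconds of 4 KiB writes with a single op in flight
rados bench -p ceph_pool 60 write -b 4096 -t 1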
 
With regards to crush map, I have found the following in our docs:
I'm missing a lot of detail here and I may be wrong, but I would like to avoid confusion for future readers: that config is not a Ceph Stretched Cluster [1]. It is a "simple" CRUSH rule that uses a datacenter bucket as the primary failure domain, then host.

Ceph stretched cluster mode is a completely different beast that overcomes most of the problems with your current config:
  • With your config there will be no MON quorum once one DC is down, no matter where you place your MONs: when a DC fails, you will have to do disaster recovery at the MON level to remove the unavailable MONs, and there will be no I/O on the cluster in the meantime.
  • With a stretched cluster, Ceph knows the locality of each client with respect to MON and OSD placement, directing reads/writes to the local site and reducing the impact of the latency between datacenters.
  • Once the failed DC is back, you will have to redo the MONs, sometimes edit the OSD config too, and resync everything from scratch.
A real Ceph Stretched Cluster isn't trivial to implement, nor is it supported natively by PVE (it would need something like HA at the multi-cluster level or some other kind of third-party orchestration), although if you have the requirement and the infrastructure it could well be worth the effort.

Regards.

[1] https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
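For reference, the basic setup sequence from the Ceph stretch-mode docs [1] looks roughly like this (MON names, the CRUSH rule name and the bucket names are placeholders; treat it as a sketch, not a drop-in config):

Code:
# tell Ceph where each MON lives; the 5th MON in a 3rd site acts as tiebreaker
ceph mon set_location mon-a datacenter=DC1
ceph mon set_location mon-b datacenter=DC1
ceph mon set_location mon-c datacenter=DC2
ceph mon set_location mon-d datacenter=DC2
ceph mon set_location mon-e datacenter=DC3
# switch to the connectivity-based MON election strategy required for stretch mode
ceph mon set election_strategy connectivity
# enable stretch mode: tiebreaker MON, an existing CRUSH rule for the two sites, dividing bucket type
ceph mon enable_stretch_mode mon-e stretch_rule datacenter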
 
@VictorSTS, technically you are right; we are using a 3rd datacenter only for a MON, to maintain quorum in the event of a datacenter failure.
The Ceph Stretched Cluster is something we looked into a long time ago; it was only available with Pacific and above, which at the time was not yet available in Proxmox. We had not seen many people using it and abandoned the idea. I am sure we have that discussion somewhere on the forum.

Thanks for clarifying!
 
Beware of this if both DCs can see the remote MON but the MONs at each "local" DC can't reach each other.

Have you tried that in a lab? It won't work unless you do a lot of manual disaster recovery. Neither side will get Ceph quorum due to a MON election loop. I can't remember the dirty details at the moment, but AFAIR there is some logic in Ceph that checks whether all MONs can see each other or not. That doesn't happen in a true stretched cluster. Ah yes, found it - a must-watch video on stretched clusters [1].

[1] https://www.youtube.com/watch?v=1jE_1jQ_I88
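If you want to check which election strategy your MONs are currently using (classic vs the connectivity strategy that stretch mode requires), this should show it on a reasonably recent Ceph release:

Code:
# the connectivity strategy is the one required for true stretch mode
ceph mon dump | grep election_strategy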
 
Against slow Ceph with sync/async datacenter replication, you could also take a look at a full SDS (software-defined storage) solution from Hammerspace, which does NFS in HA and is certified as a single cluster filesystem in sync or async mode for 64 DCs at worldwide sites (and that is not the limit). It provides the easiest data access for users or applications, since everything is accessible on any unmodified client as plain files, so it even works flawlessly for PVE VM migrations. Production performance on GPU clusters at Meta reached 12.5 TB/s (!!), with 4k fio randwrite IOPS in the millions. Under the hood Hammerspace works like Ceph Bluestore, with distributed objects of the files exposed via HA NFS, so it's like comparing a Lambo to a Fiat 500.
 
A nice feature is migrating in-use files on the fly between different HW tiers: while reading or writing a file (e.g. a VM image), its objects can be moved from HDD to NVMe or the other way around without any interruption to the application (like PVE) or to a user looking at it in a file browser.
:)
 
Yes, Ceph is free as long as you don't use it with a subscription in a RH environment, and Hammerspace takes its obolus, but it's billed on TB usage per month, and if you decide you don't need professional support anymore you can opt out at any given time ... hopefully you won't need support after that ...
:)