Cluster performance degradation

I currently use the 10G ports for both the regular network and for Ceph and cluster traffic, but I keep seeing performance degradation in the VMs and I still can't understand why. Since it's Christmas, would anyone be kind enough to help me understand what is causing this degradation?
 
Your ceph/osd output shows everything you need, so on that side everything is okay. The HDDs have low random IOPS, which is why, with 70 TB written to them, you get low speed and high latency. There are two options: add DB/WAL on SSD/NVMe, or replace the HDDs with enterprise SSDs.
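If you want to confirm that the spinners are the bottleneck, Ceph can show you per-OSD latency directly; a quick sketch (osd.0 is just an example id):

```
# Per-OSD commit/apply latency in ms; consistently high values on the
# HDD OSDs point at the disks rather than the network.
ceph osd perf

# Utilization and PG count per OSD, to spot uneven data distribution.
ceph osd df tree

# Built-in single-OSD write benchmark (writes 1 GiB by default).
ceph tell osd.0 bench
```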
 
So you're telling me that the cluster slowed down because I added VMs that occupy 70 TB? I could adopt the first solution of adding DB/WAL on SSDs, but how much SSD capacity should I add, given 4 x 16 TB HDDs per node? And above all, can I add them in the current state, with Ceph already configured?
 
Can someone please suggest the optimal network configuration, given that each node has 2 NICs, one 10G and one 1G?
The network is not your problem. I doubt you are saturating your 10G network with the current setup.

A few things could be contributing to the issue and need answers (see the commands after this list for how to collect them):
- How many PGs?
- How many pools?
- How many replicas?
- A customized CRUSH map?
- What is the current health of Ceph? (ceph -s)
- Is scrubbing running continuously?
- Did you perform any tweaks on Ceph after deployment?
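Most of these can be pulled with a handful of standard, read-only commands:

```
# Overall health, plus any running scrub/recovery/backfill activity.
ceph -s

# Every pool with its replica size, min_size and pg_num.
ceph osd pool ls detail

# PG totals and their states.
ceph pg stat

# CRUSH rules in use, to check for customizations.
ceph osd crush rule ls
ceph osd crush rule dump

# Any settings changed after deployment.
ceph config dump
```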

Ceph is not designed with small clusters in mind. The larger a Ceph cluster becomes, the faster it gets; there is no other storage with this characteristic. That said, Ceph does work in small environments such as 3 nodes, but there are things you must keep in mind so that your expectations stay within realistic boundaries. To give an example, simply going from 4 nodes to 5 adds roughly 20%-30% performance.

The replica count is very important when a small Ceph cluster is in question. You must not use 3 replicas; that will kill performance more than anything. Replica 2 is the way to go. You may be using the default of 3 replicas from when you created the pool. If you want to run a small Ceph cluster, the trade-offs must be accepted, and the replica count is one of them. The PG count also affects performance quite a bit; neither too low nor too high is good.
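If you decide to go that way, both settings are per-pool and can be inspected and changed at runtime. A hedged sketch, where the pool name 'rbd' is an example (and note the warnings about replica 2 further down in this thread):

```
# Current settings for the pool.
ceph osd pool get rbd size
ceph osd pool get rbd pg_num

# Common rule of thumb for pg_num: (OSD count * 100) / replica count,
# rounded up to a power of two. E.g. 12 OSDs with 3 replicas:
# 12 * 100 / 3 = 400 -> 512.
ceph osd pool set rbd pg_num 512

# Lower the replica count (read the replies below before doing this!).
ceph osd pool set rbd size 2
```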

The initial cluster probably was fast before you started loading all of your data. As more and more data gets stored on Ceph, more replicas get created and the need for PG redistribution increases. Do not forget: with replica 3, each incoming write is stored 3 times, so a 1 GB write from a VM becomes 3 GB of raw writes, and on a 3-node cluster every node receives a full copy.

With larger spinning drives, performance does degrade a bit. As others have suggested, using an SSD as the journal (DB/WAL) drive lets you mitigate that issue easily. Yes, you can use enterprise-grade SSDs, but there is nothing wrong with using good consumer-grade SSDs to add performance, especially if you add 2 SSDs in a mirror to hold the DB/WAL; that adds performance without breaking the bank. The Lexar NS100 512GB and the T-Force Vulcan Z 1TB are both cheap, viable options. The reason I mention these particular ones is that I have used them in production Ceph after extensive testing.
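To answer the "can I add them with Ceph already configured" question from above: yes, but a DB/WAL device is normally given to an OSD at creation time, so the usual route is to recreate the OSDs one at a time. A rough sketch on Proxmox, with osd.0, /dev/sda, /dev/nvme0n1 and the sizes as placeholders; let the cluster return to HEALTH_OK before touching the next OSD:

```
# Drain the OSD and wait for rebalancing to finish (watch ceph -s).
ceph osd out osd.0

# Stop and remove it; --cleanup wipes the disk for reuse.
systemctl stop ceph-osd@0
pveceph osd destroy 0 --cleanup

# Recreate it with the DB/WAL on the SSD. db_size is in GiB; a common
# rule of thumb is 1-4% of the data device, i.e. roughly 160-640 GiB
# per 16 TB HDD.
pveceph osd create /dev/sda --db_dev /dev/nvme0n1 --db_size 300
```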
 
So do you recommend adding 2 SSDs per node, and making 2 replicas instead of 3? I have a cluster of 3 nodes with HA, and I thought it was necessary to use 3 replicas. What would be the optimal configuration then? All the guides I've read talk about 3 nodes with a minimum of 2 replicas and a maximum of 3; I'm very confused about which route to take.
 
The replica count is very important when a small Ceph cluster is in question. You must not use 3 replicas. That will kill performance more than anything. Replica 2 is the way to go.
Why would you go below 3 replicas?!
With 2/2 you have no HA whatsoever: some of the PGs will go down as soon as a node goes down. That is acceptable only for a pool holding just ISOs.
With 2/1 you may lose data [1]. Even if you could gain more speed, this is a no-go for me.

[1] https://forum.proxmox.com/threads/ceph-pool-size-is-2-1-really-a-bad-idea.68939/post-309189
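For reference, the 2/2 and 2/1 shorthand is the pool's size/min_size pair: size is the replica count, min_size is how many copies must be online for I/O to continue. You can see what your pools run with like this:

```
# Each pool line shows "replicated size X min_size Y".
ceph osd pool ls detail | grep size
```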
 
He wanted to make this easier for you, but there is no easy way out. I will sum up his recommendations:
1) Lower the number of replicas - this writes to 2 nodes instead of 3, but it makes your cluster fragile when a node dies.
2) Use mirrored SSDs for DB/WAL - in his case, buy consumer grade to offload some of Ceph's writing to the SSDs.
3) Add more nodes - this always makes sense, because Ceph benefits from more and more parallelization.

As I said, you are in hot water, because someone mis-engineered this cluster thinking you would get a big, bulky cluster with good performance, but forgot to mention that HDDs cannot ever, EVER, deliver good performance. This is the short, summarized version.
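Whichever option you pick, it is worth benchmarking before and after the change so you can see what it actually bought you. A sketch using Ceph's built-in pool benchmark (the pool name is an example; --no-cleanup keeps the objects around for the read tests):

```
# 60-second write benchmark against the pool.
rados bench -p rbd 60 write --no-cleanup

# Sequential and random read benchmarks over the objects just written.
rados bench -p rbd 60 seq
rados bench -p rbd 60 rand

# Delete the benchmark objects afterwards.
rados -p rbd cleanup
```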
 
So in the end I have to go back to VMware? What could be the solution for me now? I would like to use HA; can I get it without using Ceph, using the HDD disks locally? Taking into account that each node has 4 HDDs of 16 TB.
 
How much SSD capacity should I add per node, having 4 x 16 TB HDDs per node? And should I then reinstall the entire cluster, or do I just delete the OSDs and add them again?


What if I put in 2 better SSDs, or would that be wasted? I also wonder: if an SSD breaks, do you lose the whole node? What exactly would be the function of the SSD disk in this case?
 
Hello toto,

Is this going to scale up to a lot of VMs, or just a few?

To be on the safest side, I would recommend at least 5 nodes for Ceph only, with enterprise NVMe and dual 25G interfaces or 2 x 100G. Then you could add maybe another 4 nodes for your compute.

Although it seems attractive to cram everything into 3 nodes plus Ceph, in reality you are setting yourself up for a lot of trouble.

If the budget only allows 3 nodes, maybe it is better to use ZFS with replication to another node... IMO, 3 nodes is really not a good idea for running Ceph at all.
 
Ceph is good, but you need to set it up right: a correct number of nodes (personally I think 7 is the best starting point and 5 the minimum), enough OSDs with proper enterprise NVMe/SSD disks, and a fast network (25G and above).

With ZFS HA in Proxmox you can set up replication, and I think you would lose at most 1 minute (the shortest replication interval) of data in the event you need to fail over.
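A minimal sketch of setting that up from the CLI, assuming VM 100 and a target node named pve2 (both placeholders; the job id must have the form <vmid>-<number>):

```
# Replicate VM 100's disks to node pve2 every minute (the minimum
# replication interval in Proxmox).
pvesr create-local-job 100-0 pve2 --schedule "*/1"

# Show all replication jobs and their last sync.
pvesr status
```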
 
