cluster performance degradation

I currently use 10G ports for the network and also for Ceph and the cluster, but the VMs keep degrading and I still can't understand why. Since it's Christmas, is anyone kind enough to help me understand the cause of this degradation?
 
The Ceph OSD output shows everything you need, and everything there is okay. HDDs have low random IOPS, which is why, with 70 TB written to them, you get low speed and high latency. There are two options: move the DB/WAL onto SSD/NVMe, or replace the HDDs with enterprise SSDs.
 
So you're telling me that the cluster slowed down because I added VMs that occupy 70 TB? I could adopt the first solution and put the DB/WAL on SSD, but how much SSD should I add, given 4 x 16 TB HDDs per node? And above all, can I add them in the current state, with Ceph already configured?
 
Can someone please suggest the optimal network configuration, given that each node has 2 NICs at 10G and 1G?
The network is not your problem. I doubt you are saturating your 10G network with the current setup.
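To put rough numbers on that, here is a back-of-envelope sketch comparing what the HDDs could deliver on small random I/O with what a 10G link can carry. The per-disk IOPS figure and the 4 KiB I/O size are assumptions chosen for illustration, not measurements from this cluster; only the node count, disk count, and link speed come from the thread.

```python
# Back-of-envelope: aggregate HDD random throughput vs. one 10G link.
HDD_RANDOM_IOPS = 150      # ASSUMPTION: ballpark for a 7200 rpm HDD, not measured
IO_SIZE_BYTES = 4 * 1024   # ASSUMPTION: 4 KiB random I/O
NODES = 3                  # from the thread
HDDS_PER_NODE = 4          # from the thread
LINK_GBIT = 10             # from the thread


def hdd_random_throughput_mb_s(disks, iops_per_disk, io_size_bytes):
    """Aggregate random throughput in MB/s if every disk works in parallel."""
    return disks * iops_per_disk * io_size_bytes / 1e6


def link_throughput_mb_s(gbit):
    """Theoretical line rate of a link in MB/s (protocol overhead ignored)."""
    return gbit * 1e9 / 8 / 1e6


disk_mb_s = hdd_random_throughput_mb_s(NODES * HDDS_PER_NODE,
                                       HDD_RANDOM_IOPS, IO_SIZE_BYTES)
link_mb_s = link_throughput_mb_s(LINK_GBIT)

print(f"HDD random I/O, all disks: {disk_mb_s:.1f} MB/s")
print(f"One 10G link:              {link_mb_s:.0f} MB/s")
```

Under these assumptions the disks run out of random IOPS long before the link fills up. Sequential workloads look very different, so treat this only as an order-of-magnitude check.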

A few things could be contributing to the issue, and they need answers:
- How many PGs?
- How many pools?
- How many replicas?
- Customized CRUSH map?
- How is the current health of ceph? (ceph -s)
- Scrub running continuously?
- Performed any tweaks on Ceph after deployment?
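
Most of these questions can be answered with read-only `ceph` commands. The sketch below collects their output in one go. The mapping of question to command is my own suggestion, and the helper name `collect` is hypothetical; the script does nothing if the `ceph` CLI is not installed on the machine where it runs.

```python
import shutil
import subprocess

# Read-only commands, one per question in the list above.
DIAGNOSTICS = {
    "health and scrub activity": ["ceph", "-s"],
    "pools, replica size and pg_num": ["ceph", "osd", "pool", "ls", "detail"],
    "per-OSD usage and PG count": ["ceph", "osd", "df", "tree"],
    "CRUSH rules": ["ceph", "osd", "crush", "rule", "dump"],
    "per-OSD latency": ["ceph", "osd", "perf"],
    "settings changed after deployment": ["ceph", "config", "dump"],
}


def collect(diagnostics=DIAGNOSTICS):
    """Run each command and return {label: output}.

    Returns an empty dict when the ceph CLI is not on PATH.
    """
    if shutil.which("ceph") is None:
        return {}
    results = {}
    for label, cmd in diagnostics.items():
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        results[label] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results


if __name__ == "__main__":
    for label, output in collect().items():
        print(f"== {label} ==\n{output}")
```

Posting that output in the thread would let others check the PG count, replica size, and scrub state without guessing.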

Ceph is not designed with small clusters in mind. The larger a Ceph cluster becomes, the faster it gets. There is no other storage with this characteristic. Ceph does work in small environments such as 3 nodes, but there are things you must keep in mind so that your expectations stay within bounds. To give an example, simply going from 4 nodes to 5 adds roughly 20%-30% performance.

The replica count is very, very important for a small Ceph cluster. You must not use 3 replicas. That will kill performance more than anything. Replica 2 is the way to go. You may have used the default of 3 replicas when you created the pool. If you want to run a small Ceph cluster, the tradeoffs must be accepted, and the replica count is one of them. PG count also affects performance quite a bit. Too low or too high, neither is good.
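
For the PG count, a common rule of thumb is roughly 100 PGs per OSD, divided by the replica count and rounded up to a power of two. The sketch below applies it to 12 OSDs (3 nodes x 4 HDDs, from the thread). It assumes a single pool holding nearly all the data, and the function name is hypothetical. Recent Ceph releases also ship a PG autoscaler that may already be managing this for you.

```python
def suggested_pg_num(num_osds, replica_size, target_pgs_per_osd=100):
    """Rule-of-thumb pg_num for a single pool.

    ASSUMPTION: one pool holds nearly all data and the target is about
    100 PGs per OSD; the result is rounded up to the next power of two.
    """
    raw = num_osds * target_pgs_per_osd / replica_size
    power = 1
    while power < raw:
        power *= 2
    return power


# 3 nodes x 4 HDDs = 12 OSDs
for size in (3, 2):
    print(f"size={size}: pg_num around {suggested_pg_num(12, size)}")
```

With several pools, the budget has to be split between them in proportion to the data each one is expected to hold.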

The cluster was probably fast initially, before you started loading all of your data. As more and more data gets stored on Ceph, more replicas get created and the need for PG distribution increases. Do not forget that with replica 3, every incoming write gets written 3 times.
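
The effect of the replica count on capacity and on the volume of writes is simple arithmetic. Here is a small sketch using the hardware from this thread (3 nodes, 4 x 16 TB HDDs each). It ignores the near-full safety margins and metadata overhead, so real usable space would be lower; the 10 TB client write is just an illustrative figure.

```python
NODES = 3            # from the thread
HDDS_PER_NODE = 4    # from the thread
HDD_TB = 16          # from the thread


def raw_capacity_tb(nodes, hdds_per_node, hdd_tb):
    """Total raw disk space across the cluster."""
    return nodes * hdds_per_node * hdd_tb


def usable_capacity_tb(raw_tb, replica_size):
    """Space left for client data; ignores near-full ratios and overhead."""
    return raw_tb / replica_size


def raw_written_tb(client_tb, replica_size):
    """Data that actually lands on disks for a given amount of client writes."""
    return client_tb * replica_size


raw = raw_capacity_tb(NODES, HDDS_PER_NODE, HDD_TB)
for size in (3, 2):
    print(f"size={size}: raw {raw} TB, usable {usable_capacity_tb(raw, size):.0f} TB, "
          f"10 TB of client writes put {raw_written_tb(10, size)} TB on disk")
```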

With larger spinning drives, performance does degrade a bit. As others have suggested, you can mitigate that issue easily by using an SSD as the journal drive. Yes, you can use enterprise-grade SSDs. But there is nothing wrong with using good consumer-grade SSDs to add performance. Especially if you add 2 SSDs in a mirror to hold the DB/WAL, you can add performance without breaking the bank. The Lexar NS100 512GB or the T-Force Vulcan Z 1TB are both cheap, viable options. The reason I mention these is that I have used them in production Ceph after extensive testing.
 
So do you recommend adding 2 SSDs to each node and using 2 replicas instead of 3? I have a cluster of 3 nodes with HA, and I thought 3 replicas were necessary. What would the optimal configuration be, then? All the guides I have read talk about 3 nodes with a minimum of 2 replicas and a maximum of 3, so I'm very confused about which route to take.
 
The replica count is very, very important for a small Ceph cluster. You must not use 3 replicas. That will kill performance more than anything. Replica 2 is the way to go.
Why would you go below 3 replicas?!
With 2/2 you have no HA whatsoever. Some of the PGs will go down as soon as a node goes down. That is only acceptable for a pool that holds just ISOs.
With 2/1 you may lose data [1]. Even if you could get more speed, this is a no-go for me.

[1] https://forum.proxmox.com/threads/ceph-pool-size-is-2-1-really-a-bad-idea.68939/post-309189
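
The size/min_size argument above can be written down as a tiny model. This is a deliberately simplified sketch, not how Ceph computes PG states internally: it assumes one replica per node and looks at a single PG that had a copy on each failed node. The function name and the state labels are hypothetical.

```python
def pg_state_after_failures(size, min_size, failed_replicas):
    """Simplified state of one PG after losing some of its replicas.

    ASSUMPTION: a PG serves I/O only while at least min_size copies survive.
    """
    surviving = size - failed_replicas
    if surviving <= 0:
        return "no copies left"
    if surviving < min_size:
        return "inactive (I/O blocked)"
    if surviving < size:
        return f"active, degraded ({surviving} copy/copies left)"
    return "active, clean"


# One node down in a 3-node cluster, for the three settings discussed above.
for size, min_size in ((3, 2), (2, 2), (2, 1)):
    state = pg_state_after_failures(size, min_size, 1)
    print(f"{size}/{min_size}: {state}")
```

With 3/2 the PG stays active on two copies, with 2/2 it blocks, and with 2/1 it keeps running on a single copy, which is where the risk of data loss comes from.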
 
He wanted to make this easier for you, but there is no easy way out. I will sum up his recommendations:
1) Lower the number of replicas: this writes to 2 nodes instead of 3, but it makes your cluster fragile and prone to failure.
2) Use mirrored SSDs for DB/WAL: in his case, buy consumer-grade drives to offload some of Ceph's writes to SSDs.
3) Add more nodes: this always makes sense, because Ceph parallelizes more and more as the cluster grows.

As I said, you are in hot water, because someone misengineered the cluster, thinking you would get a big, bulky cluster with good performance, but forgot to mention that HDDs can never, EVER, deliver good performance. This is the short, summarized version.
 
So in the end I have to go back to VMware? What could the solution be for me now? I would like to use HA. Can I do that without Ceph, using the HDDs locally, taking into account that each node has 4 HDDs of 16 TB?
 
How much SSD capacity should I add to each node, given 4 x 16 TB HDDs per node? Should I then reinstall the entire cluster, or do I delete the OSDs and add them again?


What if I put in 2 better SSDs, or is it a lost cause? And I wonder: if the SSD breaks, do I lose the whole node? What would the function of the SSD be in this case?
 
Yeah, something like a 960 GB to 1.2 TB SSD for each node. If it fails, you lose all OSDs in that node, so yes, you could RAID them in a mirror.
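
To see how that SSD size relates to the HDDs, here is a short sizing sketch. The 1% to 4% range for the DB device relative to the data device is a guideline I recall from older Ceph BlueStore documentation, so treat it as an assumption and check the docs for your release. The function names are hypothetical; the disk sizes come from the thread.

```python
HDD_TB = 16          # from the thread
HDDS_PER_NODE = 4    # from the thread


def db_size_gb(osd_tb, percent):
    """DB/WAL size per OSD for a given share of the data device."""
    return osd_tb * 1000 * percent / 100


def db_share_percent(ssd_gb, osds, osd_tb):
    """Share of each data device covered when one SSD is split across OSDs."""
    return (ssd_gb / osds) * 100 / (osd_tb * 1000)


# ASSUMPTION: 1% to 4% guideline, see the note above.
for pct in (1, 2, 4):
    per_osd = db_size_gb(HDD_TB, pct)
    print(f"{pct}%: {per_osd:.0f} GB per OSD, {per_osd * HDDS_PER_NODE:.0f} GB per node")

for ssd_gb in (960, 1200):
    share = db_share_percent(ssd_gb, HDDS_PER_NODE, HDD_TB)
    print(f"{ssd_gb} GB SSD split 4 ways: {share:.3f}% of each 16 TB HDD")
```

Under that assumption, the 960 GB to 1.2 TB suggestion lands between 1% and 2% per OSD.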
 
Hello toto,

Is this going to scale up to a lot of VMs, or just a few?

As the safest bet, I would recommend:

At least 5 nodes for Ceph only, with enterprise NVMe and dual 25G interfaces or 2x 100G. Then you could add maybe another 4 nodes for your compute.

Although it seems attractive to cram everything into 3 nodes with Ceph, in reality you are setting yourself up for a lot of trouble.

If the budget only allows 3 nodes, maybe it is better to use ZFS with replication to another node. IMO running Ceph on 3 nodes is really not a good idea at all.
 
Ceph is good, but you need to set it up right, with the correct number of nodes (personally I think 7 is the best starting point and 5 the minimum), enough OSDs on proper enterprise NVMe/SSD disks, and a fast network (25G and above).

For ZFS HA in Proxmox you can set up replication, and I think you would lose about 1 minute of data (the shortest replication interval) if you need to fail over.
 
