Understanding ceph performance and scaling

crembz

May 8, 2023

So I've been playing around in my lab to get ceph going on a 5 node cluster backed by a 10g network.

I've set up 3 nodes to test ... each with one NVMe and one HDD

I've created two CRUSH rules, the default one and another targeting the ssd/nvme drives

I've set up the corresponding pools and created 4 VMs:

VM1: ceph ssd pool
VM2: ceph hdd pool
VM3: local zfs pool
VM4: zfs 5vdev hdd pool

I tested the speeds in the vm (vms are all using virtio) using fio:

fio --directory=/ --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=1G

I'm getting massive swings in results between the ceph and zfs setups, with zfs roughly 5x to 8x faster than ceph in these runs

ceph nvme: 110 IOPS
ceph hdd: 10 IOPS
zfs nvme: 890 IOPS
zfs hdd pool: 47 IOPS
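
A rough way to read these numbers, assuming the fio run really is one job at queue depth 1 with synced writes: only one write is ever in flight, so each IOPS figure is roughly the inverse of the average per-write latency. The helper name `latency_ms` is made up for illustration; the IOPS values are the ones listed above.

```python
# Sketch: at numjobs=1 and iodepth=1 only one write is in flight at a time,
# so average latency per write is roughly 1 / IOPS.
# The IOPS figures are the ones reported in this post.
results_iops = {
    "ceph nvme": 110,
    "ceph hdd": 10,
    "zfs nvme": 890,
    "zfs hdd pool": 47,
}


def latency_ms(iops):
    """Approximate per-write latency in milliseconds at queue depth 1."""
    return 1000.0 / iops


latencies = {name: latency_ms(iops) for name, iops in results_iops.items()}

for name, ms in latencies.items():
    print(f"{name:13s} {results_iops[name]:4d} IOPS  ~{ms:7.2f} ms per write")

# Ratios between the local and the distributed setups
nvme_ratio = results_iops["zfs nvme"] / results_iops["ceph nvme"]
hdd_ratio = results_iops["zfs hdd pool"] / results_iops["ceph hdd"]
print(f"zfs/ceph nvme: {nvme_ratio:.1f}x, zfs/ceph hdd: {hdd_ratio:.1f}x")
```

Seen this way, the test mostly measures how long one synced 4K write takes end to end, not how much work the storage can do in parallel.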

I'm using consumer grade parts across all test scenarios

My question is ... why is there such a massive discrepancy between ceph and zfs ... and is this expected? Is there something I am doing wrong here?
 
Yes, this is expected. You are comparing apples and oranges: a local filesystem on local storage versus a distributed storage system.
So how then does performance scale in ceph? is it purely adding nodes or does it also scale by adding osds?
 
The first thing to do is to have the RocksDB of HDD-OSDs on SSD.

Then add OSDs and nodes. This will increase the overall performance of the storage cluster.
But it will not increase the single-thread IO performance for a single VM.
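
A toy model of that distinction, not a Ceph benchmark. It assumes each client issues one synced write at a time and that the cluster has some per-OSD ceiling; the latency and per-OSD figures are invented purely for illustration.

```python
# Toy model (not a Ceph benchmark): each client issues one synced write at a
# time, so a single client can never exceed 1 / latency, no matter how many
# OSDs exist. Adding OSDs only raises the ceiling for many clients together.


def single_client_iops(latency_ms):
    return 1000.0 / latency_ms


def aggregate_iops(clients, latency_ms, osds, iops_per_osd):
    demand = clients * single_client_iops(latency_ms)
    capacity = osds * iops_per_osd  # hypothetical cluster-wide ceiling
    return min(demand, capacity)


LATENCY_MS = 9.0      # assumed per-write latency
IOPS_PER_OSD = 500.0  # assumed, purely illustrative

for osds in (3, 6, 12):
    one = aggregate_iops(1, LATENCY_MS, osds, IOPS_PER_OSD)
    many = aggregate_iops(100, LATENCY_MS, osds, IOPS_PER_OSD)
    print(f"{osds:2d} OSDs: 1 client {one:7.1f} IOPS, 100 clients {many:7.1f} IOPS")
```

In this sketch the single-client column stays flat as OSDs are added, while the 100-client column grows with the OSD count.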

Ceph is not primarily about performance but about consistency and redundancy.
 
The first thing to do is to have the RocksDB of HDD-OSDs on SSD.

Then add OSDs and nodes. This will increase the overall performance of the storage cluster.
But it will not increase the single-thread IO performance for a single VM.

Ceph is not primarily about performance but about consistency and redundancy.
I was reading about shifting the WAL and DB off to SSD. How are they protected in that case? My understanding is if you lose them you lose the entire node?

I'm happy to lose a bit of performance, but I'm genuinely surprised at just how much is lost in this simple test.

I'm guessing performance would scale out with # of nodes?
 
My understanding is if you lose them you lose the entire node?
You will lose all OSDs that use the lost SSD as RocksDB device.
Usually you have a ratio of 4 to 6 HDDs on one SSD, with a total of 15 to 20 HDDs in one node. Losing the capacity provided by these HDDs should not be an issue for a Ceph cluster of a certain size.

This is why very small Ceph clusters (3 to 5 nodes) should be all flash (SSD or even NVMe).
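
To put rough numbers on that, here is a small sketch of how much of a cluster goes away when one shared RocksDB SSD dies. The 10-node size and the small 3-node layout (6 HDDs behind a single SSD per node) are assumptions for illustration; the per-node and per-SSD figures for the large case sit inside the ranges mentioned above.

```python
# Sketch: share of the cluster's OSDs that disappears when one shared
# RocksDB/WAL SSD dies. All OSDs whose DB lives on that SSD are lost with it.


def fraction_lost(nodes, hdds_per_node, hdds_per_db_ssd):
    total_osds = nodes * hdds_per_node
    return hdds_per_db_ssd / total_osds


# Larger cluster (10 nodes is an assumed size)
big = fraction_lost(nodes=10, hdds_per_node=20, hdds_per_db_ssd=5)

# Small cluster (assumed layout): 3 nodes, 6 HDDs each, all behind one SSD
small = fraction_lost(nodes=3, hdds_per_node=6, hdds_per_db_ssd=6)

print(f"10 nodes x 20 HDDs, 5 HDDs per SSD: {big:.1%} of OSDs lost")
print(f" 3 nodes x  6 HDDs, 6 HDDs per SSD: {small:.1%} of OSDs lost")
```

In the small layout a single SSD failure takes out every OSD on that node, which is a third of the cluster.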
 
Hrm I might have to rethink this then. I was planning to have an SSD and 6 HDDs per node across three nodes, maybe expanding to a fourth. I'm also using a replication of 2/2.

If building an SSD cluster, I'm guessing the same rule applies: use enterprise SSDs?

What about mixed pools? I.e. HDD pools for media storage and ssd pools for vms?
 
replication of 2/2.
Do not do that, it will lead to data loss.

You need "mixed use" or "enterprise" SSDs and not "read intensive" or "consumer grade" SSDs. The latter perform worse than HDDs in heavy write situations. Ceph always writes randomly.

Do not mix device classes in one pool. Separate SSD-OSDs and HDD-OSDs (even with RocksDB on SSD) into separate pools. If you have enough capacity on SSD it may work for VMs on a smaller cluster.
 
I have been using a pool with 2/1 replication for my lab VMs for the last 5 years, and the cluster survived a couple of SSD losses and a lot of node reboots. Maybe I am just lucky, but I grew to trust that ceph does OK even with the non-recommended configuration.

If you know what you are doing and are OK with potentially losing the data (have backups), I think it is a good option. BTW, 2/2 replication is not good: when you reboot a node, the VMs will not be able to write...
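
For anyone following along, the size/min_size behaviour being discussed can be sketched like this. It is a simplified model of the rule that a placement group only serves I/O while at least min_size replicas are available; the function name `accepts_io` is invented, and recovery, backfill and timing are ignored.

```python
# Simplified model of Ceph pool "size/min_size" semantics:
# a placement group serves I/O only while the number of available
# replicas is >= min_size. This ignores recovery, backfill and timing.


def accepts_io(size, min_size, replicas_down):
    available = size - replicas_down
    return available >= min_size


for size, min_size in [(2, 2), (2, 1), (3, 2)]:
    for down in (0, 1):
        state = "I/O continues" if accepts_io(size, min_size, down) else "I/O blocked"
        print(f"{size}/{min_size} with {down} replica(s) down: {state}")
```

With 2/1 the pool keeps accepting writes while one replica is down, but those writes exist on a single copy until the other replica comes back and catches up.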

In my cluster I have cephfs on HDDs for media/file storage and the SSDs for VMs... The HDD is using RAID5 configuration on the 3 disks, and performance is OK to stream the video....
 
I am using a pool with 2/1 replication for my lab VMs in the last 5 years, and the cluster survived a couple of SSD losses and a lot of node reboots. Maybe I am just lucky, but I grew to trust that ceph does OK even with the non-recommended configuration.
Lucky, sure, but more likely you have no load on your file system. There's not much damage to worry about when there are no in-flight operations during a fault.

If you know what you are doing and OK potentially losing the data (have backups) I think it is a good option
Ahh, there's the rub. "Know what you are doing" is very much subject to the Dunning-Kruger effect.

BTW, the replication 2/2 is not good, when you reboot a node the VMs would not be able to write...
For VERY good reason. You have no parity to compare your data to, and cannot depend on it being correct. If your use case is completely transient I suppose you can do so, but that's the thing about having a product with all the knobs exposed to the end user: there is no protection from the user doing potentially harmful things.

So how then does performance scale in ceph? is it purely adding nodes or does it also scale by adding osds?
Yes and yes, assuming you have no other bottlenecks. Search the forums, there are many discussions on this topic. Bear in mind, however, that "performance" scales in the aggregate, not necessarily to a single guest.
 
Yes and yes, assuming you have no other bottlenecks. search the forums, there are many discussions on this topic. bear in mind, however, that "performance" scales in the aggregate, not necessarily to a single guest.

Ah I see ... so the pool gets faster and is able to handle more parallel IO. So if running a vm on a hdd pool ... how do I get more than 9 IOPS hahahaha? Or am I being unrealistic with my expectations of a distributed HDD pool?
 
So if running a vm on a hdd pool ... how do I get more than 9 IOPS
That's... slow. Any single write to an HDD-based pool maxes out at ~150 IOPS; 9 is bad. Have a look at your interconnects: what speed and link count have you deployed for your ceph public and private interfaces? Do you share either (both?!) with other types of traffic?
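
One plausible reason a ceiling in that range exists is plain disk mechanics, independent of Ceph. The sketch below estimates random IOPS for a single spinning disk from its rotational speed; the 4 ms average seek time is an assumed figure and the function names are made up, so treat the output as an order-of-magnitude guide only.

```python
# Back-of-the-envelope random IOPS for one spinning disk.
# On average the platter has to turn half a revolution before the
# requested sector passes under the head.


def rotational_latency_ms(rpm):
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def random_iops(rpm, seek_ms):
    service_time_ms = seek_ms + rotational_latency_ms(rpm)
    return 1000.0 / service_time_ms


ASSUMED_SEEK_MS = 4.0  # illustrative average seek time, not a measured value

for rpm in (5400, 7200, 15000):
    iops = random_iops(rpm, ASSUMED_SEEK_MS)
    print(f"{rpm:5d} rpm: ~{rotational_latency_ms(rpm):.2f} ms rotational, ~{iops:.0f} IOPS")
```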
 
I have both public and cluster interfaces on separate VLANs over a 10g connection to the 10g switch. Links are up at 10g.
 
Lucky, sure, but more likely you have no load on your file system. There's not much damage to worry about when there are no in-flight operations during a fault.
There definitely should have been at least some activity during the faults (or host reboots). At the very least my test splunk server should receive a permanent inflow of data. It does look like ceph has a way to know which data was written to the OSDs that stayed up and needs to be sent to the OSD that was temporarily down...


Ahh, there's the rub. "Know what you are doing" is very much subject to the Dunning-Kruger effect.


For VERY good reason. You have no parity to compare your data to, and cannot depend on it being correct. If your use case is completely transient I suppose you can do so, but that's the thing about having a product with all the knobs exposed to the end user: there is no protection from the user doing potentially harmful things.

And this is why I like it. The whole purpose of the lab is to learn by breaking things. And after 5 years I am still waiting for my potentially harmful things to actually break something...

 
So I did another test using a consumer SSD as a DB/WAL drive, and performance across the ceph HDD pool shot up to 130 IOPS.

I haven't had time to throw more disks in but does ceph performance scale with the # of osds on the nodes or are you forced to scale out nodes to get better performance?
 
does ceph performance scale with the # of osds on the nodes or are you forced to scale out nodes to get better performance?
The answer depends on what you're trying to accomplish. If you're trying to harness performance to a single guest- no. If you're trying to increase iops- no. If you're trying to get many guests full performance- yes.
 
The answer depends on what you're trying to accomplish. If you're trying to harness performance to a single guest- no. If you're trying to increase iops- no. If you're trying to get many guests full performance- yes.
I'll probably only have a handful of critical guests requiring strong performance; they'll be running streaming and media management services. I'll have a few running network services such as DNS/DHCP etc. Then I'll have nested hypervisors. All other VMs will largely be sacrificial and I'm not too concerned about performance on those.

I'm beginning to think that as nice as a clustered zero-SPOF platform would be in theory, it's probably not workable at a small scale. Maybe I'll have a 'main' master host which serves the shared storage for the entire cluster and protect it at all costs, or maybe just use zfs local replication for the VMs.