Understanding Ceph performance and scaling

crembz

May 8, 2023
So I've been playing around in my lab to get Ceph going on a 5-node cluster backed by a 10G network.

I've set up 3 nodes to test ... each with one NVMe and one HDD.

I've created two CRUSH rules: the default one and another targeting the SSD/NVMe drives.
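For reference, the SSD rule and the pool on top of it were created roughly like this (rule/pool names and PG count are just what I picked; the default replicated rule was left as is):

# CRUSH rule that only selects OSDs with the "ssd" device class, failure domain = host
ceph osd crush rule create-replicated ssd-only default host ssd
# pool for the SSD/NVMe OSDs, attached to that rule
ceph osd pool create ceph-ssd 128
ceph osd pool set ceph-ssd crush_rule ssd-only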

I've set up the corresponding pools and created 4 VMs:

VM1: Ceph SSD pool
VM2: Ceph HDD pool
VM3: local ZFS (NVMe) pool
VM4: ZFS 5-vdev HDD pool

I tested the speeds inside the VMs (all using virtio) with fio:

fio --directory=/ --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=1G

I'm getting massive differences between the Ceph and ZFS setups, with ZFS roughly 5-10x faster than Ceph:

Ceph NVMe: 110 IOPS
Ceph HDD: 10 IOPS
ZFS NVMe: 890 IOPS
ZFS HDD pool: 47 IOPS

I'm using consumer-grade parts across all test scenarios.

My question is ... why is there such a massive discrepancy between Ceph and ZFS, and is this expected? Is there something I am doing wrong here?
 
Yes, this is expected. You are comparing apples and oranges: a local filesystem on local storage versus a distributed storage system.
So how does performance scale in Ceph then? Is it purely by adding nodes, or does it also scale by adding OSDs?
 
The first thing to do is to have the RocksDB of HDD-OSDs on SSD.

Then add OSDs and nodes. This will increase the overall performance of the storage cluster.
But it will not increase the single-thread IO performance for a single VM.
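On Proxmox this is easiest at OSD creation time, something like the following (device names are placeholders):

# HDD-backed OSD with its RocksDB/WAL placed on the NVMe/SSD
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1
# or with ceph-volume directly, pointing block.db at a partition on the SSD
ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1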

Ceph is not primarily about performance but about consistency and redundancy.
 
The first thing to do is to have the RocksDB of HDD-OSDs on SSD.

Then add OSDs and nodes. This will increase the overall performance of the storage cluster.
But it will not increase the single-thread IO performance for a single VM.

Ceph is not primarily about performance but about consistency and redundancy.
I was reading about shifting the WAL and DB off to SSD; how are they protected in that case? My understanding is that if you lose them you lose the entire node?

I'm happy to lose a bit of performance, but I'm genuinely surprised at just how much is lost in this simple test.

I'm guessing performance would scale out with the number of nodes?
 
My understanding is that if you lose them you lose the entire node?
You will lose all OSDs that use the lost SSD as their RocksDB device.
Usually you have a ratio of 4 to 6 HDDs per SSD, with a total of 15 to 20 HDDs in one node. Losing the capacity provided by those HDDs should not be an issue for a Ceph cluster of a certain size.
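You can check on each node which OSDs share a given DB device, e.g. with:

# lists every OSD on this node together with its data device and (if any) its block.db device
ceph-volume lvm list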

This is why very small Ceph clusters (3 to 5 nodes) should be all flash (SSD or even NVMe).
 
Hrm I might have to rethink this then. I was planning to have an SSD and 6 HDDs per node across three nodes, maybe expanding to a fourth. I'm also using a replication of 2/2.

If building an SSD cluster, I'm guessing the same rule applies, i.e. use enterprise SSDs?

What about mixed pools? I.e. HDD pools for media storage and SSD pools for VMs?
 
replication of 2/2.
Do not do that, it will lead to data loss.

You need "mixed use" or "enterprise" SSDs and not "read intensive" or "consumer grade" SSDs. The latter perform worse than HDDs in heavy write situations. Ceph always writes randomly.

Do not mix device classes in one pool. Separate SSD-OSDs and HDD-OSDs (even with RocksDB on SSD) into separate pools. If you have enough capacity on SSD it may work for VMs on a smaller cluster.
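In other words, stay with the default 3/2; for an existing pool that would be roughly (pool name is a placeholder):

# three replicas; I/O continues as long as at least two are available,
# so a single node reboot does not block writes (unlike 2/2)
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2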
 
I have been using a pool with 2/1 replication for my lab VMs for the last 5 years, and the cluster survived a couple of SSD losses and a lot of node reboots. Maybe I am just lucky, but I grew to trust that Ceph does OK even with the non-recommended configuration.

If you know what you are doing and are OK with potentially losing the data (have backups), I think it is a good option. BTW, 2/2 replication is not good: when you reboot a node the VMs would not be able to write...

In my cluster I have CephFS on HDDs for media/file storage and the SSDs for VMs... The HDDs are in a RAID5 configuration across the 3 disks, and performance is OK for streaming video...
 
I have been using a pool with 2/1 replication for my lab VMs for the last 5 years, and the cluster survived a couple of SSD losses and a lot of node reboots. Maybe I am just lucky, but I grew to trust that Ceph does OK even with the non-recommended configuration.
Lucky, sure, but more likely you have no load on your file system. There's not much damage to worry about when there are no in-flight operations during a fault.

If you know what you are doing and OK potentially losing the data (have backups) I think it is a good option
Ahh, there's the rub. "Know what you are doing" is very much subject to the Dunning-Kruger effect.

BTW, the replication 2/2 is not good, when you reboot a node the VMs would not be able to write...
For VERY good reason. You have no parity to compare your data to, and cannot depend on it being correct. If your use case is completely transient I suppose you can do so, but that's the thing about having a product with all the knobs exposed to the end user: there is no protection from the user doing potentially harmful things.

So how does performance scale in Ceph then? Is it purely by adding nodes, or does it also scale by adding OSDs?
Yes and yes, assuming you have no other bottlenecks. Search the forums, there are many discussions on this topic. Bear in mind, however, that "performance" scales in the aggregate, not necessarily for a single guest.
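One way to see this for yourself is to benchmark the pool directly with rados bench at different queue depths (pool name is just an example): the per-op latency stays about the same, while aggregate throughput grows with more parallel ops and more OSDs.

# one outstanding op: roughly the single-threaded sync-write case a VM sees
rados bench -p testpool 30 write -b 4096 -t 1
# 16 outstanding ops: closer to the aggregate the cluster can deliver
rados bench -p testpool 30 write -b 4096 -t 16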
 
Yes and yes, assuming you have no other bottlenecks. Search the forums, there are many discussions on this topic. Bear in mind, however, that "performance" scales in the aggregate, not necessarily for a single guest.

Ah I see ... so the pool gets faster and is able to handle more parallel IO. So if running a VM on an HDD pool ... how do I get more than 9 IOPS hahahaha? Or am I being unrealistic with my expectations of a distributed HDD pool?
 
So if running a VM on an HDD pool ... how do I get more than 9 IOPS
That's... slow. Any single-threaded write to an HDD-based pool maxes out at ~150 IOPS; 9 is bad. Have a look at your interconnects: what speed and link count have you deployed for your Ceph public and private interfaces? Do you share either (both?!) with other types of traffic?
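As a quick sanity check on the latency side, measure the round-trip time between your Ceph nodes, because every replicated sync write waits on the replicas across that link (address is a placeholder):

# at queue depth 1 the achievable IOPS is bounded by 1 / (network RTT + disk write latency)
ping -c 100 <other-node-ip>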
 
I have both public and cluster interfaces on separate VLANs over a 10G connection to the 10G switch. Links are up at 10G.
 
Lucky, sure, but more likely you have no load on your file system. There's not much damage to worry about when there are no in-flight operations during a fault.
There definitely should have been at least some activity during the faults (or host reboots). At the very least my test Splunk server receives a constant inflow of data. It does look like Ceph has a way of knowing that the data written to the OSDs that stayed up needs to be sent to the OSD that was temporarily down...


Ahh, there's the rub. "Know what you are doing" is very much subject to the Dunning-Kruger effect.


For VERY good reason. You have no parity to compare your data to, and cannot depend on it being correct. If your use case is completely transient I suppose you can do so, but that's the thing about having a product with all the knobs exposed to the end user: there is no protection from the user doing potentially harmful things.

And this is why I like it. The whole purpose of the lab is to learn by breaking things. And after 5 years I am still waiting for my potentially harmful things to actually break something...

 
So I did another test using a consumer SSD as a DB/WAL drive, and performance on the Ceph HDD pool shot up to 130 IOPS.

I haven't had time to throw more disks in, but does Ceph performance scale with the number of OSDs per node, or are you forced to scale out nodes to get better performance?
 
does Ceph performance scale with the number of OSDs per node, or are you forced to scale out nodes to get better performance?
The answer depends on what you're trying to accomplish. If you're trying to direct more performance to a single guest: no. If you're trying to increase single-client IOPS: no. If you're trying to give many guests full performance: yes.
 
The answer depends on what you're trying to accomplish. If you're trying to direct more performance to a single guest: no. If you're trying to increase single-client IOPS: no. If you're trying to give many guests full performance: yes.
I'll probably only have a handful of critical guests requiring strong performance; they'll be running streaming and media management services. I'll have a few running network services such as DNS/DHCP, etc. Then I'll have nested hypervisors. All other VMs will largely be sacrificial, and I'm not too concerned about performance on those.

I'm beginning to think that as nice as a clustered zero-SPOF platform would be in theory, it's probably not workable at a small scale. Maybe have a 'main' host which serves the shared storage for the entire cluster and protect it at all costs, or just use ZFS local replication for the VMs.
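If I go down that road, Proxmox's built-in ZFS storage replication should cover the VMs, roughly like this (VM ID, target node and schedule are placeholders):

# replicate VM 100's ZFS disks to node pve2 every 15 minutes
pvesr create-local-job 100-0 pve2 --schedule "*/15"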
 
