CEPH cache disk

Hello folks

Sorry. This is long-ish. It's been a saga in my life ...
I'm not clear on exactly how to deploy a cache disk with CEPH.
I've read a lot of stuff about setting up CEPH, from Proxmox and the CEPH site. Browsed some forum posts. Done a bunch of test builds.

I'm an old VMware guy. And I'm converting an existing VSAN to CEPH.
In VSAN, you have disk groups, and each disk group can have cache disks assigned.

In CEPH, when you set up an OSD, you have the option of putting the DB and WAL either on the OSD itself or on another disk.
You can use one DB/WAL disk for multiple OSDs.
That seems kinda like cache for a VMware disk group. (Correct me if that's wrong.)
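For reference, this is how I understand the deployment works on Proxmox, with several OSDs pointed at one faster device for their DB (and WAL, which sits with the DB when you don't give it its own device). The device names and the 60 GiB DB size below are just placeholders from my reading, not recommendations; correct me if this isn't the right way:

    # first OSD on /dev/sdb, with its DB/WAL carved out of the NVMe
    pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_size 60

    # second OSD on /dev/sdc, sharing the same DB device (Proxmox creates another LV on it)
    pveceph osd create /dev/sdc --db_dev /dev/nvme0n1 --db_size 60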

The cluster I'm converting to CEPH is performing fairly horribly. VMs get ~350 MB/s for both read and write. That's spinning-rust speed. Bad.
(EDIT: this 350 MB/s was a bad test config on my part; see the retest below.)
I tested the original VSAN cluster before it was torn down, and it did 450 MB/s write and 4.5 GB/s read. Not great, but not this bad.
If VSAN could go that fast, maybe CEPH can with the proper config...

The hosts have identical SSDs, no fast cache disks. (I just discovered this.)
The disks are SATA (not SAS), so they connect at 6 Gb/s. (Another lovely discovery.)
The network is 10 Gb. (It's always been that way. Not perfect, but it's exactly what VSAN ran on, and VSAN was much faster.)
It is clear why they are slow.
What is not clear is why they are so much slower than VSAN was.

I've done a lot of rebuilds, trying to tune it. Gawd, the time I have into this already. Weeks. Many weeks.
The number of disks in the CEPH array doesn't seem to matter much. I get the same speed whether it's 2 disks per host or 6.
I've tried using one of the disks as local cache (DB/WAL). Zero improvement over no cache disk.
I've actually read a CEPH performance tuning article that indicates CEPH on SATA (6 Gb/s) does not significantly benefit from a SATA (6 Gb/s) cache disk.
Yet VSAN was acceptable on this very same hardware with no fast cache. Is there something I'm doing wrong when I set up the cache?

The one thing that's really clear is that I must order a SAS (12 Gb/s) enterprise-class mixed-use (MU) SSD for each host to use as cache.
That alone I expect to make a significant difference.

But what can I do with this gear today? How can I make it perform like VSAN did?
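(For anyone suggesting tests: I can also benchmark at the RADOS layer to take the guest OS out of the picture. This is roughly what I have in mind; the pool name and PG counts are throwaway values for a scratch pool, nothing production.)

    # scratch pool just for benchmarking
    ceph osd pool create bench 64 64

    # 60-second write test; keep the objects so they can be read back
    rados bench -p bench 60 write --no-cleanup

    # 60-second sequential read test against the objects written above
    rados bench -p bench 60 seq

    # clean up the test objects and the pool
    # (deleting the pool may require mon_allow_pool_delete=true)
    rados -p bench cleanup
    ceph osd pool delete bench bench --yes-i-really-really-mean-it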
 
OMG. My test VM was not optimized for Virtio.
I built another VM and retested.
Now it looks so much better that I doubt the results.
This is much better than the old VSAN cluster delivered.
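(If anyone wants to reproduce the "optimized for Virtio" part, the usual Proxmox knobs are the virtio-scsi-single controller and iothread on the disk. A sketch only, with placeholder VMID, storage, and disk names:)

    # use the single virtio-scsi controller so each disk gets its own iothread
    qm set 100 --scsihw virtio-scsi-single

    # attach the test disk with iothread enabled, no host page cache, and discard passthrough
    qm set 100 --scsi0 ceph-pool:vm-100-disk-0,iothread=1,cache=none,discard=on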

Atto, default test: [screenshot]

Atto, write cache disabled: [screenshot]


Well, that's going to send me into a new cycle of testing optimal configs.
Damn. I'm gonna re-cover a bunch of ground now.

I am still extremely interested in any feedback on the proper deployment of cache disks with CEPH.
I'm gonna have the boss buy some pricey SSDs. I'd like to make sure that they are worth it.
 
The DB/WAL are both things you CAN put on other disks, but it's only recommended if those disks are significantly faster than your OSD disks, e.g. NVRAM for NVMe, or NVMe/SAS SSD for spinning disks. You can read up on exactly what they do, but the WAL is something of a write cache/journal: it ingests writes so they can then be more efficiently committed and written to the OSD for storage. The DB is metadata; it can accelerate reads and writes that don't need the actual data.

A data read cache would be a portion of RAM for each OSD, and that can be tuned. Another level of write cache (effectively async writes) via e.g. RAM-backed RAID controllers is dangerous and should not be used, although that is how VSAN works.
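(The RAM knob is osd_memory_target, 4 GiB per OSD by default. A minimal sketch, assuming the hosts have enough free RAM to give each OSD 8 GiB; adjust to your own memory budget:)

    # raise the per-OSD memory target cluster-wide to 8 GiB (value is in bytes)
    ceph config set osd osd_memory_target 8589934592

    # or for a single OSD only
    ceph config set osd.3 osd_memory_target 8589934592

    # confirm the value being applied
    ceph config get osd osd_memory_target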

If you are going in that direction, you can use NVMe or NVRAM and use namespaces to split the disk evenly. There are SSDs optimized for reads and others for writes. You'll have to look up some benchmarks and recommendations based on the size of your OSDs to properly size both the WAL and DB.
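(If it's easier, ceph-volume can do the splitting for you: hand it the data disks plus one fast device and it carves the fast device into equal DB LVs, one per OSD. A sketch with placeholder device names; run the --report form first to see what it would do:)

    # dry run: show how the OSDs and their DB volumes would be laid out
    ceph-volume lvm batch --report /dev/sdb /dev/sdc /dev/sdd --db-devices /dev/nvme0n1

    # create the OSDs, with each one's DB/WAL on an LV of the NVMe device
    ceph-volume lvm batch /dev/sdb /dev/sdc /dev/sdd --db-devices /dev/nvme0n1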
 