CEPH cache disk

Apr 27, 2024
Hello folks

Sorry. This is long-ish. It's been a saga in my life ...
I'm not clear on exactly how to deploy a cache disk with CEPH.
I've read a lot of stuff about setting up CEPH, from Proxmox and the CEPH site. Browsed some forum posts. Done a bunch of test builds.

I'm an old VMware guy. And I'm converting an existing VSAN to CEPH.
In VSAN, you have disk groups, and each disk group can have cache disks assigned.

In CEPH, when you set up an OSD, you have the option of putting the DB and WAL either on the OSD or on another disk.
You can use one DB/WAL disk for multiple OSDs.
That seems kinda like cache for a VMware disk group. (Correct me if that's wrong.)

The cluster I'm converting to CEPH is performing fairly horribly. VMs get ~350 MB/s for both read and write. That's spinning rust speed. Bad.
(EDIT ... THIS 350MB/s WAS A BAD TEST CONFIG ON MY PART.)
I tested the original VSAN cluster before it was torn down, and it did 450 MB/s write and 4.5 GB/s read. Not great, but not this bad.
If VSAN could go that fast, maybe CEPH can with the proper config...

The hosts have identical SSD, no fast cache disks. (I just discovered this.)
The disks are SATA (not SAS), so they connect at 6 Gb/s. (Another lovely discovery.)
The network is 10 Gb. (It's always been that way. Not perfect, but it's exactly what VSAN ran on ... much faster.)
It is clear why they are slow.
What is not clear is why they are so much slower than VSAN was.
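For a rough sense of where the ceilings sit on this hardware, here is a back-of-the-envelope sketch. It assumes 8b/10b line coding for SATA III and SAS-3, and treats 10GbE as a 10 Gb/s data rate; protocol overhead and Ceph replication traffic are ignored, so real numbers will be lower.

```python
# Rough ceiling on payload throughput for each link in the data path.
# 8b/10b line coding carries 8 payload bits for every 10 bits on the wire.

def payload_mb_per_s(line_rate_gbit: float, encoding_efficiency: float) -> float:
    """Convert a line rate in Gbit/s to usable MB/s (decimal megabytes)."""
    return line_rate_gbit * 1000 / 8 * encoding_efficiency

links = {
    "SATA III (6 Gb/s)": payload_mb_per_s(6, 8 / 10),
    "SAS-3 (12 Gb/s)": payload_mb_per_s(12, 8 / 10),
    # 10GbE is quoted as a data rate, so no coding penalty is applied here.
    "10GbE": payload_mb_per_s(10, 1.0),
}

for name, mb in links.items():
    print(f"{name}: ~{mb:.0f} MB/s")
```

By this estimate a single SATA SSD tops out around 600 MB/s and the 10 Gb network around 1250 MB/s, so 350 MB/s is below what either link should allow.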

I've done a lot of rebuilds, trying to tune it. Gawd the time I have into this already. Weeks. Many weeks.
The number of disks in the CEPH array doesn't seem to matter. I get the same speed whether it's 2 disks per host or 6 disks per host.
I've tried using one of the disks as local cache (DB/WAL). Zero improvement over no cache disk.
I've actually read a CEPH performance tuning article that indicates CEPH on SATA (6 Gb/s) does not significantly benefit from using a SATA (6 Gb/s) cache disk.
Yet VSAN was acceptable on this very same hardware with no fast cache. Is there something I'm doing wrong when I set up cache?

The one thing that's really clear is that I must order a SAS (12 Gb/s connection) enterprise-class MU SSD for each host to use as cache.
That alone I expect to make a significant difference.

But what can I do with this gear today? How can I make it perform like VSAN did?
 
OMG. My test VM was not optimized for Virtio.
I built another VM and retested.
Now it looks so much better that I doubt the results.
This is much better than the old VSAN cluster delivered.

Atto. Default test.
1760236618832.png

Atto. Write cache disabled.
1760236830182.png


Well, that's going to send me into a new cycle of testing optimal configs.
Damn. I'm gonna re-cover a bunch of ground now.

I am still extremely interested in any feedback on the proper deployment of cache disks with CEPH.
I'm gonna have the boss buy some pricey SSDs. I'd like to make sure that they are worth it.
 
The DB and WAL are both things you CAN put on other disks, but that is only recommended if those disks are significantly faster than your data disks, e.g. NVRAM for NVMe OSDs, or NVMe/SAS SSD for spinning disks. You can read up on exactly what they do, but the WAL is something of a write cache/journal: it ingests writes so they can then be committed to the OSD more efficiently. The DB holds metadata; it can accelerate reads and writes that don’t need the actual data.

A data read cache would be a portion of RAM for each OSD, which can be tuned. Another level of write cache (effectively async writes), e.g. with RAM-backed RAID controllers, is dangerous and should not be used, although that is how VSAN works.

If you are going in that direction, you can use NVMe or NVRAM and use their namespaces to split the disk evenly. There are SSDs that are optimized for reads and others for writes. You’ll have to look up some benchmarks and recommendations based on the size of your OSDs to properly size both WAL and DB.
 
@tcabernoch IIRC from when I looked into it, using a disk for read cache was deprecated or otherwise not viable with Ceph. I just don’t recall the details now. It does have memory caching.

Our prior Virtuozzo Storage setup did have that ability.

You can set HDD OSDs to not be the primary for reads, to “force” reading from SSD. Or don't mix the two in the same pool.
 
Thanks @guruevi. (And @SteveITS)
- So don't bother with doing cache with one of the current disks, as they are all the same. K.
- And a SAS disk with high IOPS for DB/WAL on each host should be significantly faster than these SATAs. I've had good results from MU enterprise drives in the past. Gonna do this.

I read about cache tiering. And its deprecation. I believe this is what you discussed in your second paragraph? (https://docs.ceph.com/en/reef/rados/operations/cache-tiering/)

In your third paragraph, this reference to splitting the disk evenly ... you are referring to features I've not previously encountered. The Proxmox GUI allows me to select the same 'cache' disk for DB/WAL for multiple OSDs, and doesn't offer or suggest any splitting or suballocation.
It worked when I built it that way, albeit with no speed increase because the current set of disks are all identical.
(I also won't have an NVMe or NVRAM, just drives in bays.) If I'm missing something critical here, please let me know!

I struggled with a deployment plan for the fast disk.
You can't go back and add a DB/WAL location to existing OSDs, or I haven't found a way.
It seems you can just remove the OSD and redeploy it with DB/WAL pointed at your fast disk.
This matters because my cluster is built, and I've rebuilt it so many times that my boss is on my butt to get it done. Not that I care. This is rocket surgery. Non-trivial, delicate, and explosive. But I sure would like to be done soon.


.............
The info I've found on DB/WAL setup.

Another way to speed up OSDs is to use a faster disk as a journal or DB/Write-Ahead-Log device, see creating Ceph OSDs. If a faster disk is used for multiple OSDs, a proper balance between OSD and WAL / DB (or journal) disk must be selected, otherwise the faster disk becomes the bottleneck for all linked OSDs.

Huh? Looks like I need to read about that. And ... this is all there is to read ...


--db_dev <string>
Block device name for block.db.

--db_dev_size <number> (1 - N) (default = bluestore_block_db_size or 10% of OSD size)
Size in GiB for block.db.

--wal_dev <string>
Block device name for block.wal.

--wal_dev_size <number> (0.5 - N) (default = bluestore_block_wal_size or 1% of OSD size)
Size in GiB for block.wal.

https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_osds
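To make the "proper balance" point concrete on the sizing side, here is a small sketch that applies the documented defaults (DB at 10% of OSD size, WAL at 1%) to several OSDs sharing one fast disk. The disk sizes are hypothetical (a 1 TB OSD taken as roughly 931 GiB, an 800 GB fast disk as roughly 745 GiB), and the helper name `plan_db_wal` is mine, not a Proxmox or Ceph API. It only checks space, not whether the fast disk has the throughput to serve every OSD.

```python
from dataclasses import dataclass

@dataclass
class DbWalPlan:
    db_gib_per_osd: float
    wal_gib_per_osd: float
    total_gib: float
    fits: bool

def plan_db_wal(osd_size_gib, osd_count, fast_disk_gib,
                db_fraction=0.10, wal_fraction=0.01):
    """Space needed on a shared fast disk using the 10% / 1% defaults."""
    db = osd_size_gib * db_fraction
    wal = osd_size_gib * wal_fraction
    total = (db + wal) * osd_count
    return DbWalPlan(db, wal, total, total <= fast_disk_gib)

# Hypothetical host: six 1 TB OSDs sharing one 800 GB fast disk.
plan = plan_db_wal(osd_size_gib=931, osd_count=6, fast_disk_gib=745)
print(f"DB per OSD:  {plan.db_gib_per_osd:.1f} GiB")
print(f"WAL per OSD: {plan.wal_gib_per_osd:.1f} GiB")
print(f"Total:       {plan.total_gib:.1f} GiB, fits: {plan.fits}")
```

If the total doesn't fit, the `--db_dev_size` and `--wal_dev_size` options above are where you would set smaller values by hand.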
 
So, after all of that questing, testing, and inquiry, I exposed it to some real-world load.
I restored a whole bunch of client VMs to the cluster and ran them like hell.
And then I completely filled the datastore till it stopped working, to see what it did.

Check out this amazing graph.
The total amount of storage went down as I dumped more stuff on it.
It appears to me that it drops backfill and backfill_wait from the total storage, so when you are close to filling up and it lags, you get this really strange scene.

1760322424113.png

I found that pretty disturbing, but it made me really think about the total storage on the cluster.
VSAN delivered a lot more usable space on this set of devices.
This was obviously not enough.
And clearly, when you get near the 'not enough' point, CEPH does odd things.

So I prioritized capacity over performance.
Instead of a cache disk, I just used the capacity disk that I had planned on removing.

And then I prioritized capacity over redundancy.
This is a CEPH array. The hosts themselves are redundant.
I stole the mirror disk from the OS, so each host is just running on one SSD now.
Added that to the CEPH array.


Final result:
4 x hosts, each with 7 x 1TB SSD = 28 total OSDs.

I get about 6.1 TB of usable CEPH RBD
(I think. Unless it changes again after the rebalance.)
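As a sanity check on that number, a rough upper bound can be worked out from the disk count. This sketch assumes a replicated pool with size 3 and a full ratio of 0.95; neither setting is stated in this thread, so treat both as assumptions and swap in the real values. It also ignores BlueStore overhead and uneven data distribution.

```python
def usable_tib(osd_count, osd_size_tb, replicas, full_ratio):
    """Upper-bound estimate of usable pool space in TiB."""
    raw_bytes = osd_count * osd_size_tb * 10**12   # vendor TB are decimal
    raw_tib = raw_bytes / 2**40                    # convert to binary TiB
    return raw_tib / replicas * full_ratio

# 4 hosts x 7 x 1 TB SSD = 28 OSDs (from the post above).
raw_share = usable_tib(28, 1.0, replicas=3, full_ratio=1.0)
at_full = usable_tib(28, 1.0, replicas=3, full_ratio=0.95)

print(f"Raw / replicas:      {raw_share:.2f} TiB")
print(f"Up to the full mark: {at_full:.2f} TiB")
```

Anything reported well below that bound suggests the data is not evenly spread across the OSDs yet, which is expected in the middle of a rebalance.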

The speeds are damn good. Much better than VSAN.
I may have lost hosting space with CEPH, but I'll deliver better hosting performance.
This Atto test was run with the cluster under load and doing an extensive rebalance.
Really. These are killer speeds. I'd be happy to get this from my hosting provider.

1760323412944.png
 
Thanks, Steve.

The reason I wanted to do a DB/WAL disk is that the capacity SSDs are SATA ... so they connect at 6 Gb/s.
This is a Gen13 Dell. They should have bought SAS for 12 Gb/s. Terrible original build choices.
And I thought the speed I was getting was bad when I first got that idea.
So I wanted to get a SAS cache disk that would connect at 12 Gb/s with high IOPS to help overcome the hobbled 6 Gb/s disks.

But the speed I'm testing now is much better than I originally thought, and it's way freakin' better than VSAN delivered.
And at the same time, the amount of usable space I get is way less than VSAN delivered.
So it turned out that capacity was my main priority, not speed, and I chose the disk architecture accordingly.

This "Ceph capacity depends on replication" is still pretty bizarre to me.
It is nutty to watch overall capacity go up and down as the backfill queues and catches up.
At the time of my last post, it had 6.1TB capacity.
Now I'm 95% rebalanced, and it shows 7.3TB capacity.
This is going to take some getting used to. I understand what's happening, but this is quite unlike any other storage I've administered, including VSAN.

I appreciate the feedback from you guys. I think I'm ready to go live with this thing next week.
 
When I came back to work this week, that CEPH datastore was showing 8.19 TB.

That is just wild, watching total space go up and down. I'm not sure how to plan for that.
I did abuse it. I filled it up. I ran a bunch of things at once. But, that might happen in day to day use.

Should I just pretend it says 6 TB and use that as a reasonable limit? I've never even heard about fluctuating total space as a CEPH issue.
 
Are you adding to/removing from cephfs by chance?

Edit: ours varies over time but not by that much, 10-20 GB here or there. We only use cephfs for ISO storage and that rarely changes.
 
The total capacity of the cluster is defined as the sum of all OSDs. This number only changes when you add or remove disks.

Do not confuse that with the maximum available space for pools which depends on replication factor or erasure code settings and currently used capacity.
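A toy model may help show why the available figure for a pool can move while the raw total stays fixed. This is a simplified sketch, not Ceph's actual calculation: it assumes equal CRUSH weights, new data spread evenly over all OSDs, replica size 3, and a full ratio of 0.95. Under those assumptions the pool can only grow until the fullest OSD hits the full mark.

```python
def max_avail(osd_sizes, osd_used, replicas=3, full_ratio=0.95):
    """Simplified pool headroom: limited by the OSD closest to full."""
    headroom = [full_ratio * size - used
                for size, used in zip(osd_sizes, osd_used)]
    tightest = max(min(headroom), 0.0)
    # Every OSD receives an equal share of new data, so the tightest OSD
    # caps all of them; divide by replicas to get client-visible space.
    return tightest * len(osd_sizes) / replicas

sizes = [1.0] * 4                      # four 1 TB OSDs (hypothetical)
balanced = max_avail(sizes, [0.5, 0.5, 0.5, 0.5])
skewed = max_avail(sizes, [0.3, 0.4, 0.5, 0.8])   # same total used

print(f"Balanced: {balanced:.2f} TB available")
print(f"Skewed:   {skewed:.2f} TB available")
```

Both cases hold the same amount of data, but the skewed one reports far less room. One plausible reading, not confirmed in this thread, is that a heavy backfill leaves OSDs temporarily unbalanced, so the available figure dips and then recovers as the rebalance completes.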
 
I was referring to the "Total Size" which changes:
1760642066307.png
1760642095503.png
It is subtle, but note the total size line is not flat. It varies from 5.69 to 5.74 in this cluster over the past month. IIRC cephfs usage subtracts from that, but I didn't double check. Though in our case cephfs usage hasn't changed, so that's all internal.
 
Yes, I did two things you probably don't see often.
- I filled up the ceph datastore. This is a pre-prod test, and I needed to explore how it fails. It did not fail, but the things it actually did do seemed quite odd to me.
- I added a disk to each host, so there was a ton of backfill. And that acts very differently from other storage types, such as an array that needs to resilver.

Yes, I'm talking about Total Size. If you scroll up, you'll see the screenshot I posted. The drop in Total Size is not subtle. I'd call it dramatic.
That screenshot was taken right as I filled up the ceph datastore. I was cloning VMs to generate data and fill it.
 