Hi Giovanni
No, ultimately I never posted my findings - there are too many variables at play when combining caching with Ceph, and even setups that benchmarked well didn't actually perform particularly well in our real-world usage.
We did run with bcache-backed OSDs for a bit. It was fine when you were dealing with hot data, but cold data could be painfully slow - ultimately that came down to how Ceph works rather than any issue with the caching approach, and it took me a while to understand this.
The key thing to understand about Ceph is that while you might have 3 or 4 copies of every block of data on your network, Ceph will never use those additional copies to speed up reads - each block of data is read from the one OSD that Ceph has chosen as the primary for that block.
If all of your OSDs are backed by hard drives, that means the best case for a single-threaded sequential read is the speed of one single hard drive - even if your data is spread across dozens of them with multiple copies. This is very much contrary to what you might be used to from RAID 1 or RAID 5/6 arrays, where reads are spread across the mirrors or stripes to speed things up.
It actually gets worse than that, because even though the data you are reading might be one single large file, Ceph will have chopped it into objects (4 MB each by default) and spread those across several OSDs - and it is somewhat random which OSD ends up as the primary for each object. So partway through a large read you may have to wait while Ceph seeks on a different hard drive to fetch the next chunk.
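If you want to see this for yourself, the commands below show which OSDs hold a given object and which one is the primary. The pool and object names are just examples - substitute your own:

```
# list a few object names in a pool
rados -p mypool ls | head

# show where one object lives; the "p" in the acting set marks the primary OSD
ceph osd map mypool rbd_data.ab12cd.0000000000000000
#   -> ends with something like: acting ([7,3,11], p7)
#      three replicas, but osd.7 serves all the reads for this object

# the next 4 MB chunk of the same image usually maps to a different PG,
# and often a different primary - which is why one long sequential read
# ends up hopping between spindles
ceph osd map mypool rbd_data.ab12cd.0000000000000001
```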
What this all translated to in the real world was that copying a rarely accessed video file off our Ceph store would often run at around 50 megabytes per second sustained - less than half the throughput of one of the underlying drives.
This wasn't really good enough!
The solution I settled on was:
- Instead of one OSD per hard drive, I paired the hard drives into two-disk RAID 0 arrays using LVM, and set up each of those arrays as a single OSD - with the WAL on a separate enterprise-grade SSD with PLP. This means our worst-case performance is now the speed of two hard drives in RAID 0, instead of just one hard drive. It is obviously marginally less safe and generally discouraged, but it has proved a good compromise in our setup, and we have survived a few hard drive failures at this point without issue (we run 4 copies/2 minimum on our Ceph storage). There's a rough sketch of the commands after this list.
- This had the added benefit of wasting less RAM on our servers - Ceph is quite greedy with the RAM it assigns to each OSD (see the memory target note below).
- On one of our four servers (we have four Proxmox/Ceph servers, plus one witness server) we replaced the hard drive storage with pure SSD storage - so we have 8x 3TB hard drives in three servers, and 3x 7.6TB SSDs in one server. I set the primary affinity so that Ceph always treats the SSD-hosted OSDs as primary - this means reads always happen at SSD speed as long as that server is up and running (commands below).
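For the striped-pair OSDs, the commands were roughly along these lines, repeated per pair of drives (the device, VG/LV and WAL partition names here are made up for the example, not a recommendation):

```
# pair two HDDs into a striped (RAID 0) logical volume
pvcreate /dev/sda /dev/sdb
vgcreate vg_osd0 /dev/sda /dev/sdb
lvcreate --stripes 2 --stripesize 64 --extents 100%FREE --name lv_osd0 vg_osd0

# build the OSD on that LV, with its WAL on a partition of the PLP SSD
ceph-volume lvm create --data vg_osd0/lv_osd0 --block.wal /dev/nvme0n1p1
```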
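On the RAM point - the knob behind that greediness is osd_memory_target, which defaults to about 4 GiB per OSD, so halving the OSD count roughly halves what Ceph wants to grab. You can check or adjust it like this (the 6 GiB figure is purely an example, not a recommendation):

```
ceph config get osd osd_memory_target               # defaults to 4294967296 (4 GiB) per OSD
ceph config set osd osd_memory_target 6442450944    # e.g. allow each (now bigger) OSD ~6 GiB
```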
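And for the primary affinity - in practice that means dropping the primary affinity of the HDD-backed OSDs to 0, since it can't be raised above the default of 1 on the SSD ones. The OSD ids below are made up; take yours from ceph osd tree (and IIRC very old releases needed mon_osd_allow_primary_affinity enabled first):

```
# stop the HDD-backed OSDs being picked as primary (ids are examples)
for id in 0 1 2 3 4 5 6 7; do
    ceph osd primary-affinity osd.$id 0
done
# the SSD-backed OSDs keep the default primary affinity of 1, so reads are
# served from the SSD copy whenever that server holds one and is up
```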
If you have any further questions I'll do my best to answer but hopefully that makes sense!