Flashcache vs Cache Tiering in Ceph

Dilvan

Feb 15, 2016
Hi,

I assembled a small Proxmox 4.1 cluster (3 Proxmox nodes + 1 NFS server for backup) using 1 Gbit Ethernet. All 3 nodes run OSDs (one 2 TB disk each) with 10 GB SSD partitions for journaling. My VMs use Ceph RBD for storage. Now I want to use SSDs for caching (either using free space on the journaling SSDs or buying new SSDs). Which is faster/better:
  • Use the SSD as Flashcache (or some similar solution). In this case, would Proxmox be able to use this Flashcache for my VMs running on RBD (I only saw examples using local storage, not Ceph)?
  • Use the SSD for cache tiering in Ceph (one cache per node).
BTW, is Proxmox planning to support Docker in the near future?

Dilvan.
 

I'd use cache tiering (because that is what I am familiar with and use widely at work, although on a much larger scale) with an appropriate cache mode for your use case (see the Ceph documentation). Using a custom CRUSH hook to split HDD-OSDs from SSD-OSDs is highly recommended, since it makes setting this up and extending it a lot easier. Google "Wido den Hollander" for an example.
Using custom CRUSH rules you can then set up whatever cache scheme you like by using different buckets (osd, host, rack, row, etc.), e.g. Rack1-Host1-SSD, Rack1-Host2-SSD. From your description you'd probably want three cache pools as read cache, each running on a separate host's OSDs.
This is true for read cache only.


If you move to other cache modes, e.g. writeback, you need to keep your replication size to maintain the same tolerance to drive failure. In this case you would probably want size=3 with bucket type host.

If you are not using local (host-specific) SSDs for the cache pool, but replicate across multiple hosts, you need to know your bottleneck though. 1 Gbit/s is about 125 MB/s at best. With replication size=3 you'd be limited to 125/2 ≈ 62 MB/s per remote host. That is because 2 parts would come from remote nodes, while one comes from the local node. Not sure THAT would make sense on a system with a single 1 Gbit/s link.
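For anyone who wants to redo that back-of-the-envelope calculation for other link speeds, here is a minimal sketch. It assumes decimal units (1 Gbit/s = 1000 Mbit/s), ignores protocol overhead, and assumes exactly one copy is local; the function name is made up for illustration. The exact figure shifts a little depending on whether you count in decimal or binary units.

```python
def per_remote_host_mb_s(link_gbit: float, replica_size: int) -> float:
    """Rough share of a single NIC available to each remote replica host.

    Assumption: one copy lives on the local node, so (replica_size - 1)
    copies have to share the one network link.
    """
    if replica_size < 2:
        raise ValueError("need at least 2 replicas for any remote traffic")
    link_mb_s = link_gbit * 1000 / 8      # 1 Gbit/s -> 125 MB/s, no overhead
    remote_copies = replica_size - 1
    return link_mb_s / remote_copies


for gbit in (1, 10):
    share = per_remote_host_mb_s(gbit, replica_size=3)
    print(f"{gbit} Gbit/s link, size=3: {share:.1f} MB/s per remote host")
```

Real-world numbers will be lower once TCP and Ceph overhead are taken into account.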

Note: I've never done this with SSD-Partitions, just full SSD-Osds.


ps: you can read this here for some more clues:
https://forum.proxmox.com/threads/ssd-ceph-and-network-planning.25687/#post-128696




In my opinion, cache tiering in any mode other than read-only only makes sense when you have bigger network links, a massive number of OSDs and/or an erasure-coded pool.
I know for a fact that with a low number of nodes (5) and OSDs (4 per node), writeback cache mode over a 4x 1 Gbit/s bond in balance-tcp mode makes no sense and is even counter-productive compared to just putting the HDD-OSDs on SSD journals and calling it a day. I talked someone through the process of setting it up in December. Results were horrendous. It only makes sense at scale or when sticking to local caching.

Personal note: we use cache tiering at work on multiple clusters of 30+ nodes, each node with 48x HDD (no SSD journal) and 8x NVMe PCIe SSDs (for cache pools / fast replicated pools), on 2x 10G NIC + 2x/4x dual-40G NIC based networks. With custom CRUSH hooks and 3 different local datacenters (< 10 km distance), we can get quite a bit of customizability out of the hardware and cache modes, for best-case performance based on the data/application stored on those pools.
 

Thank you very much for your detailed answer, it was very helpful. May I pick your brain once more?
I am not sure about the impact of the various types of caches.
In my case I have, in each node, Proxmox + OSD journaling (11 GB) on the SSD and the OSD on a 2 TB disk. I can easily get small HDDs. Would it make sense to run Proxmox on a small HDD, an OSD for read-only tiering on the SSD, and the OSD + journal on the 2 TB disk? And what if I move journaling to the SSD (2 partitions: one for journaling and another for the OSD)?
 
I am not sure about the impact of the various types of caches.
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
That should clear it up for you in detail.

There are basically 4 modes:
  1. Writeback
  2. Read-only
  3. Read-forward
  4. Read-proxy
Afaik, only in "writeback" mode do you need to keep in mind that the pool used as cache tier must also replicate data according to your desired failure domains. It is also the only mode where you can forgo journals on SSD, provided you have a writeback cache tier on said SSDs. It would basically allow you to absorb large amounts of IO before flushing to the slow HDDs, based on whatever settings you choose (compare cache sizing http://docs.ceph.com/docs/master/rados/operations/cache-tiering/#cache-sizing and cache age http://docs.ceph.com/docs/master/rados/operations/cache-tiering/#cache-age).
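
To make the moving parts a bit more concrete, here is a small sketch that only prints the `ceph` commands for attaching a cache pool, as I understand them from the cache-tiering documentation linked above. The pool names, the size and the flush age are placeholders, the helper function is made up for illustration, and nothing is executed against a cluster. Depending on your Ceph release some modes may ask for an extra confirmation flag, so check the docs for your version before running anything.

```python
VALID_MODES = {"writeback", "readonly", "readforward", "readproxy"}


def cache_tier_commands(backing_pool, cache_pool, mode, max_bytes,
                        min_flush_age_s=None, overlay=True):
    """Return the ceph CLI commands (as strings) to attach a cache tier.

    Hypothetical helper: it builds command lines only, it runs nothing.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown cache mode: {mode}")
    cmds = [
        f"ceph osd tier add {backing_pool} {cache_pool}",
        f"ceph osd tier cache-mode {cache_pool} {mode}",
    ]
    if overlay:
        # Route client traffic for the backing pool through the cache pool.
        cmds.append(f"ceph osd tier set-overlay {backing_pool} {cache_pool}")
    # Cache sizing: start flushing/evicting relative to this many bytes.
    cmds.append(f"ceph osd pool set {cache_pool} target_max_bytes {max_bytes}")
    if min_flush_age_s is not None:
        # Cache age: minimum object age before it may be flushed.
        cmds.append(
            f"ceph osd pool set {cache_pool} cache_min_flush_age {min_flush_age_s}"
        )
    return cmds


# Placeholder names and values, purely for illustration.
for cmd in cache_tier_commands("hdd_pool", "ssd_cache", "writeback",
                               max_bytes=10 * 1024**3, min_flush_age_s=600):
    print(cmd)
```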

In my case, I have, in each node, Proxmox + OSD Journaling (11G) in the SSD and the OSD in a 2T disk. I can easily get small HDs. Would it make sense to run Proxmox in a small HD, an OSD for read-only tiering in the SSD, and the OSD+Journaling in the 2T disk? And if I move Journaling to the SSD (2 partitions: one for Journaling and another for OSD)?

So you have per node:
  • 1x SSD with 2 partitions:
    1. OS: Proxmox
    2. 11GB Journal for HDD-OSD.X
  • 1x 2TB HDD-OSD.X

OS-Disk:
I always advise people that it is better to use a small and cheap SSD (32 GB would do) rather than a small and cheap HDD as the OS drive for Proxmox, because it typically has 5-30x faster access times (lower latency), which leads to less IO wait. It does not even have to be an expensive enterprise-grade SSD with power-loss protection. Just make sure to have a working backup of your OS drive (e.g. by using Clonezilla plus at least a regular backup of "/etc/*", or any other method that gets you back up and running within 30 minutes).

We use Adata Premier Pro SP 900 at work. But that is because we have around 1000 of those SSDs left over from when we used them as SSD-OSDs for cache tiering (we have since moved to Samsung SSD 950 Pro 512GB, M.2).
At home I use a SanDisk SDSSD P06 (not the best disk, you can get a lot better performance for 30-40 Euro nowadays).
But generally, on a budget, ANY consumer SSD you can get is better than an HDD (the exception being SSHDs).

Journals:
Since you are only using a single HDD-OSD per node, you can safely ignore most of the recommendations regarding SSD lifetime in terms of TBW (terabytes written).
100 TBW, for example, means that you can write your 2 TB completely 50 times. How often would THAT happen (check how often you delete and rewrite data)? Especially considering that your 2 TB disk probably has less than 200 MB/s write speed, it would take about 6 days of writing at 200 MB/s continuously to get there.
The more HDD-OSDs you put on a journal SSD, the sooner it will likely wear out, because it takes 2, 3 or 4 times the amount of data written compared to a single OSD when you use 2, 3 or 4 HDD-OSDs per journal.
Keep in mind that when you lose your journal, you typically lose your OSD as well.
For general guidance on journal drives, check here:
http://www.sebastien-han.fr/blog/20...-if-your-ssd-is-suitable-as-a-journal-device/
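
The endurance arithmetic above is easy to redo for your own drives. This is just a rough sketch with made-up function names; it assumes decimal units (1 TB = 1,000,000 MB) and that every write to an HDD-OSD also passes through the journal SSD at the same rate.

```python
def full_drive_writes(tbw_tb: float, capacity_tb: float) -> float:
    """How many times the data disk can be fully rewritten within the TBW rating."""
    return tbw_tb / capacity_tb


def days_to_exhaust(tbw_tb: float, write_mb_s: float, osds_per_journal: int = 1) -> float:
    """Days of continuous writing until the journal SSD reaches its TBW rating.

    Assumption: each OSD sharing the journal writes at write_mb_s all the time.
    """
    total_mb = tbw_tb * 1_000_000                 # 1 TB = 1,000,000 MB
    seconds = total_mb / (write_mb_s * osds_per_journal)
    return seconds / 86_400


print(f"full rewrites of a 2 TB disk at 100 TBW: {full_drive_writes(100, 2):.0f}")
for n in (1, 2, 4):
    days = days_to_exhaust(100, 200, osds_per_journal=n)
    print(f"{n} OSD(s) per journal at 200 MB/s: {days:.1f} days")
```

In practice no small cluster writes at full speed around the clock, so treat these numbers as the worst case.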

To journal or not to journal:
At work we run all HDD-OSDs without dedicated journals on any faster media. But that is because we have a lot of them, which in turn makes up for slower individual write speeds. When I say a lot, I am talking around 1440 HDDs per cluster. For a small-scale cluster like yours, you want an SSD journal to absorb the write IO into the cluster, especially when you only have 3 OSDs. A single HDD is capable of handling roughly 109-150 IOPS.
SSDs handle a couple thousand IOPS even with a single queue.



Working with partitions and Ceph:
You can use partitions, but that does not mean you should. All official guidance says not to do it, and I have to concur. It's a nightmare with regard to performance and maintainability, and it is quite finicky.

If you have to consolidate stuff onto a single SSD, I'd sooner put Proxmox and the single journal for your 2 TB HDD onto the same SSD, and use a full disk as a LOCAL SSD-OSD.
 
I'm not sure I explained the "custom CRUSH hook" part well.
It's basically a script that gets triggered every time an OSD is started on a Ceph node. It makes sure that said OSD is added to the CRUSH map according to the characteristics of the disk, perhaps even the hostname or other information you make it parse from the host it is triggered on.

Without a custom hook you typically end up with this:
  • Root
    • Node1
      • OSD.0
      • OSD.3
    • Node2
      • OSD.1
      • OSD.4
    • Node3
      • OSD.2
      • OSD.5

With a custom hook script you can give your nodes the following hostname-Scheme:
"RoomW-RackX-NodeY-ProxmoxZ"
e.g. "RoomA101-Rack3-Node5-Proxmox1"
e.g. "RoomA101-Rack2-Node1-Proxmox2"
e.g. "RoomA101-Rack3-Node6-Proxmox3"

Assuming you have one SSD-OSD and one HDD-OSD you'd end up with this:
  • Root
    • Root-HDD
      • A101-HDD
        • A101-Rack2-HDD
          • A101-Rack2-Node1-HDD
            • OSD.0
        • A101-Rack3-HDD
          • A101-Rack3-Node5-HDD
            • OSD.1
          • A101-Rack3-Node6-HDD
            • OSD.2
    • Root-SSD
      • A101-SSD
        • A101-Rack2-SSD
          • A101-Rack2-Node1-SSD
            • OSD.3
        • A101-Rack3-SSD
          • A101-Rack3-Node5-SSD
            • OSD.4
          • A101-Rack3-Node6-SSD
            • OSD.5
Now in your case you could create a cache-tier pool that uses only SSD-OSDs with bucket type host, by selecting A101-Rack2-Node1-SSD only.
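
To give an idea of what such a hook does with the hostname scheme above, here is a minimal sketch of the parsing part. To my understanding Ceph calls the configured location hook with `--cluster`, `--id` and `--type` and expects a single line of key=value pairs on stdout; that wiring, and the detection of whether a given OSD sits on an SSD or an HDD, are left out here. The function name and the mapping of the tree levels onto the CRUSH types root/room/rack/host are my assumptions for illustration, not a tested production hook.

```python
import re

# Matches e.g. "RoomA101-Rack3-Node5-Proxmox1" (scheme RoomW-RackX-NodeY-ProxmoxZ)
HOSTNAME_RE = re.compile(r"^Room(?P<room>[^-]+)-(?P<rack>Rack\d+)-(?P<node>Node\d+)-")


def crush_location(hostname: str, media: str) -> str:
    """Build a CRUSH location line from the hostname and the media type.

    media must be "HDD" or "SSD"; how the hook finds that out for a given
    OSD is deliberately not covered in this sketch.
    """
    if media not in ("HDD", "SSD"):
        raise ValueError(f"unknown media type: {media}")
    m = HOSTNAME_RE.match(hostname)
    if m is None:
        raise ValueError(f"hostname does not follow the scheme: {hostname}")
    room, rack, node = m["room"], m["rack"], m["node"]
    return (
        f"root=Root-{media} "
        f"room={room}-{media} "
        f"rack={room}-{rack}-{media} "
        f"host={room}-{rack}-{node}-{media}"
    )


print(crush_location("RoomA101-Rack3-Node5-Proxmox1", "SSD"))
print(crush_location("RoomA101-Rack2-Node1-Proxmox2", "HDD"))
```

A real hook would call something like this with the local hostname and the media type it detected for the OSD being started.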

So you would basically create 3 pools, one on each host, and set them as cache pools in read-only mode for the backing storage pool, which uses replication of 3 across all HDD-OSDs by using bucket type OSD under Root-HDD.

With this it's very easy to get pools configured exactly how you want them. Not just cache tiers, but also replicated pools that avoid specific failure domains. So instead of doing replication size=3 on HDD host buckets, you could do size=2 on racks, thereby ensuring that data is replicated onto 2 different racks and you can survive the loss of a whole rack.
 
I once tried a setup like this with 2 roots (one per type) and then specified in the rules to put the first data copy on the SSD root and the replica on the HDD root. It went quite well until I decided to do maintenance on a 3-node Proxmox/Ceph cluster. It turned out that there was a percentage of data where primary and replica were on the same node. Unfortunately that meant that this data was gone while the node was down. Luckily I could quickly recover without data loss by just booting the node back up. Since then I have moved to a single root with a read-affinity setup. That is still not optimal, as there is a percentage of data that lands only on HDDs, and read speed suffers from this.

Does anyone have a CRUSH ruleset template that has a concept of SSD/HDD roots and is capable of placing the replica on a different node than the primary data copy?

I can read at about 40 MB/s in VMs, with quite high IO wait. Writing is no issue as I have placed the journals for the HDDs on SSDs. Proxmox backups are even worse when there is real data to be read from the disks! I have discard configured in the VMs, so they only take up the space of their real data on the Ceph cluster.

Code:
INFO: Starting Backup of VM 138 (qemu)
INFO: status = running
INFO: update VM 138: -lock backup
INFO: VM Name: tbase-dc0-pg0
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/NAS-4/dump/vzdump-qemu-138-2016_09_24-22_14_33.vma.lzo'
INFO: started backup task 'a01593fd-e122-4253-b513-32a3741d11be'
INFO: status: 0% (263782400/594852970496), sparse 0% (229019648), duration 3, 87/11 MB/s
INFO: status: 1% (6957760512/594852970496), sparse 0% (5655617536), duration 48, 148/28 MB/s
INFO: status: 2% (12409765888/594852970496), sparse 1% (10438397952), duration 76, 194/23 MB/s
INFO: status: 3% (18453954560/594852970496), sparse 2% (16083025920), duration 106, 201/13 MB/s
INFO: status: 4% (24157749248/594852970496), sparse 3% (21719265280), duration 114, 712/8 MB/s
INFO: status: 5% (29792862208/594852970496), sparse 4% (27197272064), duration 123, 626/17 MB/s
INFO: status: 6% (36665229312/594852970496), sparse 5% (33803714560), duration 141, 381/14 MB/s
INFO: status: 7% (42680385536/594852970496), sparse 6% (39655845888), duration 152, 546/14 MB/s
INFO: status: 8% (47658237952/594852970496), sparse 7% (44587409408), duration 158, 829/7 MB/s
INFO: status: 9% (54364405760/594852970496), sparse 8% (51287932928), duration 165, 958/0 MB/s
INFO: status: 10% (59532443648/594852970496), sparse 9% (54914899968), duration 221, 92/27 MB/s
INFO: status: 11% (65443332096/594852970496), sparse 9% (54914899968), duration 446, 26/26 MB/s
INFO: status: 12% (71387643904/594852970496), sparse 9% (54915158016), duration 695, 23/23 MB/s
INFO: status: 13% (77357318144/594852970496), sparse 9% (54915158016), duration 924, 26/26 MB/s
INFO: status: 14% (83304382464/594852970496), sparse 9% (54915158016), duration 1162, 24/24 MB/s
INFO: status: 15% (89268551680/594852970496), sparse 9% (54915158016), duration 1424, 22/22 MB/s
INFO: status: 16% (95206703104/594852970496), sparse 9% (54915272704), duration 1697, 21/21 MB/s
INFO: status: 17% (101127946240/594852970496), sparse 9% (54915272704), duration 1927, 25/25 MB/s
INFO: status: 18% (107099455488/594852970496), sparse 9% (54915334144), duration 2164, 25/25 MB/s
INFO: status: 19% (113046126592/594852970496), sparse 9% (54915383296), duration 2399, 25/25 MB/s
INFO: status: 20% (119009443840/594852970496), sparse 9% (54915383296), duration 2644, 24/24 MB/s
INFO: status: 21% (124923871232/594852970496), sparse 9% (54915403776), duration 2916, 21/21 MB/s
INFO: status: 22% (130873294848/594852970496), sparse 9% (54916509696), duration 3157, 24/24 MB/s
INFO: status: 23% (136821866496/594852970496), sparse 9% (54917058560), duration 3415, 23/23 MB/s
INFO: status: 24% (142795472896/594852970496), sparse 9% (54917058560), duration 3678, 22/22 MB/s
INFO: status: 25% (148747976704/594852970496), sparse 9% (54917165056), duration 3926, 24/24 MB/s
INFO: status: 26% (154692616192/594852970496), sparse 9% (54917165056), duration 4172, 24/24 MB/s
INFO: status: 27% (160650625024/594852970496), sparse 9% (54917165056), duration 4440, 22/22 MB/s
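
For reference, the sustained read rate in the slow part of that log can be worked out from the byte counters and durations. The "sparse" counter stops growing after about 10%, so from there on the backup is reading real data. Below is a small sketch that parses two of the status lines (11% and 27%) and averages between them; it assumes decimal megabytes, and the helper names are made up.

```python
import re

STATUS_RE = re.compile(r"status: \d+% \((\d+)/\d+\).*duration (\d+),")


def parse_status(line: str):
    """Return (bytes_done, duration_seconds) from a vzdump status line."""
    m = STATUS_RE.search(line)
    if m is None:
        raise ValueError(f"not a status line: {line}")
    return int(m.group(1)), int(m.group(2))


def avg_mb_s(first_line: str, last_line: str) -> float:
    """Average throughput between two status lines, in decimal MB/s."""
    b0, t0 = parse_status(first_line)
    b1, t1 = parse_status(last_line)
    return (b1 - b0) / (t1 - t0) / 1_000_000


start = ("INFO: status: 11% (65443332096/594852970496), "
         "sparse 9% (54914899968), duration 446, 26/26 MB/s")
end = ("INFO: status: 27% (160650625024/594852970496), "
       "sparse 9% (54917165056), duration 4440, 22/22 MB/s")

print(f"average while reading real data: {avg_mb_s(start, end):.1f} MB/s")
```

That average is in line with the 21-26 MB/s per-line figures the log itself reports for that phase.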
 
[...] 2 type roots and then specify in rules to put first data copy on SSD root and replica on HDD. [...]
Does anyone have a crush ruleset template that has a concept of SSD/HDD roots and is capable of placing replica on different node that primary data copy?

We only use SSD/HDD with copies on different media for a large-capacity single-node cluster (a very specific use case) and for datacenter (campus) failure domains where the same node has SSDs, NVMe drives and HDDs, whereby we can lose 2 out of 5 datacenters. We have not run into this issue yet. We do not use it for production though, more for internal semi-serious testing on live systems.

Other than that, we use storage medium separation only in conjunction with caching tiers.


For that it works very well.
It is in production use on failure domains from "region" to "campus" to "building", even down to "rack" level, depending on the specific use case, hosting both customer and in-house solutions.

Maybe cache tiering is the way to go for you?
 
From your description you'd probably wanna use three Cache-pools as Read-Cache, each running on a separate hosts OSD's.
This is true for Read-Cache only.

Sorry for warming up this old thread, but it is appropriate for my question and probably the place others would look for it:
You mentioned using more than one separate cache pool, one per host, in read-only mode. Even the Ceph documentation mentions the possibility of having more than one read-only cache pool linked to a backing pool, and that you do not need an overlay in this case. However, it does not work for me.
What I do:
  • have a pool hdd_pool, which contains the backing data
  • create 3 pools designated as cache (ssd_cache_01, ssd_cache_02, ssd_cache_03), each consisting of one OSD on a different host
  • connect these pools to hdd_pool as cache-tiers in read-only mode.
  • set some basic parameters on the cache pools, mainly their designated cache size in bytes
As far as the Ceph documentation is concerned, this should be enough for read-only operation. However, using
rados -p <pool> ls
never shows any content in these pools, and my backing pool performance does not indicate the cache being used.
When I enable overlay it works, but with overlay you can only use one cache pool altogether, not one per host.

Can anyone shed some light on this?