Ceph tier cache question

plastilin

Renowned Member
Hi all. I have the following configuration: 3 nodes, each with four 6 TB disks and one 1 TB NVMe disk, for a total of 5 OSDs per node. I decided to enable Ceph's cache tiering functionality with the following steps:

Code:
ceph osd pool create data 128 128
ceph osd pool create cache 128 128
ceph osd tier add data cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay data cache
ceph osd pool set cache target_max_bytes 100G
ceph osd pool set cache hit_set_type explicit_hash
ceph osd pool set cache hit_set_count 8
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache min_read_recency_for_promote 1
ceph osd pool set cache min_write_recency_for_promote 1
ceph osd pool application enable data rbd
pvesm add rbd ceph --monhost "10.50.250.1 10.50.250.2 10.50.250.3" --pool data --content images

Will this have any effect, or are all these steps useless? Is this possibly a misconfiguration?
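
(For reference, whether the tier and overlay were actually applied can be checked with standard Ceph commands; pool names as above. This is just a sanity check, not part of the setup itself.)

Code:
# Shows tier_of, cache_mode and hit_set settings for the cache pool
ceph osd pool ls detail

# Shows per-pool usage, so you can see whether objects land in the cache pool under load
ceph df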
 
Thanks for the answer. That raises some questions.

1. Can I split the existing NVMe into an equal number of partitions, for example four partitions of 250 GB each per node, and specify these partitions when creating the OSDs to store the RocksDB and WAL?

2. Is it possible to specify one NVMe partition for both types of metadata, or does each type of metadata need its own partition?

3. How much metadata space is needed for a single 6 TB OSD?

4. Also, if I understand correctly, losing one of the NVMe drives effectively means losing the whole node it sits in, until I replace it and recreate the OSDs?

Thank you
 
Thanks for the answer. That raises some questions.

1. Can I split the existing NVMe into an equal number of partitions, for example four partitions of 250 GB each per node, and specify these partitions when creating the OSDs to store the RocksDB and WAL?
Yes. By defining a DB/WAL size at OSD creation, Proxmox automatically creates a partition of that size. I would specify the size manually: if you don't, Proxmox takes a percentage of the disk, which might leave you with not all OSDs backed by the DB/WAL device because it runs out of space.
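
A minimal sketch of what that looks like on the command line, assuming the HDD is /dev/sdb and the NVMe is /dev/nvme0n1 (check the pveceph man page for the exact option names on your version; sizes are given in GiB):

Code:
# Create an OSD on the HDD with an explicitly sized RocksDB (and WAL) partition on the NVMe
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_size 250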

2. Is it possible to specify one NVMe partition for both types of metadata, or does each type of metadata need its own partition?
Yes. If you have a DB disk, the WAL is automatically stored on the DB disk unless you explicitly tell it otherwise.

The WAL is placed with the DB, if not specified separately. https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pve_ceph_osds

3. How much metadata space is needed for a single 6 TB OSD?
Roughly 4% of gross capacity: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing --> about 240 GB for a 6 TB OSD. Note that four such OSDs would need around 960 GB, i.e. essentially the whole 1 TB NVMe.

4. Also, if I understand correctly, losing one of the NVMe drives effectively means losing the whole node it sits in, until I replace it and recreate the OSDs?

Yes, all OSDs that use a DB/WAL device are down if you lose that device. If you have one DB/WAL device per host and all OSDs are backed by it, losing that NVMe means the complete node is down. If you then lose another disk on a different host, you might have unavailable data because min_size is not reached on a 3-node system.
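
To see which OSDs sit on a given DB/WAL device (and would therefore go down with it), the device-to-daemon mapping can be inspected; a quick sketch using standard Ceph commands, exact output fields vary by release:

Code:
# List known physical devices and the OSD daemons that use them
ceph device ls

# Per-OSD view: dump one OSD's metadata and look for its DB/WAL backing devices
ceph osd metadata 0 | grep -i -E 'device|db|wal'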
 
And what about dm-cache, which, as I understand it, came as a replacement for the tier cache? Does it make sense to use it?

Suppose that I have 4 HDDs of 4 TB with the names /dev/sda, /dev/sdb, /dev/sdc and /dev/sdd, 1 SSD of 2 TB with the name /dev/sde and another SSD of 240 GB with name /dev/sdf.

To create a dm-cache, it is recommended to use about 1% of the HDD size for metadata and 10% of the HDD size for the cache; with 16 TB of HDD that works out to roughly 160 GB of metadata and 1.6 TB of cache, which matches the 240 GB and 2 TB SSDs.

Create physical volumes on all drives using the command:
pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

Create a volume group named ceph-vg on 4 HDDs using the command:
vgcreate ceph-vg /dev/sda /dev/sdb /dev/sdc /dev/sdd

Create a volume group named cache-vg on 2 SSDs using the command:
vgcreate cache-vg /dev/sde /dev/sdf

Create a logical volume named data-lv on 4 HDDs with a size of 16 TB (4 TB per drive) using the command:
lvcreate -n data-lv -L 16T ceph-vg

Create a logical volume named cache-lv on 1 SSD with a size of 2 TB using the command:
lvcreate -n cache-lv -L 2T cache-vg

Create a logical volume named meta-lv on a second SSD with a size of 240 GB using the command:
lvcreate -n meta-lv -L 240G cache-vg

Create a dm-cache hybrid volume named ceph-cache with writeback mode and the default smq caching policy using the command:
dmsetup create ceph-cache --table "0 $(blockdev --getsize /dev/ceph-vg/data-lv) cache /dev/cache-vg/meta-lv /dev/cache-vg/cache-lv /dev/ceph-vg/data-lv 512 1 writeback default 0"

Create the Ceph OSD on the dm-cache hybrid volume using the command:
pveceph createosd /dev/mapper/ceph-cache
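
Before putting an OSD on top, the resulting dm-cache target can at least be sanity-checked with the standard device-mapper tools (same hypothetical device names as above):

Code:
# Show the loaded table: metadata dev, cache dev, origin dev, block size, policy
dmsetup table ceph-cache

# Show runtime state: used/total cache blocks, read/write hits and misses, dirty blocks
dmsetup status ceph-cache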

Will it work? And how effective would it be?
 
Will it work? And how effective would it be?
It may work, but then you would have only one OSD in the host. And with all the extra layers, debugging or replacing a disk would be a nightmare.

Use the 240 GB SSD for the RocksDBs and WALs of the HDD OSDs and the 2 TB SSD as a single OSD. This way you can have HDD pools and SSD pools, which will work for e.g. CephFS or RBD.
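
For reference, a sketch of how such separate HDD and SSD pools can be built on device classes (pool names and PG counts here are only examples):

Code:
# Replicated CRUSH rules restricted to a device class
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd

# Pools pinned to those rules
ceph osd pool create data_hdd 128 128 replicated replicated_hdd
ceph osd pool create data_ssd 32 32 replicated replicated_ssd
ceph osd pool application enable data_hdd rbd
ceph osd pool application enable data_ssd rbd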
 
Hi @plastilin

I played with enabling LVM cache on the OSD logical volume, and it worked, but I did not do any performance comparisons. I don't believe I noticed a huge difference, so eventually I decided it's not worth having the extra fault domain. Not sure how it's different from dm-cache.
But below are my notes in case you want to try (/dev/sdc was the SATA SSD):

Code:
# Add the SATA SSD as a physical volume and extend the OSD's volume group with it
pvcreate /dev/sdc

VG=ceph-6e0a76c7-9c67-4f4b-bd63-523dd2ce3e9e
LV=osd-block-67d5264a-0933-4946-bdac-fa9f4a23326d
vgextend $VG /dev/sdc

# Cache-pool metadata and data LVs, both placed on the SSD
lvcreate -L 130M -n cache_meta $VG /dev/sdc
lvcreate -L 130G -n cache $VG /dev/sdc

# Combine them into a writeback cache pool
lvconvert --type cache-pool --cachemode writeback --poolmetadata $VG/cache_meta $VG/cache

# Attach the cache pool to the OSD's block LV
lvconvert --type cache --cachepool $VG/cache $VG/$LV

The benefit to me was that it was possible to remove the cache if needed:

Code:
lvconvert --uncache $VG/$LV
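
For what it's worth, hit/miss statistics for the cached LV can be read back through the LVM reporting fields (field names as listed by lvs -o help; a sketch, not measured data):

Code:
lvs -a -o lv_name,cache_total_blocks,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses $VG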
 
Just saying, Red Hat says it's deprecated. Yes, it's Red Hat, but this usually also applies to Ceph upstream (Red Hat's Ceph development always goes upstream). https://access.redhat.com/documenta...2/html/release_notes/deprecated-functionality
 
Here is the notice in the Ceph documentation directly:
https://docs.ceph.com/en/latest/rados/operations/cache-tiering/

Even if it wasn't deprecated, this may give you pause:
"A WORD OF CAUTION
Cache tiering will degrade performance for most workloads. Users should use extreme caution before using this feature."
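
And if you decide to back out of the writeback tier created at the start of this thread, the upstream docs describe roughly this teardown sequence (pool names data/cache as above; flushing can take a while):

Code:
# Stop promoting new objects into the cache
ceph osd tier cache-mode cache proxy

# Flush and evict everything still sitting in the cache pool
rados -p cache cache-flush-evict-all

# Detach the overlay and remove the tier relationship
ceph osd tier remove-overlay data
ceph osd tier remove data cache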