Ceph tier cache question

plastilin

Hi all. I have the following configuration: 3 nodes, each with 4 × 6 TB disks and 1 × 1 TB NVMe disk, for a total of 5 OSDs per node. I decided to try the Ceph cache tiering functionality with the following steps:

Code:
ceph osd pool create data 128 128
ceph osd pool create cache 128 128
ceph osd tier add data cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay data cache
ceph osd pool set cache target_max_bytes 100G
ceph osd pool set cache hit_set_type explicit_hash
ceph osd pool set cache hit_set_count 8
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache min_read_recency_for_promote 1
ceph osd pool set cache min_write_recency_for_promote 1
ceph osd pool application enable data rbd
pvesm add rbd ceph --monhost "10.50.250.1 10.50.250.2 10.50.250.3" --pool data --content images
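
(For reference, whether the tiering and overlay actually took effect can be checked afterwards with something like the command below; the pool lines should list the tier, overlay and cache_mode settings, though the exact output varies by Ceph release.)

Code:
ceph osd pool ls detail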

Will this have any effect? Or are all these steps useless? Possibly a misconfiguration?
 
Thanks for the answer. A few follow-up questions:

1. Can I split the existing NVMe into an equal number of partitions, for example 4 partitions of 250 GB each per node, and point the OSDs at these partitions for RocksDB and WAL when creating them?

2. Is it possible to use a single NVMe partition for both types of metadata, or does each type need its own partition?

3. How much metadata space is needed per 6 TB OSD?

4. Also, if I understand correctly: if one of the NVMe drives fails, I effectively lose the whole node it lives on until I replace it and recreate the OSDs?

Thank you
 
Thanks for the answer. A few follow-up questions:

1. Can I split the existing NVMe into an equal number of partitions, for example 4 partitions of 250 GB each per node, and point the OSDs at these partitions for RocksDB and WAL when creating them?
Yes. If you define a DB/WAL size during OSD creation, Proxmox automatically creates a partition of that size. I would specify the size manually: if you don't, Proxmox takes a percentage of the disk, which can leave you without enough space to back all OSDs with a DB/WAL device.
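
For illustration, a manually sized DB partition at OSD creation could look roughly like the following; the device names are placeholders for this thread's hardware, and the exact option names should be double-checked against man pveceph for your PVE version:

Code:
# hypothetical devices: /dev/sdb = one of the 6 TB HDDs, /dev/nvme0n1 = the 1 TB NVMe
# the size is given in GiB; the WAL lands on the same device unless a separate --wal_dev is given
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_dev_size 240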

2. Is it possible to use a single NVMe partition for both types of metadata, or does each type need its own partition?
Yes. If you have a DB disk, the WAL is automatically stored on it unless you explicitly tell it otherwise.

The WAL is placed with the DB, if not specified separately. https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pve_ceph_osds

3. How much metadata space is needed per 6 TB OSD?
4% of the gross capacity: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing --> 240 GB
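
Worked out for this setup: 4% of 6 TB is roughly 240 GB per OSD, so the four HDD OSDs on one node would need about 4 × 240 GB ≈ 960 GB of DB/WAL space, which only just fits on the 1 TB NVMe.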

4. Also, if I understand correctly: if one of the NVMe drives fails, I effectively lose the whole node it lives on until I replace it and recreate the OSDs?

Yes, all OSDs that use a DB/WAL device go down if you lose that device. If there is one DB/WAL device per host and all OSDs are backed by it, losing that NVMe takes down every OSD on the node, i.e. the node is effectively down. If you then lose another disk on a different host, data may become unavailable because min_size is no longer reached on a 3-node system.
 
And what about dm-cache, which, as I understand it, came along to replace the tier cache? Does it make sense to use it?

Suppose that I have 4 HDDs of 4 TB with the names /dev/sda, /dev/sdb, /dev/sdc and /dev/sdd, 1 SSD of 2 TB with the name /dev/sde and another SSD of 240 GB with name /dev/sdf.

To create a dm-cache, it is recommended to use 1% of the HDD size for the metadata and 10% of the HDD size for the cache.

Create physical volumes on all drives using the command:
pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

Create a volume group named ceph-vg on 4 HDDs using the command:
vgcreate ceph-vg /dev/sda /dev/sdb /dev/sdc /dev/sdd

Create a volume group named cache-vg on 2 SSDs using the command:
vgcreate cache-vg /dev/sde /dev/sdf

Create a logical volume named data-lv spanning the 4 HDDs (using all free space in the volume group, since 4 × 4 TB drives provide slightly less than 16 TiB of usable capacity) using the command:
lvcreate -n data-lv -l 100%FREE ceph-vg

Create a logical volume named cache-lv on the 2 TB SSD using the command:
lvcreate -n cache-lv -l 100%PVS cache-vg /dev/sde

Create a logical volume named meta-lv on the 240 GB SSD using the command:
lvcreate -n meta-lv -l 100%PVS cache-vg /dev/sdf

Create a dm-cache hybrid volume named ceph-cache with writeback mode and the smq caching policy (the default) using the command:
dmsetup create ceph-cache --table "0 $(blockdev --getsz /dev/ceph-vg/data-lv) cache /dev/cache-vg/meta-lv /dev/cache-vg/cache-lv /dev/ceph-vg/data-lv 512 1 writeback default 0"

Create a Ceph OSD on the dm-cache hybrid volume using the command:
pveceph createosd /dev/mapper/ceph-cache
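
If you try this, keep in mind that a table loaded with dmsetup is not persistent across reboots, so the mapping would have to be recreated (or scripted) before the OSD can start. The state of the cache can be inspected with the usual device-mapper tools, for example:

Code:
dmsetup table ceph-cache    # the cache table that is currently loaded
dmsetup status ceph-cache   # block usage, read/write hit and miss counters, dirty blocks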

Will it work? And how effective?
 
Will it work? And how effective?
It may work, but then you would have only one OSD in the host. And with all the extra layers, debugging or replacing a disk would be a nightmare.

Use the 240G SSD for the RocksDBs and WALs of the HDD-OSDs and the 2TB SSD as a single OSD. This way you can have HDD pools and SSD pools which will work for e.g. CephFS or RBD.
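
If you go that route, the HDD and SSD pools can be separated with CRUSH device-class rules, roughly like this (rule and pool names are only examples; "data" is the pool from the first post, "fast" is hypothetical):

Code:
# one replicated rule per device class
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd

# pin pools to the matching rule
ceph osd pool set data crush_rule replicated_hdd
ceph osd pool set fast crush_rule replicated_ssd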
 
Hi @plastilin

I played with enabling LVM cache on the OSD logical volume and it worked, but I did not do any performance comparisons. I don't believe I noticed a huge difference, so eventually I decided the extra fault domain isn't worth it. I'm not sure how it differs from dm-cache.
But below are my notes in case you want to try (/dev/sdc was the SATA SSD):

Code:
# add the SSD as a physical volume
pvcreate /dev/sdc

# VG/LV names of the existing OSD (as created by ceph-volume)
VG=ceph-6e0a76c7-9c67-4f4b-bd63-523dd2ce3e9e
LV=osd-block-67d5264a-0933-4946-bdac-fa9f4a23326d

# extend the OSD's VG onto the SSD and carve out cache data + metadata LVs
vgextend $VG /dev/sdc
lvcreate -L 130M -n cache_meta $VG /dev/sdc
lvcreate -L 130G -n cache $VG /dev/sdc

# combine them into a writeback cache pool and attach it to the OSD LV
lvconvert --type cache-pool --cachemode writeback --poolmetadata $VG/cache_meta $VG/cache
lvconvert --type cache --cachepool $VG/cache $VG/$LV

The benefit to me was that it was possible to remove the cache if needed:

Code:
lvconvert --uncache $VG/$LV
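
In case it helps, whether the cache actually attached (and which devices back the cache pool) can be checked with something like:

Code:
lvs -a -o name,size,segtype,devices $VG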
 
Just saying, Red Hat says it's deprecated. Yes, it's Red Hat, but this usually also applies to Ceph upstream (Red Hat Ceph development always goes upstream). https://access.redhat.com/documenta...2/html/release_notes/deprecated-functionality
 
Here is the notice in the Ceph documentation directly:
https://docs.ceph.com/en/latest/rados/operations/cache-tiering/

Even if it weren't deprecated, this may give you pause:
"A WORD OF CAUTION
Cache tiering will degrade performance for most workloads. Users should use extreme caution before using this feature."
 
