Ceph tier cache question

plastilin

Renowned Member
Hi all. I have the following configuration: 3 nodes, each with four 6 TB disks and one 1 TB NVMe disk, for a total of 5 OSDs per node. I decided to enable Ceph's cache tiering functionality with the following steps:

Code:
ceph osd pool create data 128 128
ceph osd pool create cache 128 128
ceph osd tier add data cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay data cache
ceph osd pool set cache target_max_bytes 100G
ceph osd pool set cache hit_set_type explicit_hash
ceph osd pool set cache hit_set_count 8
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache min_read_recency_for_promote 1
ceph osd pool set cache min_write_recency_for_promote 1
ceph osd pool application enable data rbd
pvesm add rbd ceph --monhost "10.50.250.1 10.50.250.2 10.50.250.3" --pool data --content images

Will this have any effect, or are all these steps useless? Is this possibly a misconfiguration?
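
(For reference, whether the tier and overlay were actually applied can be checked with standard Ceph commands; pool names as above. This is just a sanity check, not part of the setup itself.)

Code:
# Shows tier_of, cache_mode and hit_set settings for the cache pool
ceph osd pool ls detail

# Shows per-pool usage, so you can see whether objects land in the cache pool under load
ceph df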
 
Thanks for the answer. That raises some questions.

1. Can I split the existing NVMe into an equal number of partitions, for example four partitions of 250 GB each per node, and specify these partitions when creating the OSDs to store the RocksDB and WAL?

2. Is it possible to specify one NVMe partition for both types of metadata, or does each type of metadata need its own partition?

3. How much metadata space is needed for a single 6 TB OSD?

4. Also, if I understand correctly, losing one of the NVMe drives effectively means losing the whole node it sits in, until I replace it and recreate the OSDs?

Thank you
 
Thanks for the answer. That raises some questions.

1. Can I split the existing NVMe into an equal number of partitions, for example four partitions of 250 GB each per node, and specify these partitions when creating the OSDs to store the RocksDB and WAL?
Yes. By defining a DB/WAL size at OSD creation, Proxmox automatically creates a partition of that size. I would specify the size manually: if you don't, Proxmox takes a percentage of the disk, which might leave you with not all OSDs backed by the DB/WAL device because it runs out of space.
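
A minimal sketch of what that looks like on the command line, assuming the HDD is /dev/sdb and the NVMe is /dev/nvme0n1 (check the pveceph man page for the exact option names on your version; sizes are given in GiB):

Code:
# Create an OSD on the HDD with an explicitly sized RocksDB (and WAL) partition on the NVMe
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_size 250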

2. Is it possible to specify one NVMe partition for both types of metadata, or does each type of metadata need its own partition?
Yes. If you have a DB disk, the WAL is automatically stored on the DB disk unless you explicitly tell it otherwise.

The WAL is placed with the DB, if not specified separately. https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pve_ceph_osds

3. How much metadata space is needed for a single 6 TB OSD?
Roughly 4% of gross capacity: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing --> about 240 GB for a 6 TB OSD. Note that four such OSDs would need around 960 GB, i.e. essentially the whole 1 TB NVMe.

4. Also, if I understand correctly, losing one of the NVMe drives effectively means losing the whole node it sits in, until I replace it and recreate the OSDs?

Yes, all OSDs that use a DB/WAL device are down if you lose that device. If you have one DB/WAL device per host and all OSDs are backed by it, losing that NVMe means the complete node is down. If you then lose another disk on a different host, you might have unavailable data because min_size is not reached on a 3-node system.
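
To see which OSDs sit on a given DB/WAL device (and would therefore go down with it), the device-to-daemon mapping can be inspected; a quick sketch using standard Ceph commands, exact output fields vary by release:

Code:
# List known physical devices and the OSD daemons that use them
ceph device ls

# Per-OSD view: dump one OSD's metadata and look for its DB/WAL backing devices
ceph osd metadata 0 | grep -i -E 'device|db|wal'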
 
And what about dm-cache, which, as I understand it, came as a replacement for the tier cache? Does it make sense to use it?

Suppose that I have 4 HDDs of 4 TB with the names /dev/sda, /dev/sdb, /dev/sdc and /dev/sdd, 1 SSD of 2 TB with the name /dev/sde and another SSD of 240 GB with name /dev/sdf.

To create a dm-cache, it is recommended to use about 1% of the HDD size for metadata and 10% of the HDD size for the cache; with 16 TB of HDD that works out to roughly 160 GB of metadata and 1.6 TB of cache, which matches the 240 GB and 2 TB SSDs.

Create physical volumes on all drives using the command:
pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

Create a volume group named ceph-vg on 4 HDDs using the command:
vgcreate ceph-vg /dev/sda /dev/sdb /dev/sdc /dev/sdd

Create a volume group named cache-vg on 2 SSDs using the command:
vgcreate cache-vg /dev/sde /dev/sdf

Create a logical volume named data-lv on 4 HDDs with a size of 16 TB (4 TB per drive) using the command:
lvcreate -n data-lv -L 16T ceph-vg

Create a logical volume named cache-lv on 1 SSD with a size of 2 TB using the command:
lvcreate -n cache-lv -L 2T cache-vg

Create a logical volume named meta-lv on a second SSD with a size of 240 GB using the command:
lvcreate -n meta-lv -L 240G cache-vg

Create a dm-cache hybrid volume named ceph-cache with writeback mode and the default smq caching policy using the command:
dmsetup create ceph-cache --table "0 $(blockdev --getsize /dev/ceph-vg/data-lv) cache /dev/cache-vg/meta-lv /dev/cache-vg/cache-lv /dev/ceph-vg/data-lv 512 1 writeback default 0"

Create the Ceph OSD on the dm-cache hybrid volume using the command:
pveceph createosd /dev/mapper/ceph-cache
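
Before putting an OSD on top, the resulting dm-cache target can at least be sanity-checked with the standard device-mapper tools (same hypothetical device names as above):

Code:
# Show the loaded table: metadata dev, cache dev, origin dev, block size, policy
dmsetup table ceph-cache

# Show runtime state: used/total cache blocks, read/write hits and misses, dirty blocks
dmsetup status ceph-cache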

Will it work? And how effective would it be?
 
Will it work? And how effective would it be?
It may work, but then you would have only one OSD in the host. And with all the extra layers, debugging or replacing a disk would be a nightmare.

Use the 240 GB SSD for the RocksDBs and WALs of the HDD OSDs and the 2 TB SSD as a single OSD. This way you can have HDD pools and SSD pools, which will work for e.g. CephFS or RBD.
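
For reference, a sketch of how such separate HDD and SSD pools can be built on device classes (pool names and PG counts here are only examples):

Code:
# Replicated CRUSH rules restricted to a device class
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd

# Pools pinned to those rules
ceph osd pool create data_hdd 128 128 replicated replicated_hdd
ceph osd pool create data_ssd 32 32 replicated replicated_ssd
ceph osd pool application enable data_hdd rbd
ceph osd pool application enable data_ssd rbd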
 
Hi @plastilin

I played with enabling LVM cache on the OSD logical volume, and it worked, but I did not do any performance comparisons. I don't believe I noticed a huge difference, so eventually I decided it's not worth having the extra fault domain. Not sure how it's different from dm-cache.
But below are my notes in case you want to try (/dev/sdc was the SATA SSD):

Code:
# Add the SATA SSD as a physical volume and extend the OSD's volume group with it
pvcreate /dev/sdc

VG=ceph-6e0a76c7-9c67-4f4b-bd63-523dd2ce3e9e
LV=osd-block-67d5264a-0933-4946-bdac-fa9f4a23326d
vgextend $VG /dev/sdc

# Cache-pool metadata and data LVs, both placed on the SSD
lvcreate -L 130M -n cache_meta $VG /dev/sdc
lvcreate -L 130G -n cache $VG /dev/sdc

# Combine them into a writeback cache pool
lvconvert --type cache-pool --cachemode writeback --poolmetadata $VG/cache_meta $VG/cache

# Attach the cache pool to the OSD's block LV
lvconvert --type cache --cachepool $VG/cache $VG/$LV

The benefit to me was that it was possible to remove the cache if needed:

Code:
lvconvert --uncache $VG/$LV
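
For what it's worth, hit/miss statistics for the cached LV can be read back through the LVM reporting fields (field names as listed by lvs -o help; a sketch, not measured data):

Code:
lvs -a -o lv_name,cache_total_blocks,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses $VG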
 
Just saying, Red Hat says it's deprecated. Yes, it's Red Hat, but this usually also applies to Ceph upstream (Red Hat's Ceph development always goes upstream). https://access.redhat.com/documenta...2/html/release_notes/deprecated-functionality
 
Here is the notice in the Ceph documentation directly:
https://docs.ceph.com/en/latest/rados/operations/cache-tiering/

Even if it wasn't deprecated, this may give you pause:
"A WORD OF CAUTION
Cache tiering will degrade performance for most workloads. Users should use extreme caution before using this feature."
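
And if you decide to back out of the writeback tier created at the start of this thread, the upstream docs describe roughly this teardown sequence (pool names data/cache as above; flushing can take a while):

Code:
# Stop promoting new objects into the cache
ceph osd tier cache-mode cache proxy

# Flush and evict everything still sitting in the cache pool
rados -p cache cache-flush-evict-all

# Detach the overlay and remove the tier relationship
ceph osd tier remove-overlay data
ceph osd tier remove data cache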