Ceph OSD block.db on NVMe / Sizing recommendations and usage

herzkerl

Member
Mar 18, 2021
Dear community,

The HDD pool on our 3-node Ceph cluster was quite slow, so we recreated the OSDs with block.db on NVMe drives (enterprise Samsung PM983/PM9A3).

The Ceph documentation recommends sizing block.db at 1% to 4% of the 'block' size (more for RGW workloads):
It is generally recommended that the size of block.db be somewhere between 1% and 4% of the size of block. For RGW workloads, it is recommended that the block.db be at least 4% of the block size, because RGW makes heavy use of block.db to store metadata (in particular, omap keys). For example, if the block size is 1TB, then block.db should have a size of at least 40GB. For RBD workloads, however, block.db usually needs no more than 1% to 2% of the blocksize.

Our block.db volumes are either 3.43% or around 6% of the block size (depending on when the OSDs were created, we used a different calculation for assigning NVMe space per HDD).
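To put the percentage rule into numbers, here is a quick back-of-the-envelope calculation (plain Python; the drive sizes are just example values roughly matching our HDDs, not a recommendation):

```python
# Back-of-the-envelope numbers for the documentation's percentage rule.
# The drive sizes below are example values only.
TIB = 1024**4
GIB = 1024**3

hdd_sizes_tib = [12.73, 16.37]   # example 'block' (HDD) sizes in TiB
ratios = [0.01, 0.02, 0.04]      # 1%/2% (RBD guidance), 4% (RGW guidance)

for size_tib in hdd_sizes_tib:
    block_bytes = size_tib * TIB
    cells = ", ".join(f"{r:.0%}: {r * block_bytes / GIB:.0f} GiB" for r in ratios)
    print(f"block = {size_tib} TiB  ->  block.db at {cells}")
```

So on drives of that size the 4% rule already means several hundred GiB of NVMe per OSD.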

We've been running that setup for a few months, but our monitoring still only shows 1.5% to 3.1% usage of the block.db 'device' (i.e. the NVMe LVs):

[Screenshot: block.db usage, 2023-09-11 17:05]

The HDDs themselves are about 50% full:

[Screenshot: HDD usage, 2023-09-11 17:07]

Is that to be expected? I would have thought the DB usage would be a lot higher across all the NVMe devices.
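In case it is useful for comparison, the raw per-OSD numbers can also be read straight from the OSD admin socket, without the monitoring stack. A minimal sketch, assuming the `perf dump` output still exposes the bluefs counters `db_total_bytes`, `db_used_bytes` and `slow_used_bytes` (run it on the node that hosts the OSD):

```python
#!/usr/bin/env python3
# Minimal sketch: read block.db usage for one OSD from its admin socket.
# Assumes the bluefs counters db_total_bytes / db_used_bytes / slow_used_bytes
# are present in 'ceph daemon osd.<id> perf dump' output.
import json
import subprocess
import sys

osd_id = sys.argv[1] if len(sys.argv) > 1 else "0"
out = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
bluefs = json.loads(out)["bluefs"]

total = bluefs["db_total_bytes"]
used = bluefs["db_used_bytes"]
spilled = bluefs.get("slow_used_bytes", 0)   # DB data spilled over to the HDD

print(f"osd.{osd_id}: block.db {used / 2**30:.1f} GiB used of {total / 2**30:.1f} GiB "
      f"({100 * used / total:.1f}%), spillover to HDD: {spilled / 2**30:.1f} GiB")
```

If I remember correctly, `ceph osd df tree` shows roughly the same numbers in its META column.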

Thanks in advance!
 
Yes, that is expected in a pure RBD / CephFS cluster. With S3 (rados-gateway) the usage of the RocksDB is way higher.
The 4% recommendation in the Ceph documentation is very old, from the days when this was new and people had no experience with it.
I usually recommend 70GB for the RocksDB per OSD for an RBD / CephFS cluster and around 300GB if S3 is involved.
Newer Ceph versions also handle the different levels of the RocksDB more efficiently.
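Some background on why the sizes in between used to be wasted (my understanding of the commonly cited RocksDB defaults, not official guidance): a level only benefits from the fast device if it fits there completely, and with a base level size of 256 MiB and a multiplier of 10 the cumulative level sizes land at roughly 3, 30 and 300 GB. A small sketch of that math, under those assumed defaults:

```python
# Sketch of the legacy RocksDB level-size reasoning behind the old
# ~3/30/300 GB rules of thumb. The base size and multiplier are the
# commonly cited defaults (max_bytes_for_level_base = 256 MiB,
# max_bytes_for_level_multiplier = 10); treat them as assumptions.
GIB = 1024**3
base = 256 * 1024**2   # target size of L1
multiplier = 10

cumulative = 0
for level in range(1, 5):
    cumulative += base * multiplier ** (level - 1)
    print(f"holding L1..L{level} entirely on the DB device needs ~{cumulative / GIB:.2f} GiB")
```

The 70 GB / 300 GB figures follow roughly from that: ~70 GB leaves room for everything up to L3 (~28 GiB) plus WAL and compaction headroom, which is enough for RBD/CephFS, while omap-heavy RGW data can push into L4, hence ~300 GB. As far as I know, newer releases enable dynamic level sizing in RocksDB, which is why the in-between sizes are no longer simply wasted.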
 
Yes, that is expected in a pure RBD / CephFS cluster. With S3 (rados-gateway) the usage of the RocksDB is way higher.
We've also been using RadosGW for almost two months now. The default RGW pool is around 70 TB in size (replica=3).

The 4% recommendation in the Ceph documentation is very old, from the days when this was new and people had no experience with it.
Meaning that 4-6% of e.g. 16.37 TiB or 12.73 TiB would be way too much in both cases (RBD and RGW workloads)?

I usually recommend 70GB for the RocksDB per OSD for an RBD / CephFS cluster and around 300GB if S3 is involved.
Newer Ceph versions also handle the different levels of the RocksDB more efficiently.
Regardless of the size of the block device/HDD?
 
