Ceph OSD block.db on NVMe / Sizing recommendations and usage

herzkerl

Member
Mar 18, 2021
Dear community,

The HDD pool on our 3-node Ceph cluster was quite slow, so we recreated the OSDs with block.db on NVMe drives (enterprise Samsung PM983/PM9A3).

The Ceph documentation recommends sizing block.db at 1% to 4% of the 'block' size (more for RGW workloads):
It is generally recommended that the size of block.db be somewhere between 1% and 4% of the size of block. For RGW workloads, it is recommended that the block.db be at least 4% of the block size, because RGW makes heavy use of block.db to store metadata (in particular, omap keys). For example, if the block size is 1TB, then block.db should have a size of at least 40GB. For RBD workloads, however, block.db usually needs no more than 1% to 2% of the blocksize.

Our block.db volumes are either 3.43% or around 6% of the block size (depending on when the OSDs were created, we used a different calculation for assigning NVMe space per HDD).
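To put the percentage rule into numbers, here is a quick back-of-the-envelope calculation (plain Python; the drive sizes are just example values roughly matching our HDDs, not a recommendation):

```python
# Back-of-the-envelope numbers for the documentation's percentage rule.
# The drive sizes below are example values only.
TIB = 1024**4
GIB = 1024**3

hdd_sizes_tib = [12.73, 16.37]   # example 'block' (HDD) sizes in TiB
ratios = [0.01, 0.02, 0.04]      # 1%/2% (RBD guidance), 4% (RGW guidance)

for size_tib in hdd_sizes_tib:
    block_bytes = size_tib * TIB
    cells = ", ".join(f"{r:.0%}: {r * block_bytes / GIB:.0f} GiB" for r in ratios)
    print(f"block = {size_tib} TiB  ->  block.db at {cells}")
```

So on drives of that size the 4% rule already means several hundred GiB of NVMe per OSD.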

We've been running that setup for a few months, but our monitoring still only shows 1.5% to 3.1% usage of the block.db 'device' (i.e. the NVMe LVs):

[Screenshot: block.db usage, 2023-09-11 17:05]

The HDDs themselves are about 50% full:

[Screenshot: HDD usage, 2023-09-11 17:07]

Is that to be expected? I would have thought the DB usage would be a lot higher across all the NVMe devices.
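In case it is useful for comparison, the raw per-OSD numbers can also be read straight from the OSD admin socket, without the monitoring stack. A minimal sketch, assuming the `perf dump` output still exposes the bluefs counters `db_total_bytes`, `db_used_bytes` and `slow_used_bytes` (run it on the node that hosts the OSD):

```python
#!/usr/bin/env python3
# Minimal sketch: read block.db usage for one OSD from its admin socket.
# Assumes the bluefs counters db_total_bytes / db_used_bytes / slow_used_bytes
# are present in 'ceph daemon osd.<id> perf dump' output.
import json
import subprocess
import sys

osd_id = sys.argv[1] if len(sys.argv) > 1 else "0"
out = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
bluefs = json.loads(out)["bluefs"]

total = bluefs["db_total_bytes"]
used = bluefs["db_used_bytes"]
spilled = bluefs.get("slow_used_bytes", 0)   # DB data spilled over to the HDD

print(f"osd.{osd_id}: block.db {used / 2**30:.1f} GiB used of {total / 2**30:.1f} GiB "
      f"({100 * used / total:.1f}%), spillover to HDD: {spilled / 2**30:.1f} GiB")
```

If I remember correctly, `ceph osd df tree` shows roughly the same numbers in its META column.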

Thanks in advance!
 
Yes, that is expected in a pure RBD / CephFS cluster. With S3 (rados-gateway) the usage of the RocksDB is way higher.
The 4% recommendation in the Ceph documentation is very old, from the days when this was new and people had no experience with it.
I usually recommend 70GB for the RocksDB per OSD for an RBD / CephFS cluster and around 300GB if S3 is involved.
Newer Ceph versions also handle the different levels of the RocksDB more efficiently.
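Some background on why the sizes in between used to be wasted (my understanding of the commonly cited RocksDB defaults, not official guidance): a level only benefits from the fast device if it fits there completely, and with a base level size of 256 MiB and a multiplier of 10 the cumulative level sizes land at roughly 3, 30 and 300 GB. A small sketch of that math, under those assumed defaults:

```python
# Sketch of the legacy RocksDB level-size reasoning behind the old
# ~3/30/300 GB rules of thumb. The base size and multiplier are the
# commonly cited defaults (max_bytes_for_level_base = 256 MiB,
# max_bytes_for_level_multiplier = 10); treat them as assumptions.
GIB = 1024**3
base = 256 * 1024**2   # target size of L1
multiplier = 10

cumulative = 0
for level in range(1, 5):
    cumulative += base * multiplier ** (level - 1)
    print(f"holding L1..L{level} entirely on the DB device needs ~{cumulative / GIB:.2f} GiB")
```

The 70 GB / 300 GB figures follow roughly from that: ~70 GB leaves room for everything up to L3 (~28 GiB) plus WAL and compaction headroom, which is enough for RBD/CephFS, while omap-heavy RGW data can push into L4, hence ~300 GB. As far as I know, newer releases enable dynamic level sizing in RocksDB, which is why the in-between sizes are no longer simply wasted.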
 
Yes, that is expected in a pure RBD / CephFS cluster. With S3 (rados-gateway) the usage of the RocksDB is way higher.
We've also been using RadosGW for almost two months now. The default RGW pool is around 70 TB in size (replica=3).

The 4% recommendation in the Ceph documentation is very old, from the days when this was new and people had no experience with it.
Meaning that 4-6% of e.g. 16.37 TiB or 12.73 TiB would be way too much in both cases (RBD and RGW workloads)?

I usually recommend 70GB for the RocksDB per OSD for an RBD / CephFS cluster and around 300GB if S3 is involved.
Newer Ceph versions also handle the different levels of the RocksDB more efficiently.
Regardless of the size of the block device/HDD?
 
