[SOLVED] Erasure Code Pool WAL and RocksDB usage

networkguy3

New Member
Aug 8, 2025
I have a k=4 m=2 erasure code pool on a single host with six 6 TB SAS drives in a Dell R730. An NVMe drive (Intel DC P4510) is split into DB and WAL partitions for each OSD. I am using this pool with CephFS. Watching iostat, I see writes going to the NVMe but no reads at all, even while running fio benchmarks. Is this the expected behavior with erasure code pools?

`ceph-volume lvm list` shows for each OSD:
[block] /dev/ceph-724f1e31-6a08-4643-a54d-f37d12766ff3/osd-block-9376a70d-ccf4-439a-b38e-2c9ed14ac771

block device /dev/ceph-724f1e31-6a08-4643-a54d-f37d12766ff3/osd-block-9376a70d-ccf4-439a-b38e-2c9ed14ac771
block uuid FrLATf-TUed-DWUL-C4CC-kscp-2jdE-HAIAoK
cephx lockbox secret
cluster fsid 66adcd3d-b086-4186-a577-e628abd1e899
cluster name ceph
crush device class hdd
db device /dev/cache/osd9-db
db uuid udiu7E-9Tp0-Mb91-rQ4A-fBY9-O0Mk-fHf7u1
encrypted 0
osd fsid 9376a70d-ccf4-439a-b38e-2c9ed14ac771
osd id 9
osdspec affinity
type block
vdo 0
wal device /dev/cache/osd9-wal
wal uuid 9uzxhC-7DJD-rSdx-sVFf-NIPN-DnYF-IkXzY1
with tpm 0
devices /dev/sdd
 
This is expected behavior. The reason no reads appear on the NVMe is due to how BlueStore's WAL and RocksDB DB devices work:
  1. RocksDB WAL (Write-Ahead Log) is write-only during normal operation. It's a sequential journal for crash consistency. It is only read during OSD startup/recovery to replay uncommitted transactions.
  2. RocksDB DB stores BlueStore metadata (object mapping, allocation bitmaps, etc.). RocksDB's read path checks in order: memtable (in-memory write buffer) -> immutable memtables (awaiting flush) -> block cache (LRU cache of uncompressed SST blocks in RAM) -> SST files on disk. Disk reads hit the DB device whenever the block cache misses. In your benchmark you're likely seeing zero NVMe reads because fio is writing sequentially to a small number of objects, so the metadata working set fits entirely in the memtables and block cache. With a larger dataset, more objects, or random I/O across many objects, the DB device will see read I/O as block cache misses force RocksDB to fetch SST files from the NVMe.
  3. Erasure coding does not change this behavior. EC affects how data chunks are distributed across OSDs, but the WAL/DB I/O pattern per OSD remains the same as with replicated pools.
  4. Data reads go directly to the main block device (`/dev/sdX`), bypassing WAL and DB entirely. The NVMe only handles metadata writes (DB) and journal writes (WAL).
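You can confirm this from the OSD itself rather than inferring it from iostat: BlueFS keeps per-device perf counters showing how many bytes went to the WAL and DB devices. A quick sketch, assuming osd.9 from the listing above and an admin socket reachable on the host (exact counter names vary between Ceph releases):

```shell
# Dump the BlueFS section of the OSD perf counters: bytes written to the
# WAL/DB devices, compaction traffic, and reads served from the DB device.
ceph daemon osd.9 perf dump | jq '.bluefs'

# Double-check which physical devices back block/DB/WAL for this OSD.
ceph osd metadata 9 | jq '{devices, bluefs_db_devices, bluefs_wal_devices}'
```

If the read counters in the `bluefs` section stay flat during your fio run while the write counters climb, that matches the explanation above.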

Here's the I/O pattern summary:

| NVMe role | Writes | Reads |
|---|---|---|
| WAL | Yes (small/deferred writes journaled) | Only during OSD recovery/startup |
| DB (RocksDB) | Yes (metadata updates) | Yes, on block cache miss (depends on working set vs. cache size) |
| Main block (SAS) | Yes (all object data; small writes land here after the WAL) | Yes (all data reads) |
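If you want to actually see DB reads show up, run a cold random-read workload spread over many small files on the CephFS mount while watching the NVMe. A sketch, with example device names, mount point, and sizes (requires fio and sysstat):

```shell
# Terminal 1: watch the NVMe and one of the HDDs for read activity.
# Substitute your own device names.
iostat -x 2 nvme0n1 sdd

# Terminal 2: random 4k reads over many files, so object metadata lookups
# overflow RocksDB's block cache and force SST reads from the DB device.
fio --name=randread --directory=/mnt/cephfs/fio-test \
    --rw=randread --bs=4k --direct=1 \
    --nrfiles=1000 --size=64g --numjobs=4 \
    --ioengine=libaio --time_based --runtime=120 --group_reporting
```

For a truly cold cache, restart the OSDs first (e.g. `systemctl restart ceph-osd@9`): RocksDB's block cache lives inside the OSD process, so dropping the kernel page cache alone won't empty it.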
 
Just curious: is this a test setup?
No, it was intended as long-term storage for seldom-used data, which I back up daily with PBS. I understand that the pool is completely dependent on a single host. I have also found that small-block (4k) write performance is pretty terrible. I am planning to rebuild the pool with 2 more drives as a k=2 m=2 pool to try to reduce the overhead of small-block writes. Read performance is significantly better, but the OSD daemons do hit the CPU pretty hard.
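Worth noting the space trade-off before rebuilding: raw overhead of an EC profile is (k+m)/k, so k=2 m=2 costs as much raw space as 2x replication. A sketch of the math, plus the rebuild commands (pool and profile names are examples; run the ceph commands against your cluster):

```shell
# Raw bytes stored per logical byte is (k+m)/k.
awk 'BEGIN { printf "k=4 m=2 overhead: %.1fx\n", (4+2)/4;
             printf "k=2 m=2 overhead: %.1fx\n", (2+2)/2 }'

# Rebuild sketch (names are examples):
#   ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=osd
#   ceph osd pool create cephfs_data_ec22 erasure ec22
#   ceph osd pool set cephfs_data_ec22 allow_ec_overwrites true  # required for CephFS data on EC
```

`crush-failure-domain=osd` is needed here because all OSDs live on one host; the default failure domain of `host` could never place 4 chunks.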
 
Then I would suggest using ZFS locally.

Ceph is a distributed storage system that works best with five nodes or more.
This is shared storage, so would you suggest mounting it in a VM and sharing it with Samba, or sharing it via SMB from ZFS itself?