[SOLVED] Erasure Code Pool WAL and RocksDB usage

networkguy3

New Member
Aug 8, 2025
I have a k=4 m=2 erasure code pool on a single host with six 6 TB SAS drives in a Dell R730. An NVMe drive (Intel DC P4510) is split into DB and WAL partitions for each OSD. I am using this pool with CephFS. Watching iostat, I see writes going to the NVMe but no reads at all, even while running fio benchmarks. Is this the expected behavior with erasure code pools?

`ceph-volume lvm list` shows for each OSD:
[block] /dev/ceph-724f1e31-6a08-4643-a54d-f37d12766ff3/osd-block-9376a70d-ccf4-439a-b38e-2c9ed14ac771

block device /dev/ceph-724f1e31-6a08-4643-a54d-f37d12766ff3/osd-block-9376a70d-ccf4-439a-b38e-2c9ed14ac771
block uuid FrLATf-TUed-DWUL-C4CC-kscp-2jdE-HAIAoK
cephx lockbox secret
cluster fsid 66adcd3d-b086-4186-a577-e628abd1e899
cluster name ceph
crush device class hdd
db device /dev/cache/osd9-db
db uuid udiu7E-9Tp0-Mb91-rQ4A-fBY9-O0Mk-fHf7u1
encrypted 0
osd fsid 9376a70d-ccf4-439a-b38e-2c9ed14ac771
osd id 9
osdspec affinity
type block
vdo 0
wal device /dev/cache/osd9-wal
wal uuid 9uzxhC-7DJD-rSdx-sVFf-NIPN-DnYF-IkXzY1
with tpm 0
devices /dev/sdd
 
This is expected behavior. The reason no reads appear on the NVMe is due to how BlueStore's WAL and RocksDB DB devices work:
  1. RocksDB WAL (Write-Ahead Log) is write-only during normal operation. It's a sequential journal for crash consistency. It is only read during OSD startup/recovery to replay uncommitted transactions.
  2. RocksDB DB stores BlueStore metadata (object mapping, allocation bitmaps, etc.). RocksDB's read path checks in order: memtable (in-memory write buffer) -> immutable memtables (awaiting flush) -> block cache (LRU cache of uncompressed SST blocks in RAM) -> SST files on disk. Disk reads hit the DB device whenever the block cache misses. In your benchmark you're likely seeing zero NVMe reads because fio is writing sequentially to a small number of objects, so the metadata working set fits entirely in the memtables and block cache. With a larger dataset, more objects, or random I/O across many objects, the DB device will see read I/O as block cache misses force RocksDB to fetch SST files from the NVMe.
  3. Erasure coding does not change this behavior. EC affects how data chunks are distributed across OSDs, but the WAL/DB I/O pattern per OSD remains the same as with replicated pools.
  4. Data reads go directly to the main block device (`/dev/sdX`), bypassing WAL and DB entirely. The NVMe only handles metadata writes (DB) and journal writes (WAL).
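You can confirm this from the OSD itself rather than inferring it from iostat: BlueFS keeps per-device perf counters showing how many bytes went to the WAL and DB devices. A quick sketch, assuming osd.9 from the listing above and an admin socket reachable on the host (exact counter names vary between Ceph releases):

```shell
# Dump the BlueFS section of the OSD perf counters: bytes written to the
# WAL/DB devices, compaction traffic, and reads served from the DB device.
ceph daemon osd.9 perf dump | jq '.bluefs'

# Double-check which physical devices back block/DB/WAL for this OSD.
ceph osd metadata 9 | jq '{devices, bluefs_db_devices, bluefs_wal_devices}'
```

If the read counters in the `bluefs` section stay flat during your fio run while the write counters climb, that matches the explanation above.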

Here's the I/O pattern summary:

| NVMe role | Writes | Reads |
|---|---|---|
| WAL | Yes (small/deferred writes journaled) | Only during OSD recovery/startup |
| DB (RocksDB) | Yes (metadata updates) | Yes, on block cache miss (depends on working set vs. cache size) |
| Main block (SAS) | Yes (all object data; small writes land here after the WAL) | Yes (all data reads) |
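If you want to actually see DB reads show up, run a cold random-read workload spread over many small files on the CephFS mount while watching the NVMe. A sketch, with example device names, mount point, and sizes (requires fio and sysstat):

```shell
# Terminal 1: watch the NVMe and one of the HDDs for read activity.
# Substitute your own device names.
iostat -x 2 nvme0n1 sdd

# Terminal 2: random 4k reads over many files, so object metadata lookups
# overflow RocksDB's block cache and force SST reads from the DB device.
fio --name=randread --directory=/mnt/cephfs/fio-test \
    --rw=randread --bs=4k --direct=1 \
    --nrfiles=1000 --size=64g --numjobs=4 \
    --ioengine=libaio --time_based --runtime=120 --group_reporting
```

For a truly cold cache, restart the OSDs first (e.g. `systemctl restart ceph-osd@9`): RocksDB's block cache lives inside the OSD process, so dropping the kernel page cache alone won't empty it.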
 
Just curious: is this a test setup?
No, it was intended as long-term storage for seldom-used data, which I back up daily with PBS. I understand that the pool is completely dependent on a single host. I have also found that small-block (4k) write performance is pretty terrible. I am planning to rebuild the pool with 2 more drives as a k=2 m=2 pool to try to reduce the overhead of small-block writes. Read performance is significantly better, but the OSD daemons do hit the CPU pretty hard.
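Worth noting the space trade-off before rebuilding: raw overhead of an EC profile is (k+m)/k, so k=2 m=2 costs as much raw space as 2x replication. A sketch of the math, plus the rebuild commands (pool and profile names are examples; run the ceph commands against your cluster):

```shell
# Raw bytes stored per logical byte is (k+m)/k.
awk 'BEGIN { printf "k=4 m=2 overhead: %.1fx\n", (4+2)/4;
             printf "k=2 m=2 overhead: %.1fx\n", (2+2)/2 }'

# Rebuild sketch (names are examples):
#   ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=osd
#   ceph osd pool create cephfs_data_ec22 erasure ec22
#   ceph osd pool set cephfs_data_ec22 allow_ec_overwrites true  # required for CephFS data on EC
```

`crush-failure-domain=osd` is needed here because all OSDs live on one host; the default failure domain of `host` could never place 4 chunks.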
 
Then I would suggest using ZFS locally.

Ceph is a distributed storage system that works best with five nodes or more.
This is shared storage, so would you suggest mounting it in a VM and sharing it with Samba, or sharing it via SMB from ZFS itself?