CEPH Performance tuning with small block size

DynFi User
Hello,

We are looking for information on how to properly tune Ceph (or any other part of the system) in order to get the best possible performance out of a 4-node Ceph cluster equipped with NVMe disks. We see very poor performance with small block sizes, so we crafted a script to illustrate the ∆ between small-block and large-block writes:

Code:
Testing different block sizes with a file size of 1024 MB.
Block size (bytes) : Write Rate
        64 :     7 MB/s
       128 :   164 MB/s
       256 :   307 MB/s
       512 :   539 MB/s
      1024 :   882 MB/s
      2048 :     4 GB/s
      4096 :     9 GB/s
      8192 :     1 GB/s
     16384 :     2 GB/s
     32768 :     2 GB/s
     65536 :     2 GB/s
    131072 :     1 GB/s
    262144 :     1 GB/s
    524288 :     1 GB/s
   1048576 :     1 GB/s
Tests completed.

As you can see, we have a ∆ of more than 1000x if we write using a 4096k block size compared to writing with 64k.

Any ideas are welcome.
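For reference, the actual script was not included in the post; a minimal sketch of such a block-size sweep, assuming a plain dd loop with conv=fdatasync on a Ceph-backed mount (the path below is only an example), could look like this:

Code:
#!/bin/bash
# Sweep block sizes from 64 bytes to 1 MiB, writing a 1024 MB test file each time.
# conv=fdatasync makes dd flush the data to storage before it reports the rate,
# so the numbers reflect real writes and not just the page cache.
TESTFILE=/mnt/ceph-test/ddfile   # example path on a Ceph-backed mount
SIZE_MB=1024

for bs in 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576; do
    count=$(( SIZE_MB * 1024 * 1024 / bs ))
    printf '%10s : ' "$bs"
    dd if=/dev/zero of="$TESTFILE" bs="$bs" count="$count" conv=fdatasync 2>&1 \
        | grep -o '[0-9.,]* [kMG]B/s'
    rm -f "$TESTFILE"
done
echo "Tests completed."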
 
As you can see, we have a ∆ of more than 1000x if we write using a 4096k block size compared to writing with 64k.
That is normal if your NVMe namespaces are formatted as 4096, which is usually also their best "internal" performance. If every layer (NVMe internal, namespace format, Ceph defaults to 4096) uses the same block size, that already gives you the best performance.
So you are already sitting in the sweet spot.

Sure, in theory you could format the namespace to 64 bytes (I don't know if the NVMe really accepts that) and force Ceph down to 64; then you would have the best throughput for 64, but only for 64.
Also keep in mind that a smaller block size means more waste in the overall memory space. But usually you don't want that, and 4096 really is the sweet spot at the moment.

You cannot optimize for every block size; the best performance is always achieved by taking the fastest block size of the lowest device/layer and configuring everything built on top of it the same way: namespace 4k, filesystem 4k (Ceph/ZFS), Windows VM (NTFS 4k).
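If you want to check what your namespaces are actually formatted with, nvme-cli can show it. A rough sketch follows; the device name is just an example, and nvme format wipes the namespace, so treat this as illustration only:

Code:
# Show the LBA formats the namespace supports; the one marked "(in use)" is the
# current sector size (512 or 4096 bytes on most drives).
nvme id-ns /dev/nvme0n1 --human-readable | grep "LBA Format"

# Switching to the 4096-byte format (the lbaf index varies per drive - check the
# list above first). WARNING: this destroys all data on the namespace.
# nvme format /dev/nvme0n1 --lbaf=1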

As you can see, we have a ∆ of more than 1000x if we write using a 4096k block size compared to writing with 64k.
4096 bytes = 4k; and those are 64 bytes, not 64k.
 
I have a 3-node cluster with 3 NVMe Gen3 drives each - non-PLP (I know, this is pre-prod).
My plan is to eventually get PLP drives, and until then I just back up daily to local storage.

For the 3/2 RBD pool, each NVMe has one OSD.
Performance is like 10-20 MB/s writes - Win10, Async IO=threads, cache=none, discard/iothread=on.
Do you recommend using more than one OSD per NVMe?
 
I have a 3-node cluster with 3 NVMe Gen3 drives each - non-PLP (I know, this is pre-prod).
My plan is to eventually get PLP drives, and until then I just back up daily to local storage.

For the 3/2 RBD pool, each NVMe has one OSD.
Performance is like 10-20 MB/s writes - Win10, Async IO=threads, cache=none, discard/iothread=on.
Do you recommend using more than one OSD per NVMe?
More than one OSD per NVMe will not help. Non-PLP drives really do something like 500 IOPS for 4k sync writes vs. 20,000 IOPS for a PLP drive.
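If you want to see that gap on your own drives, a 4k sync-write test with fio shows it directly. A sketch, with the file path and size as placeholders (run it somewhere you don't care about the data):

Code:
# 4k synchronous writes at queue depth 1 - roughly the pattern Ceph's WAL/journal produces.
# Consumer NVMe without PLP usually collapses here; PLP drives keep their IOPS up.
fio --name=sync-write-test --filename=/var/lib/fio-testfile --size=1G \
    --rw=write --bs=4k --ioengine=psync --sync=1 --numjobs=1 \
    --runtime=60 --time_based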

At minimum, use cache=writeback; it should help avoid small writes where possible (by merging small adjacent writes into bigger ones).
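On the Proxmox side that could look roughly like the following (VM ID 100 and the storage/disk names are made up; adjust them to your own config):

Code:
# Hypothetical example: writeback cache plus discard/iothread/aio on an existing
# Ceph RBD disk of VM 100.
qm set 100 --scsi0 ceph-pool:vm-100-disk-0,cache=writeback,discard=on,iothread=1,aio=threads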