Calculating Journal Size - ceph

ssaman

Hello Support,

we are wondering how to calculate the journal size.

In the documentation we found this:
"osd journal size = 2 * expected throughput * filestore max sync interval"

We have a server with 16 slots.
Currently we have a 1 TB SSD and 6 HDDs.
2 of the HDDs are used for the system.

At the beginning we thought we would use the 1 TB SSD for all the remaining HDDs,
but we found a bottleneck.

We have an SSD with:
Sequential Read: 520 MB/s
Sequential Write: 485 MB/s

and we want to use multiple HDDs with a
sequential read/write of 250 MB/s.

If we use just 2 of the HDDs we already hit the limit of the SSD.
What do you recommend for 1 SSD? How many HDDs should we use?

The second thing we want to ask: is it better to have a high "filestore max sync interval" or a low one?
 
What PVE version are you using? With PVE 5 we support Ceph Luminous, and it uses BlueStore as the OSD backend by default. This makes a journal less relevant, as BlueStore uses a DB + WAL instead and removes the double write that filestore has.
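For example, on PVE 5 a BlueStore OSD can be created like this (/dev/sdX is just a placeholder; without a separate device the DB/WAL simply live on the same disk):
Code:
pveceph createosd /dev/sdX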

and we want to use multiple HDDs with a
sequential read/write of 250 MB/s.
These are very fast HDDs. Are you using a RAID controller (those usually cause trouble)? What results does fio give?

If we use just 2 of the HDDs we already hit the limit of the SSD.
What do you recommend for 1 SSD? How many HDDs should we use?
If the disks are really that fast, then you will probably not need a separate DB+WAL device and could use the SSD for a separate pool or for the OS, as the Ceph monitors (MON DB) benefit from faster access times.

https://pve.proxmox.com/pve-docs/chapter-pveceph.html
https://forum.proxmox.com/threads/ceph-raw-usage-grows-by-itself.38395/#post-189842

EDIT:
Usually there is no reason to use filestore, other than when converting an existing filestore OSD to bluestore.
 
What PVE version are you using?
We are using PVE 5 with luminous.

These are very fast HDDs. Are you using a RAID controller (those usually cause trouble)? What results does fio give?
Datasheet of our HDDs: https://www.hgst.com/sites/default/files/resources/Ultrastar-He10-DS.pdf
Yes, we use a RAID controller:
Code:
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)

What results does fio give?
I have never used fio before. I hope this is okay:
Code:
root@node1:/mnt/sda# fio  --rw=readwrite --name=test --size=1000M --direct=1 --bs=1024k
test: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 1000MB)
Jobs: 1 (f=1): [M(1)] [100.0% done] [43008KB/66560KB/0KB /s] [42/65/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3476: Thu Feb 22 12:36:17 2018
  read : io=500736KB, bw=79977KB/s, iops=78, runt=  6261msec
    clat (msec): min=1, max=36, avg= 7.78, stdev= 8.64
     lat (msec): min=1, max=36, avg= 7.78, stdev= 8.64
    clat percentiles (usec):
     |  1.00th=[ 1304],  5.00th=[ 1352], 10.00th=[ 1384], 20.00th=[ 1416],
     | 30.00th=[ 1480], 40.00th=[ 3408], 50.00th=[ 4896], 60.00th=[ 5216],
     | 70.00th=[ 5280], 80.00th=[14784], 90.00th=[22912], 95.00th=[27776],
     | 99.00th=[33024], 99.50th=[33536], 99.90th=[36608], 99.95th=[36608],
     | 99.99th=[36608]
  write: io=523264KB, bw=83575KB/s, iops=81, runt=  6261msec
    clat (msec): min=1, max=420, avg= 4.77, stdev=19.22
     lat (msec): min=1, max=420, avg= 4.80, stdev=19.22
    clat percentiles (usec):
     |  1.00th=[ 1288],  5.00th=[ 1336], 10.00th=[ 1368], 20.00th=[ 1464],
     | 30.00th=[ 1528], 40.00th=[ 1656], 50.00th=[ 2160], 60.00th=[ 2544],
     | 70.00th=[ 3248], 80.00th=[ 4960], 90.00th=[ 8512], 95.00th=[12608],
     | 99.00th=[37632], 99.50th=[48384], 99.90th=[419840], 99.95th=[419840],
     | 99.99th=[419840]
    lat (msec) : 2=39.30%, 4=19.90%, 10=24.10%, 20=9.90%, 50=6.60%
    lat (msec) : 100=0.10%, 500=0.10%
  cpu          : usr=0.24%, sys=1.53%, ctx=1005, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=489/w=511/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=500736KB, aggrb=79977KB/s, minb=79977KB/s, maxb=79977KB/s, mint=6261msec, maxt=6261msec
  WRITE: io=523264KB, aggrb=83575KB/s, minb=83575KB/s, maxb=83575KB/s, mint=6261msec, maxt=6261msec

Disk stats (read/write):
  sda: ios=1920/2164, merge=0/3, ticks=12068/10348, in_queue=22500, util=96.75%

We bought a new SSD yesterday: http://www.samsung.com/semiconductor/ssd/enterprise-ssd/MZPLL1T6HEHP/
We now want to use this one for 7 of the mentioned HDDs.

Do we have to create 7 partitions on the SSD, or is it fine for Ceph if we just point each of the HDDs at the one SSD?

We would set "osd journal size" to roughly 200 GB.


If we need more space, we will buy a second SSD of this type and add 7 more HDDs.
Would you recommend this setup?
 
We are using PVE 5 with luminous.
So the default OSD backend will be BlueStore, and then there is no journal anymore. BlueStore uses a DB and a WAL (write-ahead log). Here you need to test whether you will benefit at all from a separate DB+WAL device. You also need to keep in mind that if the SSD dies, all OSDs that had their DB+WAL on it are dead too. It is also not possible to swap such OSDs into different servers because of this.
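If you later want to see which OSDs point their DB/WAL at a given device, a quick check (assuming the standard OSD paths) is to look at the block.db symlinks:
Code:
ls -l /var/lib/ceph/osd/ceph-*/block.db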

Yes, we use a RAID controller:
Set your controller to IT mode; RAID0 or JBOD still use the controller features and often lead to performance degradation or starvation. Both result in blocked IO on Ceph, which all VMs/CTs cluster-wide will notice.

READ: io=500736KB, aggrb=79977KB/s, minb=79977KB/s, maxb=79977KB/s, mint=6261msec, maxt=6261msec WRITE: io=523264KB, aggrb=83575KB/s, minb=83575KB/s, maxb=83575KB/s, mint=6261msec, maxt=6261msec
This sounds more like it, ~80 MB/s (aggrb = aggregated bandwidth).
Code:
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio-disk
Typically I would use something like this to test the disks and vary 'bs' and 'rw' from 4k to 4M. Still, your test is fine too.
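As an example of varying 'bs', a small-block sync write run would look like this (careful, like the command above it writes directly to the raw disk):
Code:
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio-disk-4k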

We bought a new SSD yesterday: http://www.samsung.com/semiconductor/ssd/enterprise-ssd/MZPLL1T6HEHP/
We now want to use this one for 7 of the mentioned HDDs.
I recommend that you use it as a fast Ceph pool and not add DB+WAL onto it; it will surely be underutilized if you do.
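As a rough sketch of such a fast pool on Luminous (rule and pool names and the PG count are only placeholders; the device class may show up as 'ssd' or 'nvme', check 'ceph osd tree'):
Code:
ceph osd crush rule create-replicated fast-rule default host nvme
ceph osd pool create fast-pool 64 64 replicated fast-rule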

Do we have to create 7 partitions on the SSD, or is it fine for Ceph if we just point each of the HDDs at the one SSD?

We would set "osd journal size" to roughly 200 GB.

If we need more space, we will buy a second SSD of this type and add 7 more HDDs.
Would you recommend this setup?
You don't have to; it will not give a massive performance boost, though that greatly depends on your usage pattern and the load on the system. In general the NVMe will be bored and could be put to better use on its own. Bottom line: you need to test this on your setup.

If you still decide to use it as a DB+WAL device, then divide your NVMe between all the OSDs that you want to hook up to it and create evenly sized partitions (keep the total speed of the NVMe in mind). Use each partition for the DB of one OSD; the WAL will be placed on the fastest disk of an OSD (DB=NVMe [+WAL], data=HDD). The DB will continue to grow on the NVMe partition and spills onto the HDD once the partition on the NVMe is full.
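A rough sketch of that layout on the CLI (device names and the partition size are only placeholders, and the ceph-volume call is just one way to do it; the PVE GUI/pveceph can be used instead):
Code:
# one evenly sized DB partition per OSD, repeat for each of the 7 OSDs
sgdisk --new=0:0:+200G --typecode=0:8300 /dev/nvme0n1
# create the OSD with data on the HDD and the DB on the NVMe partition
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1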
 
If you like the performance of the 'filestore design with journal', you need to set up bcache, as mentioned somewhere on the forum. Moving the DB + WAL to SSD didn't improve write speed by a noticeable factor.
 
@Jarek, bcache is a totally different hammer and is not really known to improve Ceph performance or to simplify the already complex setup. Note also that data safety cannot be guaranteed with bcache. To spare yourself headaches and lessen complexity, I advise against the use of bcache.

It is still possible to use filestore in Ceph, but since Luminous, bluestore is the standard backing store and filestore is no longer the focus of development. You have different options to increase the RAM used for caching or to change bluestore behavior.
http://docs.ceph.com/docs/luminous/rados/configuration/bluestore-config-ref/
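For example, the BlueStore cache size per OSD can be raised in ceph.conf (the values here are only an illustration; see the link above for the actual options and defaults):
Code:
[osd]
     bluestore_cache_size_hdd = 2147483648   # 2 GiB per HDD-backed OSD
     bluestore_cache_size_ssd = 4294967296   # 4 GiB per SSD-backed OSD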
 
So the default OSD backend will be BlueStore, and then there is no journal anymore. BlueStore uses a DB and a WAL (write-ahead log).
How do we have to configure this then?
When I am in my Proxmox GUI under Ceph > OSD, why can I choose this even though it is not needed?
(broken screenshot link: d2cgQ7)


We really do not know how we should configure this new bluestore setting.

Set your controller to IT mode; RAID0 or JBOD still use the controller features and often lead to performance degradation or starvation.
We have an onboard RAID controller and we cannot set the mode to "IT" because we use 2 SSDs in a RAID1 for the OS. The other disks are set to JBOD; I hope this will not hurt our performance.

Typically I would use something like this to test the disks and vary 'bs' and 'rw' from 4k to 4M. Still, your test is fine too.
We made another test with your recommended settings:
Code:
Run status group 0 (all jobs):
  WRITE: io=120464MB, aggrb=205585KB/s, minb=205585KB/s, maxb=205585KB/s, mint=600018msec, maxt=600018msec

EDIT: If we don't need a journal anymore, do we still have to calculate the journal size?
 
How do we have to configure this then?
When I am in my Proxmox GUI under Ceph > OSD, why can I choose this even though it is not needed?
(broken screenshot link: d2cgQ7)


We really do not know how we should configure this new bluestore setting.
Bluestore is the default backend when creating an OSD. Sadly the image link is broken, so I don't know where you can choose what. But I guess you mean that you can set a separate WAL/DB device. You don't need to; if you leave it blank, the DB/WAL will be put on the same disk.
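If you want to double-check what backend an OSD ended up with, the OSD metadata shows it (osd.0 is just an example):
Code:
ceph osd metadata 0 | grep osd_objectstore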

We have an onboard RAID controller and we cannot set the mode to "IT" because we use 2 SSDs in a RAID1 for the OS. The other disks are set to JBOD; I hope this will not hurt our performance.
Some controllers do interfere, others don't; that has to be tested.

We made another test with your recommended settings:
What was the fio command line and which disk did you test?
 
Bluestore is the default backend when creating an OSD. Sadly the image link is broken, so I don't know where you can choose what. But I guess you mean that you can set a separate WAL/DB device. You don't need to; if you leave it blank, the DB/WAL will be put on the same disk.
Thank you, that was what I was asking for.

Something else we ask ourselves: do we have to calculate or change the journal size if we use bluestore, or is it independent of "osd journal size" in the ceph.conf?

What was the fio command line and which disk did you test?
Code:
fio --ioengine=libaio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=600 --time_based --group_reporting --name=fio-disk
It was one of our HDDs.
 
The 'osd journal size' is only for filestore. The DB expands as long as there is space, and the WAL is usually 512 MB. But as everything is on the same disk, the OSD takes care of it.
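If you are curious how much space the DB and WAL actually use on a running OSD, the bluefs perf counters show it (run on the node hosting the OSD; counter names can differ slightly between releases):
Code:
ceph daemon osd.0 perf dump | grep -E 'db_used_bytes|db_total_bytes|wal_used_bytes'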
 
