Bluestore OSD on SSD

Jan 16, 2018
Hi,

I currently have 3 hosts with 2 OSDs (Bluestore) each on Samsung 960 Pro 1 TB NVMe SSDs;
these are M.2 NVMe modules on a PCIe adapter from Delock.

Read speed (e.g. with rados bench) looks really great, but write speed isn't too good.

Are there any tuning parameters for the OSDs which might help with write speed? Or does it make sense to pack multiple OSDs onto one SSD (some sources on the net say that a single OSD alone can't keep up with an SSD and recommend up to 4 OSDs per SSD)?

If such a configuration is worth a try, how can we do that with Bluestore? I could only find instructions on the net for Filestore, not for Bluestore.

Or does it make more sense to configure these OSDs as Filestore instead, possibly with 2 or 4 per SSD?

Sincerely,
Klaus
 
Read speed (e.g. with rados bench) looks really great, but write speed isn't too good.
What are the results?

Are there any tuning parameters for the OSDs which might help with write speed? Or does it make sense to pack multiple OSDs onto one SSD (some sources on the net say that a single OSD alone can't keep up with an SSD and recommend up to 4 OSDs per SSD)?
In the comments of Sébastien Han's blog, someone ran a couple of fio tests with the Samsung 960 Pro. It is questionable whether there are any noticeable performance improvements from putting multiple OSDs onto one device. The adapter also adds some latency on top of that.

https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/
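For reference, a fio run along the lines of the sync write test described in that post would look roughly like this; the device name is only a placeholder, and the run destroys whatever data is on it:
Code:
# 4k direct, synchronous writes against the raw device (overwrites data on /dev/nvme0n1!)
fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test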

EDIT: I removed the pastebin link, as it shows the wrong SSD for the test.
 
3 hosts with 2 Bluestore OSDs on a 960 Pro each:

Code:
root@sal-ha-pve01:~# ceph osd pool create scbench 100 100
pool 'scbench' created
root@sal-ha-pve01:~# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sal-ha-pve01_4101
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        34        18   71.9959        72    0.303262    0.566175
    2      16        65        49   97.9862       124    0.484239    0.564753
    3      16        95        79   105.319       120    0.572162    0.554339
    4      16       125       109   108.984       120     0.33221    0.533316
    5      16       163       147   117.583       152    0.232914    0.509752
    6      16       202       186   123.983       156    0.170877    0.500039
    7      16       235       219   125.125       132    0.456458    0.498587
    8      16       270       254   126.982       140    0.534674    0.494343
    9      16       295       279   123.982       100    0.441698    0.498241
   10      16       328       312   124.782       132    0.298143    0.494277
Total time run:         10.217814
Total writes made:      329
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     128.795
Stddev Bandwidth:       24.7153
Max bandwidth (MB/sec): 156
Min bandwidth (MB/sec): 72
Average IOPS:           32
Stddev IOPS:            6
Max IOPS:               39
Min IOPS:               18
Average Latency(s):     0.496611
Stddev Latency(s):      0.165845
Max latency(s):         0.906973
Min latency(s):         0.0631


Code:
root@sal-ha-pve01:~# rados bench -p scbench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       494       478   1911.62      1912   0.0800071   0.0318395
    2      15       978       963   1925.57      1940   0.0113598   0.0317749
    3      15      1467      1452   1935.61      1956   0.0413051   0.0320339
    4      16      1971      1955   1954.62      2012  0.00345103   0.0317834
    5      16      2494      2478   1982.05      2092  0.00222354   0.0313682
    6      15      2995      2980   1986.32      2008    0.013674   0.0312503
    7      15      3542      3527   2015.09      2188   0.0157908   0.0308885
    8      16      4053      4037   2018.17      2040  0.00344412   0.0308786
    9      15      4562      4547   2020.57      2040   0.0258464   0.0308605
   10      16      5079      5063   2024.88      2064   0.0178876   0.0307639
Total time run:       10.054077
Total reads made:     5079
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2020.67
Average IOPS:         505
Stddev IOPS:          20
Max IOPS:             547
Min IOPS:             478
Average Latency(s):   0.0309307
Max latency(s):       0.174553
Min latency(s):       0.0021353
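The write benchmark was run with --no-cleanup so that the rand test had objects to read, which means the benchmark objects remain in the pool afterwards. A minimal cleanup sketch, assuming the test pool scbench from above is no longer needed:
Code:
# remove the objects left behind by rados bench
rados -p scbench cleanup
# drop the test pool entirely (requires mon_allow_pool_delete=true on the monitors)
ceph osd pool delete scbench scbench --yes-i-really-really-mean-it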
 
Check out our benchmark thread and, if possible, do a fio test (the command line is in the PDF) on the raw NVMe (be aware, it destroys the OSD on it); then you can compare the results. https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

You may gain some performance if you put multiple OSDs on one NVMe, but the Samsung 960 Pro or EVO are not enterprise-class NVMe SSDs.
 
I have a server with 8 disks and an NVMe SSD (M.2 with a Delock PCIe adapter) which contains the boot partition and the Bluestore WAL and DB partitions. I prepared the NVMe with several partitions

[Screenshot: NVMe partition layout]
and used this script to create my OSDs
Code:
#!/bin/bash
# Usage: ./bluestore-prepare.sh sda nvme0n1 0
# Creates osd.0 on /dev/sda with block.wal on /dev/nvme0n1p4 and block.db on /dev/nvme0n1p14

DEV=$1                      # data disk, e.g. sda
BLU=$2                      # NVMe device holding the WAL/DB partitions, e.g. nvme0n1
OSD=$3                      # OSD number
WALPARTN=$(($OSD + 4))      # WAL partitions start at partition 4
DBPARTN=$(($WALPARTN + 10)) # DB partitions start 10 partitions after the WAL partitions

# Show what will be used
echo $DEV,$BLU,$OSD,$WALPARTN,$DBPARTN

# Name the partitions and set the Ceph block.wal/block.db partition type GUIDs
sgdisk -c $WALPARTN:"osd.$OSD.wal" --typecode=$WALPARTN:5ce17fce-4087-4169-b7ff-056cc58473f9 /dev/$BLU
sgdisk -c $DBPARTN:"osd.$OSD.db" --typecode=$DBPARTN:30cd0809-c2b2-499c-8879-2d6b78529876 /dev/$BLU

# Zap the data disk and create the OSD on it, with WAL and DB on the NVMe partitions
ceph-disk prepare --zap-disk --bluestore /dev/$DEV --block.wal /dev/disk/by-partlabel/osd.$OSD.wal --block.db /dev/disk/by-partlabel/osd.$OSD.db
ceph-disk -v activate /dev/${DEV}1
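A hypothetical example run for the first two of the eight data disks; the disk names and OSD numbers are only placeholders and must match the actual layout:
Code:
./bluestore-prepare.sh sda nvme0n1 0   # osd.0 on /dev/sda, WAL on nvme0n1p4, DB on nvme0n1p14
./bluestore-prepare.sh sdb nvme0n1 1   # osd.1 on /dev/sdb, WAL on nvme0n1p5, DB on nvme0n1p15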
 
I know how to put DB partitions (as I understand from the documentation, they are used for the WAL too) onto SSD/NVMe partitions. What I wanted to know is how to put multiple complete OSDs onto a single NVMe with Bluestore.
 
Well, I guess the easiest way would be to use the new ceph-volume tool. It works with LVM and adds metadata to the logical volume instead of using the 100 MB XFS partition. It will also replace ceph-disk in the next release, Mimic.
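A minimal sketch of how two Bluestore OSDs could be put on one NVMe that way; the device, volume group, LV names and sizes are only examples:
Code:
# carve the NVMe into one volume group with two logical volumes
pvcreate /dev/nvme0n1
vgcreate ceph-nvme0 /dev/nvme0n1
lvcreate -L 450G -n osd-block-0 ceph-nvme0
lvcreate -L 450G -n osd-block-1 ceph-nvme0

# create one Bluestore OSD per logical volume
ceph-volume lvm create --bluestore --data ceph-nvme0/osd-block-0
ceph-volume lvm create --bluestore --data ceph-nvme0/osd-block-1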
 
