How to benchmark ceph storage

Hello!
I have set up (and configured) Ceph on a 3-node cluster.

All nodes have
- 48 HDDs
- 4 SSDs

For best performance I defined every HDD as a data device and the SSDs as log (journal) devices.
This means I created 12 partitions on each SSD and created the OSDs like this on node A:
pveceph createosd /dev/sda -journal_dev /dev/sde1
pveceph createosd /dev/sdb -journal_dev /dev/sde2
pveceph createosd /dev/sdc -journal_dev /dev/sde3
pveceph createosd /dev/sdd -journal_dev /dev/sde4

where sde is an SSD.

On node B I created the OSDs slightly differently:
pveceph createosd /dev/sda -journal_dev /dev/sde1
pveceph createosd /dev/sdb -journal_dev /dev/sdf1
pveceph createosd /dev/sdc -journal_dev /dev/sdg1
pveceph createosd /dev/sdd -journal_dev /dev/sdh1


Question:
How can I benchmark the different OSDs (I intend to run fio)?
I thought I would create 2 different pools and assign the relevant OSDs to them.
However, I don't know how to set this up.
And finally, I don't know how to run the fio benchmark against the storage.

Any advice is appreciated.
 
For benchmarking Ceph, look here: https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

To get an idea of the different setups, you can do a 'ceph tell osd.* bench' to see what the write performance of the individual OSDs is. It writes 1 GB of data to each OSD.
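If I remember correctly, the bench command also takes optional arguments for the total amount of data written and the block size (the defaults are 1 GB and 4 MB), so you can run it against a single OSD as well, e.g.:

ceph tell osd.0 bench
ceph tell osd.0 bench 1073741824 4194304

The second line writes the same 1 GB, just with the sizes given explicitly.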

I'm aware of this document.
However, my question starts earlier, namely with the setup/configuration of the OSDs.

When searching the internet, I came across these keywords:
  • Bucket index on NVMe / SSD
  • Intel CAS
  • Caching metadata in flash media
  • FlashCache
  • Cache Tiering
You can ignore "Intel CAS" for now; this technology is well documented by Intel.

However, what about the other items?
Do they have anything in common?
How can I make use of them, i.e. what must be configured in the CRUSH map etc.?
What is reasonable in an architecture with 48 HDDs that is mainly used for DB backups?

I'm willing to publish benchmark results with Proxmox, but I need your input in order to deliver comparable results.

THX
 
What part of Ceph do you intend to use: RGW, RBD, CephFS? Some of these technologies are good for parts of Ceph but not all.
 
Hi,
I'm going to use RBD only.
The main applications I want to serve are:
1. KVM / CT storage
2. DB backup (Red Hat calls this a large-object payload (64 MB) here.)
 
1. KVM / CT storage
A possible benefit could be to put the DB/WAL onto a faster storage device, or to create separate pools with different device classes (nvme/ssd); see the sketch at the end of this post.

DB backup (Red Hat calls this a large-object payload (64 MB) here.)
This doesn't tell us which service you want to use, as large objects can be stored with any of them. RGW or CephFS may profit from cache tiering, and RGW from keeping the bucket index on a faster device.

IMHO, I discourage using FlashCache or CAS. They introduce more complexity and more latency, KVM/CT data is always hot, and Ceph already has caching built in.
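To illustrate the device-class idea from above, a rough sketch (the rule names, pool names and PG counts are only placeholders; adjust them to your cluster):

ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd pool create bench_hdd 1024 1024 replicated replicated_hdd
ceph osd pool create bench_ssd 128 128 replicated replicated_ssd

With device classes you don't need to edit the CRUSH map by hand; the OSDs report their class (hdd/ssd) automatically and the rules above select them.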
 
I decided to create the OSDs with dedicated devices for the WAL and the journal:
pveceph createosd /dev/sdh -bluestore -journal_dev /dev/sdx1 -wal_dev /dev/sdy1
where /dev/sdx1 and /dev/sdy1 are 5 GB partitions on an SSD.

Question:
Is it correct that /dev/sdx1 and /dev/sdy1 are empty block devices (without any filesystem)?

I can see the links have been created accordingly:
root@ld4464:~# ls -l /var/lib/ceph/osd/ceph-3/
total 60
-rw-r--r-- 1 root root 402 Jul 23 15:38 activate.monmap
-rw-r--r-- 1 ceph ceph 3 Jul 23 15:39 active
lrwxrwxrwx 1 ceph ceph 58 Jul 23 15:38 block -> /dev/disk/by-partuuid/37a5e5ac-d2cb-402a-bab3-97d90c0d1945
lrwxrwxrwx 1 ceph ceph 9 Jul 23 15:38 block.db -> /dev/sdx1
-rw-r--r-- 1 ceph ceph 37 Jul 23 15:38 block.db_uuid
-rw-r--r-- 1 ceph ceph 37 Jul 23 15:38 block_uuid
lrwxrwxrwx 1 ceph ceph 9 Jul 23 15:38 block.wal -> /dev/sdy1
-rw-r--r-- 1 ceph ceph 37 Jul 23 15:38 block.wal_uuid
-rw-r--r-- 1 ceph ceph 2 Jul 23 15:38 bluefs
-rw-r--r-- 1 ceph ceph 37 Jul 23 15:38 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Jul 23 15:38 fsid
-rw------- 1 ceph ceph 56 Jul 23 15:38 keyring
-rw-r--r-- 1 ceph ceph 8 Jul 23 15:38 kv_backend
-rw-r--r-- 1 ceph ceph 21 Jul 23 15:38 magic
-rw-r--r-- 1 ceph ceph 4 Jul 23 15:39 mkfs_done
-rw-r--r-- 1 ceph ceph 6 Jul 23 15:39 ready
-rw-r--r-- 1 ceph ceph 0 Jul 23 17:24 systemd
-rw-r--r-- 1 ceph ceph 10 Jul 23 15:38 type
-rw-r--r-- 1 ceph ceph 2 Jul 23 15:38 whoami


The SSD:HDD ratio is quite low, i.e. 1:12.
Therefore I'm planning to create 24 partitions on each SSD.
 
I decided to create the OSDs with dedicated devices for the WAL and the journal:
pveceph createosd /dev/sdh -bluestore -journal_dev /dev/sdx1 -wal_dev /dev/sdy1
where /dev/sdx1 and /dev/sdy1 are 5 GB partitions on an SSD.
For your setup (48 HDDs / 4 SSDs) I would suggest creating 12 evenly sized partitions per SSD, as the WAL is written to the fastest device of the OSD anyway and the DB grows on its dedicated partition until it runs out of space (spill-over to the HDD).
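A rough sketch of how such a layout could be created with sgdisk (the device name and partition size are only examples; pick a size so that 12 partitions fill your SSD evenly):

for i in $(seq 1 12); do
    sgdisk --new=${i}:0:+40G --change-name=${i}:"ceph block.db" /dev/sde
done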

Is it correct that /dev/sdx1 and /dev/sdy1 are empty block devices (without any filesystem)?
Bluestore OSDs write their object data directly to the disk; no filesystem is needed. The 100 MB XFS partition is only for the metadata (e.g. links to the devices). It will be replaced by LVM metadata in future releases (e.g. Mimic).
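If you want to double-check that the raw partitions were picked up, 'ceph-bluestore-tool show-label' should print the BlueStore label of a device (the paths below are just the examples from above):

ceph-bluestore-tool show-label --dev /dev/sdx1
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-3/block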
 
For your setup (48 HDDs / 4 SSDs) I would suggest creating 12 evenly sized partitions per SSD, as the WAL is written to the fastest device of the OSD anyway and the DB grows on its dedicated partition until it runs out of space (spill-over to the HDD).


Bluestore OSDs write their object data directly to the disk; no filesystem is needed. The 100 MB XFS partition is only for the metadata (e.g. links to the devices). It will be replaced by LVM metadata in future releases (e.g. Mimic).

Does this mean you don't recommend putting the WAL and the journal on different partitions, but rather on the same partition?
 
For ease of setup and to retain an even distribution among the SSDs, I would recommend my suggestion from above. Putting the WAL on a separate device may make sense if you have a device capable of fast small-block (4K) writes, like an NVMe device (e.g. 3D XPoint).
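In that case the call could look like this (the NVMe partition name is hypothetical, the other devices are the examples from above):

pveceph createosd /dev/sdh -bluestore -journal_dev /dev/sdx1 -wal_dev /dev/nvme0n1p1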
 
For ease of setup and to retain an even distribution among the SSDs, I would recommend my suggestion from above. Putting the WAL on a separate device may make sense if you have a device capable of fast small-block (4K) writes, like an NVMe device (e.g. 3D XPoint).

Unfortunately, there's a severe issue with OSD activation when putting the WAL and the journal on the same partition.
See here for more details.
 
So.
All issues with the creation of the OSDs have been sorted out.

In the meantime I created 2 pools in order to benchmark the different disks available in the cluster.
One pool is intended to be used for PVE storage (VM and CT), and the relevant storage type "RBD (PVE)" was created automatically when creating the pool.

However, I cannot display any information about this RBD; the command
root@ld4257:~# rbd info pve/pve_ct
does not finish.

This results in 2 questions:
1. Why does rbd info not return anything?
2. How can I benchmark this storage with fio?

THX
 
1. Why does rbd info not return anything?
Storage pools in PVE are not the same as pools in Ceph. 'pve_ct' and 'pve_vm' point to the same Ceph pool, 'pve'. And 'rbd info <pool>/<image>' needs an existing image (disk) to give you information about it.
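For example, after creating a test image, 'rbd info' returns its details (the image name and size are arbitrary):

rbd create pve/testvol --size 10G
rbd info pve/testvol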

2. How can I benchmark this storage with fio?
Check out our benchmark paper. For fio you can use librbd as the engine with an image, or map an RBD image and use it as a block device.
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
https://tracker.ceph.com/projects/c...ter_Performance#Benchmark-a-Ceph-Block-Device
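
A minimal fio job file sketch using the librbd engine, assuming the test image from above and that your fio build has rbd support (save it e.g. as rbd-bench.fio and run 'fio rbd-bench.fio'):

[rbd-write]
ioengine=rbd
clientname=admin
pool=pve
rbdname=testvol
rw=write
bs=4M
iodepth=32
direct=1
runtime=60
time_based

For small-block tests, change bs to 4k and rw to randwrite or randread.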
 
