Ceph recommendation

Norman Uittenbogaart

Hi, we have 3 nodes running Ceph with Bluestore OSDs.
The disks are HDDs.
We get mediocre performance out of this.

Currently the nodes are also running the containers and VMs.
I would like to introduce a new node which will run all containers and VMs and no OSDs.
Hopefully this will increase performance.
But is there a way with Bluestore to increase performance in the Ceph cluster?
There is no room in the nodes to add an extra SSD for a journal or the like.
Of course there is room in the new node, but I don't know if it is possible to put a local buffer of sorts there?

What would you advise?
 
If you have just three nodes and a few OSDs, the only way to get acceptable performance is to use SSDs.

Bluestore does not really boost your HDDs.
 
Try to increase the read-ahead for the OSD HDDs. It is possible to gain a bit of read speed...

echo "2048" > /sys/block/sda/queue/read_ahead_kb

Also, in ceph.conf

osd max backfills = 1
osd recovery max active = 10
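
Those ceph.conf values take effect when the OSDs restart. If you want to try them at runtime first, injecting them on the fly should also work (a sketch, not tested here):

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 10'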

PS: How many HDD OSDs do you have? What is the Ceph network speed? Have you tried jumbo frames?
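
For jumbo frames, assuming a dedicated Ceph interface (eth1 and the address below are only placeholders), the MTU can be raised in /etc/network/interfaces on every node; the switch has to allow it as well:

auto eth1
iface eth1 inet static
    address 10.10.10.11
    netmask 255.255.255.0
    mtu 9000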
 
Hi, I have 3 HDD OSDs on each node.
I will try the above settings.

The network speed is 2Gb/s, so roughly 250MB/s should give me enough bandwidth.
But currently it seems slower than running a server with just a single normal HDD.

Yes, you cannot expect more with HDDs unless you have a lot of them (100+).
 
We run 8 nodes, each with two 2TB HDD OSDs (16 HDD OSDs in total) and the journal on an Intel DC SSD, a 2Gbps Ceph network, and the latest versions of Proxmox and Ceph. Performance and redundancy are more than OK. The last benchmark looks like this.

WRITE
Total time run: 60.614986
Total writes made: 2627
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 173.356
Stddev Bandwidth: 22.4167
Max bandwidth (MB/sec): 220
Min bandwidth (MB/sec): 108
Average IOPS: 43
Stddev IOPS: 5
Max IOPS: 55
Min IOPS: 27
Average Latency(s): 0.368527
Stddev Latency(s): 0.172779
Max latency(s): 1.34598
Min latency(s): 0.105797

SEQ READ
Total time run: 45.769104
Total reads made: 2627
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 229.587
Average IOPS: 57
Stddev IOPS: 3
Max IOPS: 65
Min IOPS: 51
Average Latency(s): 0.277923
Max latency(s): 1.44691
Min latency(s): 0.0176667
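
For reference, numbers like these can be reproduced with rados bench against a test pool (the pool name 'test' is just an example; --no-cleanup keeps the objects around for the read test):

rados bench -p test 60 write --no-cleanup
rados bench -p test 60 seq
rados -p test cleanup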

PS: A 3-node Ceph cluster is not recommended for a production infrastructure.
 
PG Calc recommends a pg_num of 512:
9 OSDs, size 3, 100% data, target 200 PGs per OSD.
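
That follows from the usual PGCalc formula: total PGs ≈ (OSD count × target PGs per OSD × %data) / replica size, rounded to a power of two. With these numbers, (9 × 200 × 1.0) / 3 = 600, which rounds to 512.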

A wrong pg_num can generate performance issues.

You can increase this number, but you can't go back without destroying the pool and losing its data. Increasing pg_num also creates a rebalancing storm in your cluster and eats a lot of IO.

Here is the documentation on pgs.
 

When you choose a target of 100 PGs per OSD ("if the cluster OSD count is not expected to increase in the foreseeable future"), the recommended pg_num is 256, so the size should be OK.
I do have the OS running on a small SSD in all 3 nodes.
I could partition it to have a bit of space for a DB / WAL partition; would that make a lot of difference?
Is it possible to move / change that now, or will I have to recreate the cluster?
 
It is mandatory to choose the value of pg_num because it cannot be calculated automatically. Here are a few commonly used values (an example of applying them follows the list):

  • Less than 5 OSDs: set pg_num to 128
  • Between 5 and 10 OSDs: set pg_num to 512
  • Between 10 and 50 OSDs: set pg_num to 1024
  • If you have more than 50 OSDs, you need to understand the tradeoffs and calculate the pg_num value yourself
  • To calculate the pg_num value yourself, use the PGCalc tool
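
A minimal sketch of applying such a value, assuming a replicated pool named 'rbd' (the pool name is just an example):

ceph osd pool create rbd 512 512    # new pool with pg_num and pgp_num of 512
ceph osd pool set rbd pg_num 512    # or raise it on an existing pool...
ceph osd pool set rbd pgp_num 512   # ...pgp_num has to follow pg_num for data to rebalance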
 
I don't like the idea of using the OS drive for the OSDs' journal as well. If you lose that SSD, you lose everything: the OSDs and the node in the Proxmox cluster.

To "move" the journal off its own HDD, you must remove the OSD and re-create it with the new layout.
In theory you can do that for each OSD, one by one, and wait for the cluster to heal after each OSD is rebuilt.
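
A rough sketch of that per-OSD procedure (OSD ID 0 and the device names are placeholders; check the exact commands against the docs for your Ceph release):

ceph osd out 0
# wait for rebalancing to finish; "ceph osd safe-to-destroy osd.0" should agree
systemctl stop ceph-osd@0
ceph osd purge 0 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdb --destroy
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1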
 
