Ceph recommendation

Norman Uittenbogaart

Hi, we have 3 nodes running Ceph with Bluestore OSDs.
The disks are HDDs.
We get mediocre performance out of this.

Currently the nodes are also running the containers and VMs.
I would like to introduce a new node which will run all containers and VMs and no OSDs.
Hopefully this will increase performance.
But is there a way with Bluestore to increase performance in the Ceph cluster?
There is no room in the nodes to add an extra SSD for a journal or the like.
Of course there is room in the new node, but I don't know if it is possible to put a local buffer of sorts there?

What would you advise?
 
If you have just three nodes and a few OSDs, the only way to get acceptable performance is to use SSDs.

Bluestore does not really boost your HDDs.
 
Try to increase the read-ahead for the OSD HDDs. It is possible to gain a bit of read speed...

echo "2048" > /sys/block/sda/queue/read_ahead_kb

Also, in ceph.conf

osd max backfills = 1
osd recovery max active = 10
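
Those ceph.conf values take effect when the OSDs restart. If you want to try them at runtime first, injecting them on the fly should also work (a sketch, not tested here):

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 10'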

PS: How many HDD OSDs do you have? What is the Ceph network speed? Have you tried jumbo frames?
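
For jumbo frames, assuming a dedicated Ceph interface (eth1 and the address below are only placeholders), the MTU can be raised in /etc/network/interfaces on every node; the switch has to allow it as well:

auto eth1
iface eth1 inet static
    address 10.10.10.11
    netmask 255.255.255.0
    mtu 9000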
 
Hi, I have 3 HDD OSDs on each node.
I will try the above settings.

The network speed is 2Gb/s, so roughly 250MB/s should give me enough bandwidth.
But currently it seems slower than running a server with just a single normal HDD.

Yes, you cannot expect more with HDDs unless you have a lot of them (100+).
 
We run 8 nodes, each with two 2TB HDD OSDs (16 HDD OSDs in total) and the journal on an Intel DC SSD, a 2Gbps Ceph network, and the latest versions of Proxmox and Ceph. Performance and redundancy are more than OK. The last benchmark looks like this.

WRITE
Total time run: 60.614986
Total writes made: 2627
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 173.356
Stddev Bandwidth: 22.4167
Max bandwidth (MB/sec): 220
Min bandwidth (MB/sec): 108
Average IOPS: 43
Stddev IOPS: 5
Max IOPS: 55
Min IOPS: 27
Average Latency(s): 0.368527
Stddev Latency(s): 0.172779
Max latency(s): 1.34598
Min latency(s): 0.105797

SEQ READ
Total time run: 45.769104
Total reads made: 2627
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 229.587
Average IOPS: 57
Stddev IOPS: 3
Max IOPS: 65
Min IOPS: 51
Average Latency(s): 0.277923
Max latency(s): 1.44691
Min latency(s): 0.0176667
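
For reference, numbers like these can be reproduced with rados bench against a test pool (the pool name 'test' is just an example; --no-cleanup keeps the objects around for the read test):

rados bench -p test 60 write --no-cleanup
rados bench -p test 60 seq
rados -p test cleanup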

PS: A 3-node Ceph cluster is not recommended for a production infrastructure.
 
PG Calc recommends a pg_num of 512:
9 OSDs, size 3, 100% data, target 200 PGs per OSD.
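
That follows from the usual PGCalc formula: total PGs ≈ (OSD count × target PGs per OSD × %data) / replica size, rounded to a power of two. With these numbers, (9 × 200 × 1.0) / 3 = 600, which rounds to 512.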

A wrong pg_num can generate performance issues.

You can increase this number, but you can't go back without destroying the pool and losing its data. Increasing pg_num also creates a rebalancing storm in your cluster and eats a lot of IO.

Here is the documentation on pgs.
 

When you choose a target of 100 PGs per OSD ("if the cluster OSD count is not expected to increase in the foreseeable future"), the recommended pg_num is 256, so the size should be OK.
I do have the OS running on a small SSD in all 3 nodes.
I could partition it to have a bit of space for a DB / WAL partition; would that make a lot of difference?
Is it possible to move / change that now, or will I have to recreate the cluster?
 
It is mandatory to choose the value of pg_num because it cannot be calculated automatically. Here are a few commonly used values (an example of applying them follows the list):

  • Less than 5 OSDs: set pg_num to 128
  • Between 5 and 10 OSDs: set pg_num to 512
  • Between 10 and 50 OSDs: set pg_num to 1024
  • If you have more than 50 OSDs, you need to understand the tradeoffs and calculate the pg_num value yourself
  • To calculate the pg_num value yourself, use the PGCalc tool
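
A minimal sketch of applying such a value, assuming a replicated pool named 'rbd' (the pool name is just an example):

ceph osd pool create rbd 512 512    # new pool with pg_num and pgp_num of 512
ceph osd pool set rbd pg_num 512    # or raise it on an existing pool...
ceph osd pool set rbd pgp_num 512   # ...pgp_num has to follow pg_num for data to rebalance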
 
I don't like the idea of using the OS drive for the OSDs' journal as well. If you lose that SSD, you lose everything: the OSDs and the node in the Proxmox cluster.

To "move" the journal off its own HDD, you must remove the OSD and re-create it with the new layout.
In theory you can do that for each OSD, one by one, and wait for the cluster to heal after each OSD is rebuilt.
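
A rough sketch of that per-OSD procedure (OSD ID 0 and the device names are placeholders; check the exact commands against the docs for your Ceph release):

ceph osd out 0
# wait for rebalancing to finish; "ceph osd safe-to-destroy osd.0" should agree
systemctl stop ceph-osd@0
ceph osd purge 0 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdb --destroy
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1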
 
