Ceph Cluster Hardware Configuration

ejmerkel

I am just curious if the following looks like a decent configuration for a Proxmox VE / Ceph cluster.

3 Dell 730xd servers with the following specs:

CPU: 2 x Xeon E5-2670 v3 @ 2.3GHz
RAM: 256G RDIMM, 2133MT/s, Dual Rank, x4 Data Width
DISK CONTROLLER: PERC H730 RAID (not using for RAID of course - will make all OSD disks RAID0)
Network: Intel Ethernet X540 DP 10Gb + I350 1Gb DP Network Daughter Card
OS Partition: 2x1TB 7.2K SATA - RAID1 (going to use part of this disk for local lvm & ISO/templates)
Journaling Partition: 2x200GB SSD SATA - RAID1, Mixed Use MLC, 6Gbps
OSDs: 6x4TB 7.2K SATA

I realize the OS partition is not an SSD drive as recommended here http://pve.proxmox.com/wiki/Ceph_Server. Is that going to cause any performance issues?

If I use triple replication I should net 19.2TB (80% of 24TB), correct?

My hope is, after some initial Ceph testing, to make this system a production cluster. I will have the capacity to add 4 more 3.5" SATA drives per server.

I appreciate any comments or suggestions.

Best regards,
Eric
 
Hi,

>>I realize the OS partition is not an SSD drive as recommended here http://pve.proxmox.com/wiki/Ceph_Server. Is that going to cause any performance issues?
No problem here. You'll have the Ceph monitor on the root partition, but with the PERC cache it should be OK.


>>If I use triple replication I should net 19.2TB (80% of 24TB), correct?

No, you'll have 8TB (3 copies of each piece of data).

If you want something like RAID-5/6 with parity, Ceph has supported a replication scheme called "erasure coding" since Firefly, but it's not implemented in the Proxmox GUI.
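
For reference, an erasure-coded pool can be created from the Ceph command line with something like this (just a sketch; the profile name "ecprofile", the k/m values and the PG counts are placeholders you would have to choose for your own cluster):

# define an erasure-code profile: 2 data chunks + 1 coding chunk, similar to RAID-5 across 3 hosts
ceph osd erasure-code-profile set ecprofile k=2 m=1
# create a pool that uses this profile (64 placement groups, just as an example)
ceph osd pool create ecpool 64 64 erasure ecprofile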
 
>>No, you'll have 8TB (3 copies of each piece of data).

Just to clarify, each server has 24TB, so between the 3 servers I will have 72TB. So just to make sure I am not misunderstanding, I should end up with 72TB / 3 * 0.80 = 19.2TB?

Best regards,
Eric
 
I read somewhere that you should not use more than 80% of your total disk capacity. I assume this is in case you have failed drives etc?

Eric

There are built-in protection thresholds:

mon osd full ratio = 0.95 (95%)

So when you reach 95%, the cluster stops accepting writes (effectively read-only).

You can tune it if you want.
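
For example in ceph.conf, something like this (just a sketch using the usual default values; nearfull only triggers a health warning, full blocks writes):

[global]
    # warn when an OSD passes 85% used
    mon osd nearfull ratio = .85
    # stop accepting writes when an OSD passes 95% used
    mon osd full ratio = .95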



In case of a failed drive, Ceph will try to re-replicate the missing blocks onto other OSDs on the same node.
 
>>From what I have read you never want to use more than 80% of your total disk capacity. Not really sure why?

Eric
Hi,
this is right - normally you should use much less. There are several reasons:

1. If one node fails, its content must fit on the remaining nodes before mon_osd_full_ratio comes into play.

2. Performance drops with full disks - especially with XFS. It's helpful to use mount options to avoid fragmentation (see the ceph.conf sketch after this list):
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
I switched my cluster from XFS to ext4 (I moved an archive to an external RAID first) - now I'm moving the content back to reach the +60% fill again. Then I'll see whether the performance is better than with XFS, but it seems so.

3. A few versions ago, the surest way to stop Ceph from working was a full OSD!! Perhaps it's better now, but I would avoid this in a production environment.

4. The disks are not filled evenly. There are sometimes huge differences (depending on your workload - e.g. VM disks that are only partly filled). If this happens during the rebuild of a failed node, you can easily hit mon_osd_full_ratio on single disks while other disks still have plenty of free space...
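
The mount-option line from point 2 goes into the [osd] section of ceph.conf, roughly like this (a sketch; on newer kernels delaylog is the default behaviour and the option is deprecated, so you may have to drop it):

[osd]
    # same line as in point 2 above
    osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"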

If your cluster is 70% filled, you should already have ordered another OSD node.

Udo
 
>>If I use triple replication I should net 19.2TB (80% of 24TB), correct?
Hi,
if you format a 4TB drive with XFS, you will get a Ceph weight of 3.64 per disk, and 3.58 for ext4 (depending on your mkfs options).

E.g. you will get 21.84 TB (or 21.48 TB for ext4) per node.

If one node is allowed to fail, the other nodes need the space for the rebuild: around 60% would be the max usable space if your mon_osd_full_ratio is high (like 0.95).
If you are able to bring the failed node back fast, perhaps you can go higher?!
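
To put rough numbers on that (just my reading of Udo's figures: 6x4TB OSDs per node, 3 nodes, 3x replication, XFS):

per node:       6 x 3.64 TB  = 21.84 TB raw
whole cluster:  3 x 21.84 TB = 65.5 TB raw
after 3 copies: 65.5 TB / 3  = 21.8 TB
at 80% fill:    ~17.5 TB usable
at ~60% fill:   ~13 TB usable (leaving headroom for a rebuild)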


Udo
 
