[SOLVED] HW planning CEPH 3 x OSD cluster for PVE 4.2 / PVE 5.x

Denny Fuchs · May 24, 2016

hi,

we planning a CEPH storage cluster to use with PVE 4.2 and after a lot of reading, what for HW we should use for our lab, we consider to following:

The basis layout:

3 x OSD nodes with 10 x consumer SSD and / or seagate spinning disks
3 x nodes acting as MON and PVE host

Details per OSD node:

Asus Z10PR-D16
CPU Intel Xeon E5-2620v3
Chassis RSC-2AH
Storage SSD Cache: Seagate MLC 200GB (ST200FM0053)
Storage OSD: Crucial MX200 and / or Seagate constellation disks (we take later care on it, if we now, how much space/io we need)
Memory: 16GB or 32GB DDR4
SAS controller: LSI Megaraid 9361-8i in JBOD (Raid0) or direct mode (if cache is used) with battery pack
Network: Intel X520-SR2
Switch: HP Aruba 2920-24 in stacked mode with (via stack module) and HP 10Gbit backplane module

All OSD nodes would get PVE 4.2 installed, but not used for hosting VMs.

The other three nodes are HW we already have. Some PSSC blades with same CPU and mainboard and up to 64GB ram. Only the 10Gbit interfaces are missing to connect to the CEPH cluster. But for testing 2x1GBit in LACP mode should be fine.
One question that comes into my mind: is it O.K to host VMs and mon daemon on the same physical host.

All OSD nodes are later connected via a LACP trunk (so we have 2x10Gbit) to the stacked switch so if one switch gets off, one link is up, through the other switch.

The main goal is to have a shared storage and replace our ISCSI setup in two racks. We don't have so much high I/O related VMs, just normal Debian VMs for our (Web) services and backups.
Also, we want to build everything redundant, so we can put hosts/switches in maintenance mode, or one can fail, without interrupting the productive environment (maybe slower, but not stopping)

Is there something wrong with our HW or are there some other suggestions?

Big update:

I put now our already in testing hardware here, that could be usefull for other:

We have now:

6 x Ceph Proxmox 4 (upgrade comes later)
5 x Proxmox 5

For all we have the same basics:

The hardware list:

* Chassis: 2HE 24 slot http://www.aicipc.com/en/productdetail/446
* Chassis Backplane: 12Gbit without expander
* Motherboard: Supermicro X10DRI with new HTML5 IPMI firmware
* CPU: Intel Xeon 2620v4 (or v6?) 2,1Ghz
* Ram: DDR4 64GB ECC buffered 2400Mhz
* System disk: 2 x Crucial MX300 250GB

Ceph has:
* Ceph SSD Pool: 6 x Samsung 850Evo 500Gb
* Ceph SATA Pool: 6 x WD Red 1TB ( only for logs/elasticsearch ...)
* HBA: LSI sas-9305-16i
* Ceph journal : Intel SSD DC P3700 400GB (cache for the SSDS)
* Network: Mellanox CX-4 two port 100Gb -> Ceph only

Proxmox has:
* Ceph network: Mellanox CX-4 two port 25Gb -> Ceph storage connection
* VM network: Intel dualport X520 SFP+
* Switch: 3 x Ubuiquity Edgeswitch 16-XG

Ceph Network:
* Switch: 2x Mellanox MSN2100 (100Gb) 12 port switch
* Cable: 4 x 100Gb -> 4 x 25Gb splitting cable
* Cable: 12 x 100Gb cable for Ceph nodes

The journal device is only for the SSDs and a single point of failure, because if the journal device dies, the whole node dies.
But we have the option to change that later.
The X10DRi has enough PCIe slots to extend with NVMe and HBA controller, if we fill the 12 free storage slots.
Important: The CX-4 must be assigned to CPU 2 to get the full speed, otherwise it reaches only ~ 25% and lower.

We created two Ceph pools:

* ssds -> For all VMs
* sata -> For logs and elastic search

Both are created with 3 replicas and 2048PGs.

We will test now all the details and handling

cu denny

PigLover · May 24, 2016

Assuming you are going with 3-way replicas (default) then I'd highly recommend adding a 4th - and perhaps 5th - OSD node. Same number of SSDs, just spread them over more nodes. This will significantly improve your outage resiliency.

With exactly the same number of OSD nodes as replicas you will have to deal with having "degraded" placement groups during any single-node outage (i.e., your cluster cannot return to a "healthy" state during the outage). You can avoid service impact from this by setting the "required" replicas to 2 or even 1, which will allow you to continue operation, but your playing with fire if there is a second concurrent outage. You also have a serious headache with planning routine maintenance of the OSD nodes.

I know Ceph's "best practices" recommend keeping the MON nodes isolated from the OSDs but this recommendation hearkens back to days of much weaker CPUs, expensive RAM and 1Gbe networks. You might consider going with 5 total nodes, running MON and OSD on all 5 nodes - spreading the 30 SSDs out as 6/node over 5 nodes rather than 10/node over 3 nodes. This gives you a cluster where any two nodes can be offline for maintenance or other reasons and still have resiliency against the failure of an additional node. And you saved the cost of one CPU+chassis. Horizontal growth from this point would just be adding additional OSD nodes.

spirit · May 25, 2016

Don't use crucial consummer ssd MX200, they are pretty bad to do sync write, which is needed by ceph journal (sometimes a simple hdd can be faster)

https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/

brad_mssw · May 25, 2016

You list a 12Gbps SAS controller, but the chassis has a 6Gbps SAS backplane. They changed the connector between 12Gbps and 6Gbps. I'm a huge supermicro fan ... you're not going to be happy when you try to put that solution together

I should also mention that JBOD is not the same as RAID0 as you seem to indicate, you definitely want JBOD mode and not RAID0!

Our equivalent setup in production we use a SuperMicro 216BE2C-R920LPB chassis and SuperMicro X10DRI-O motherboard, same RAID card you're looking at using. For SSDs, we use the Intel DC S3700 (S3710 is the current generation now), which are suitable for cache drives. Oh, and don't cheap out on RAM ... we run 256GB.

spirit · May 25, 2016

brad_mssw said:
Oh, and don't cheap out on RAM ... we run 256GB.

I'm using 64GB for 6 osd.

(Normally you need 2GB by osd, and a little bit more when datas is move in cluster, maybe 4-5gb by osd).
Of course, you can have more ram to have read buffer

Denny Fuchs · Jun 6, 2016

hi,

O.K thanks for all replys. I created a new list for the components with the SuperMicro things and with 6 OSDs, instead of 3. A questions, what comes into my mind: should we use a HW-RAID1 for the CEPH journal (with two SSD)? If the the SSD gets broken, than maybe the OSD goes down, right?

The other thing: I thinking to use 22 x normal 2.5" harddisk like the Seagate Enterprise 15K/Umin / 12Gbit / 300GB, or WD Red WD10JFCX / 6Gib 5.400Umin TB.
Any objections?

cu denny

PigLover · Jun 6, 2016

No real need to raid the journal that way. If an OSD goes down Ceph just recover and re-creates the placement groups on other OSDs.

Just make sure you have sufficient free space to cover for a failed OSD (or, if you are holding several OSDs jounal on a single SSD, that you have enough space to cover for all of these OSDs on a single SSD going "out" together).

Denny Fuchs · Jun 14, 2016

hi,

we had some discussions (a lot!) about the OSD disk drives (not journal drives):

Enterprise SSD (Seagate/Intel) vs. "consumer" SSD (Crucial / Samsung)
Enterprise HDD (Seagate enterprise SAS/15K) vs. "consumer" disks (HGST / WD / ...)

... or a mix of all of them per OSD ( OSD with SSD pro / OSD with consumer / OSD with standard rotating disks) vs. 8 slots with one kind of disk type (but then we would use all 24 slots ... I think)

I want to use enterprise but the disks are much more expensive .. the opposite is: we have six OSD with (first) 10 disk slots filled with disks from 24. So we have enough redundancy in vertical and horizontal to use "cheaper" disks and throw them away, if they are broken.
Also we have the discussion SSD (smaller, but lower engergy/heating and high read/write) vs. standard disks 1TB 7.200/Umin because of the nature from Ceph to use them in parallel for reading / writing.

What we know: we have also with 10 disks (one with 250GB per diskdrive) with six (-2 redundancy) OSDs more space we need actually . Also our services are not so heavy on I/O, only Zabbix. At the moment, all of our productive data runs from a Synology 815RS via NFS

Our next services depends a bit more on I/O: Elasticsearch and some Java things (ActiveMQ/Tomcat/...). So we have a very mix of all kind of services.

Any suggestions are welcome

El Tebe · Aug 17, 2017

PigLover said:
Assuming you are going with 3-way replicas (default) then I'd highly recommend adding a 4th - and perhaps 5th - OSD node. Same number of SSDs, just spread them over more nodes. This will significantly improve your outage resiliency.

With exactly the same number of OSD nodes as replicas you will have to deal with having "degraded" placement groups during any single-node outage (i.e., your cluster cannot return to a "healthy" state during the outage). You can avoid service impact from this by setting the "required" replicas to 2 or even 1, which will allow you to continue operation, but your playing with fire if there is a second concurrent outage. You also have a serious headache with planning routine maintenance of the OSD nodes.

I know Ceph's "best practices" recommend keeping the MON nodes isolated from the OSDs but this recommendation hearkens back to days of much weaker CPUs, expensive RAM and 1Gbe networks. You might consider going with 5 total nodes, running MON and OSD on all 5 nodes - spreading the 30 SSDs out as 6/node over 5 nodes rather than 10/node over 3 nodes. This gives you a cluster where any two nodes can be offline for maintenance or other reasons and still have resiliency against the failure of an additional node. And you saved the cost of one CPU+chassis. Horizontal growth from this point would just be adding additional OSD nodes.

I read a lot about this and when i getting see clearer, then I found this post.
Now, i have a lot of new questions about OSD and MON nodes...
So..

Background:

Now, I have six nodes (blades) in a Supermicro Microblade Sys, with an internal 2x10G network connection between all nodes. Every node have 32GB DDR4 ecc reg. rams, and eight core xeon intel cpus.

Each node have only 4 SATA ports

(+1 SATA DOM port).
(I have no raid or other controller, and theres no pci-e or other ports, there's no external storage server too)

My plan was:

I use three of the nodes to Ceph OSDs (only) and the other 3 to Proxmox(+MON).
(I'd like to run some centos/ubuntu VMs for mail, database, webserver/proxy, services.)

The most important in this config is the available minimal downtime - not the huge storage space - but i really like that if this 6 nodes produces a fair speed too.

So, compunding that post of Denny Fuchs - my questions are:
In this situation: which is more efficient?
A.) Three OSDs nodes AND 3 Proxmox/Mon nodes (like the Ceph's best practices)

B.) Four or five or six OSDs nodes (beside MONs on the first three node) AND each with two SSDs (beacuse of max 4 SATA ports, and i have to install the OS too..)

C.) Same like b) but the number of SSDs different.

D.) Or install proxmox to all nodes (6x) with another ceph disk configuration...
I'm really lost here.. Hope you can help me..

Really really thank you for all!

Denny Fuchs · Aug 17, 2017

hi,

El Tebe said:
So, compunding that post of Denny Fuchs - my questions are:
In this situation: which is more efficient?

I think there is only one question: do you want a Proxmox hypervisor cluster too? If so, than the only option is 3 x Ceph, 3 x Proxmox. I would not recommend to put them in one cluster of six, because if something goes realy wrong with the cluster, you have nothing. If you split them in two clusters, the chance to lose storage and hypervisor is lower. But that is my opinion. You can put the mon on the same hosts, like Ceph.

If you don't want a Proxmox cluster, I would prefer 4 Ceph nodes and two hypervisor, but than you lose the live migration feature.

cu denny

tom · Aug 17, 2017

Denny Fuchs said:
...

If you don't want a Proxmox cluster, I would prefer 4 Ceph nodes and two hypervisor, but than you lose the live migration feature.

cu denny

I assume you mean Proxmox VE HA, not live migration. live migration works also with 2 nodes.

El Tebe · Aug 17, 2017

Denny Fuchs said:
hi,

I think there is only one question: do you want a Proxmox hypervisor cluster too? If so, than the only option is 3 x Ceph, 3 x Proxmox. I would not recommend to put them in one cluster of six, because if something goes realy wrong with the cluster, you have nothing. If you split them in two clusters, the chance to lose storage and hypervisor is lower. But that is my opinion. You can put the mon on the same hosts, like Ceph.

If you don't want a Proxmox cluster, I would prefer 4 Ceph nodes and two hypervisor, but than you lose the live migration feature.

cu denny

Thank you

The 3x ceph + 3 proxmox version was my first idea.

El Tebe · Aug 17, 2017

1.

Denny Fuchs said:
hi,
You can put the mon on the same hosts, like Ceph.
cu denny

I saw somewhere a Ceph install recommendation that says: do not place MONs and OSDs on the same nodes.
I think I can place beside those to proxmox nodes, can i?

2.
And what do you think about the 4 sata drive slots / node problem?

2/1: Shoud I install OSDs to all 4 (intel datacenter) SSDs (1osd-1ssd) and the satadom for the OS? (thats a single point of failure..)

2/2: And the Proxmox cluster recommended disk setup (2 hdds + 2 ssds vs 4 ssd)?

(Of course "I can do", but I ask these for higer availability and reduce downtimes)

Denny Fuchs · Aug 17, 2017

tom said:
I assume you mean Proxmox VE HA, not live migration. live migration works also with 2 nodes.

Nope, I realy mean live migration with a healty 3 node Proxmox cluster. I was not aware, that Proxmox5 can now live migration without clustering, or did I missed something ?

cu denny

Denny Fuchs · Aug 17, 2017

hi,

@El Tebe

The CPUs are so powerfull and you don't have so much OSDs and nodes, that it shouldn't be a problem. It could be a problem, if you want to use also MDS for CephFS ... than maybe ... but you have to test it anyway.

For 2/1: I can't say anything, but if you create a pool with replica of 3, than one node can fail.
For 2/2: If you want only a few GBs than just use only SSDs. Than you can put also the journal on the same OSD or use Proxmox 5 with Ceph luminous which releases in a few weeks and use Bluestore, which kicks the journal "problem".

cu denny

El Tebe · Aug 17, 2017

Denny Fuchs said:
The CPUs are so powerfull and you don't have so much OSDs and nodes, that it shouldn't be a problem. It could be a problem, if you want to use also MDS for CephFS ... than maybe ... but you have to test it anyway.

For 2/1: I can't say anything, but if you create a pool with replica of 3, than one node can fail.
For 2/2: If you want only a few GBs than just use only SSDs. Than you can put also the journal on the same OSD or use Proxmox 5 with Ceph luminous which releases in a few weeks and use Bluestore, which kicks the journal "problem".

cu denny

Thank you Denny, now the whole picture is clearer

tom · Aug 18, 2017

Denny Fuchs said:
Nope, I realy mean live migration with a healty 3 node Proxmox cluster. I was not aware, that Proxmox5 can now live migration without clustering, or did I missed something ?

cu denny

I talk about a simple (non HA) two node cluster.

El Tebe · Nov 14, 2017

Denny Fuchs said:
hi,

@El Tebe

The CPUs are so powerfull and you don't have so much OSDs and nodes, that it shouldn't be a problem. It could be a problem, if you want to use also MDS for CephFS ... than maybe ... but you have to test it anyway.

For 2/1: I can't say anything, but if you create a pool with replica of 3, than one node can fail.
For 2/2: If you want only a few GBs than just use only SSDs. Than you can put also the journal on the same OSD or use Proxmox 5 with Ceph luminous which releases in a few weeks and use Bluestore, which kicks the journal "problem".

cu denny

Hi Denny,
I rethink all of this, get more and more info about ceph, so finally this is the last working "3+3 nodes" version:
- ceph three dedicated OSD nodes: every node have three OSD (1 sata SSD = 1 OSD) + 1 SSD for the OS
- three nodes for the Proxmox VE (HA cluster) and Ceph MON and ceph client too.

My sources for my decisions:

your prev. advices
Ceph official: Guide for ceph cluster planning (!)
Ceph official: COresident monitors/OSDs
Ceph official: Hardware recommendations
Manually mount a ceph storage pool in proxmox
Choosed enterprise (High Endurance Technology (HET)) SSDs: write endurance and comparsions
OS: EXT4 on SSD

Search

Search

[SOLVED] HW planning CEPH 3 x OSD cluster for PVE 4.2 / PVE 5.x

Denny Fuchs

Renowned Member

PigLover

Renowned Member

spirit

Distinguished Member

brad_mssw

Renowned Member

spirit

Distinguished Member

Denny Fuchs

Renowned Member

PigLover

Renowned Member

Denny Fuchs

Renowned Member

El Tebe

Renowned Member

Denny Fuchs

Renowned Member

tom

Proxmox Staff Member

El Tebe

Renowned Member

El Tebe

Renowned Member

Denny Fuchs

Renowned Member

Denny Fuchs

Renowned Member

El Tebe

Renowned Member

tom

Proxmox Staff Member

El Tebe

Renowned Member

We value your privacy