[SOLVED] HW planning CEPH 3 x OSD cluster for PVE 4.2 / PVE 5.x

Jan 21, 2016
hi,

we are planning a Ceph storage cluster to use with PVE 4.2, and after a lot of reading about what hardware we should use for our lab, we came up with the following:

The basic layout:
  • 3 x OSD nodes with 10 x consumer SSD and / or seagate spinning disks
  • 3 x nodes acting as MON and PVE host
Details per OSD node:
  • Asus Z10PR-D16
  • CPU Intel Xeon E5-2620v3
  • Chassis RSC-2AH
  • Storage SSD Cache: Seagate MLC 200GB (ST200FM0053)
  • Storage OSD: Crucial MX200 and / or Seagate Constellation disks (we will take care of that later, once we know how much space/IO we need)
  • Memory: 16GB or 32GB DDR4
  • SAS controller: LSI Megaraid 9361-8i in JBOD (Raid0) or direct mode (if cache is used) with battery pack
  • Network: Intel X520-SR2
  • Switch: HP Aruba 2920-24 in stacked mode (via stack module) with HP 10Gbit backplane module
All OSD nodes would get PVE 4.2 installed, but not used for hosting VMs.

The other three nodes are HW we already have: some PSSC blades with the same CPU and mainboard and up to 64GB RAM. Only the 10Gbit interfaces to connect to the Ceph cluster are missing, but for testing, 2x1Gbit in LACP mode should be fine.
One question that comes to mind: is it OK to host VMs and the MON daemon on the same physical host?

All OSD nodes will later be connected via an LACP trunk (so we have 2x10Gbit) to the stacked switches, so if one switch goes down, one link stays up through the other switch.

The main goal is to have shared storage and replace our iSCSI setup in two racks. We don't have many high-I/O VMs, just normal Debian VMs for our (web) services and backups.
Also, we want to build everything redundant, so we can put hosts/switches into maintenance mode, or one can fail, without interrupting the production environment (maybe slower, but not stopping).

Is there something wrong with our HW or are there some other suggestions?

Big update:

I am now adding the hardware we already have in testing, which could be useful for others:

We have now:

6 x Ceph nodes on Proxmox 4 (upgrade comes later)
5 x Proxmox 5 nodes

All of them share the same basics:

The hardware list:

* Chassis: 2U 24-slot http://www.aicipc.com/en/productdetail/446
* Chassis Backplane: 12Gbit without expander
* Motherboard: Supermicro X10DRI with new HTML5 IPMI firmware
* CPU: Intel Xeon E5-2620 v4 (or v6?) 2.1GHz
* RAM: 64GB DDR4 ECC buffered 2400MHz
* System disk: 2 x Crucial MX300 250GB

Ceph has:
* Ceph SSD pool: 6 x Samsung 850 Evo 500GB
* Ceph SATA pool: 6 x WD Red 1TB (only for logs/Elasticsearch ...)
* HBA: LSI SAS 9305-16i
* Ceph journal: Intel SSD DC P3700 400GB (cache for the SSDs)
* Network: Mellanox CX-4 dual-port 100Gb -> Ceph only

Proxmox has:
* Ceph network: Mellanox CX-4 dual-port 25Gb -> Ceph storage connection
* VM network: Intel dual-port X520 SFP+
* Switch: 3 x Ubiquiti EdgeSwitch 16-XG

Ceph Network:
* Switch: 2x Mellanox MSN2100 (100Gb) 12 port switch
* Cable: 4 x 100Gb -> 4 x 25Gb splitting cable
* Cable: 12 x 100Gb cable for Ceph nodes

The journal device is only for the SSDs and is a single point of failure: if the journal device dies, all SSD OSDs on that node die with it.
But we have the option to change that later.
The X10DRi has enough PCIe slots to extend with NVMe and HBA controller, if we fill the 12 free storage slots.
Important: the CX-4 must sit in a PCIe slot attached to CPU 2 to get full speed; otherwise it reaches only ~25% or less.
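
In case someone wants to verify this on their own box: the NUMA node a NIC hangs off can be read straight from sysfs. A minimal sketch on Linux, assuming an example interface name (take the real one from `ip link`):

```python
#!/usr/bin/env python3
# Minimal sketch: print the NUMA node a NIC's PCIe device is attached to.
# "enp175s0" is only an example interface name; replace it with your own.
from pathlib import Path

def nic_numa_node(iface: str) -> int:
    """Return the NUMA node behind `iface` (-1 means no NUMA information)."""
    return int(Path(f"/sys/class/net/{iface}/device/numa_node").read_text())

print(nic_numa_node("enp175s0"))  # e.g. 1 -> slot wired to the second socket
```

If that prints the wrong socket, moving the card to a slot on the other CPU (or at least pinning IRQs/OSD threads to the matching node) avoids the cross-socket penalty we saw.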

We created two Ceph pools:

* ssds -> For all VMs
* sata -> For logs and elastic search

Both are created with 3 replicas and 2048 PGs.
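
For anyone redoing the PG math: the common rule of thumb is to target roughly 100 PGs per OSD, divide by the replica count and round to a power of two. A small sketch; the OSD counts are assumptions based on our layout (6 nodes with 6 SSDs and 6 SATA disks each), not an exact reproduction of how the 2048 was chosen:

```python
import math

# Rule-of-thumb PG sizing (the classic pgcalc heuristic): aim for ~100 PGs
# per OSD, divide by the replica count, round to the nearest power of two.
def pg_count(osds: int, replicas: int = 3, target_per_osd: int = 100) -> int:
    return 2 ** round(math.log2(osds * target_per_osd / replicas))

print(pg_count(36))  # one pool over 36 OSDs (e.g. the ssds pool) -> 1024
print(pg_count(72))  # a pool spread over all 72 OSDs             -> 2048
```

Keep in mind that (at the time of writing) pg_num can only be increased, not decreased, and Ceph warns with "too many PGs per OSD" if the combined pools overshoot, so it is worth doing this estimate before creating the pools.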

We will now test all the details and handling :)


cu denny
 
Assuming you are going with 3-way replicas (default) then I'd highly recommend adding a 4th - and perhaps 5th - OSD node. Same number of SSDs, just spread them over more nodes. This will significantly improve your outage resiliency.

With exactly the same number of OSD nodes as replicas you will have to deal with having "degraded" placement groups during any single-node outage (i.e., your cluster cannot return to a "healthy" state during the outage). You can avoid service impact from this by setting the "required" replicas (min_size) to 2 or even 1, which will allow you to continue operation, but you're playing with fire if there is a second concurrent outage. You also have a serious headache when planning routine maintenance of the OSD nodes.

I know Ceph's "best practices" recommend keeping the MON nodes isolated from the OSDs, but this recommendation hearkens back to days of much weaker CPUs, expensive RAM and 1GbE networks. You might consider going with 5 total nodes, running MON and OSD on all 5 nodes - spreading the 30 SSDs out as 6/node over 5 nodes rather than 10/node over 3 nodes. This gives you a cluster where any two nodes can be offline for maintenance or other reasons while you still have resiliency against the failure of an additional node. And you save the cost of one CPU+chassis. Horizontal growth from this point would just be adding additional OSD nodes.
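
To make that concrete, here is a tiny back-of-the-envelope illustration (plain Python, not Ceph code; the node counts are just examples): with the failure domain at host level, a PG lost with a node can only be re-created elsewhere if enough hosts survive to hold all replicas.

```python
# Illustration of the argument above, not Ceph code: with failure domain =
# host and pool size = 3, the cluster can only return to HEALTH_OK after an
# outage if the surviving hosts can still hold all three replicas of each PG.
def can_reheal(total_nodes: int, nodes_down: int, size: int = 3) -> bool:
    return total_nodes - nodes_down >= size

for nodes in (3, 4, 5):
    for down in (1, 2):
        state = "can re-heal" if can_reheal(nodes, down) else "stays degraded"
        print(f"{nodes} nodes, {down} down: {state}")
# 3 nodes, 1 down: stays degraded   <- the 3-node case described above
# 5 nodes, 2 down: can re-heal      <- room for maintenance plus a failure
```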
 
You list a 12Gbps SAS controller, but the chassis has a 6Gbps SAS backplane. They changed the connector between 12Gbps and 6Gbps. I'm a huge supermicro fan ... you're not going to be happy when you try to put that solution together :) I should also mention that JBOD is not the same as RAID0 as you seem to indicate, you definitely want JBOD mode and not RAID0!

For our equivalent setup in production we use a SuperMicro 216BE2C-R920LPB chassis and a SuperMicro X10DRI-O motherboard, with the same RAID card you're looking at using. For SSDs, we use the Intel DC S3700 (the S3710 is the current generation now), which are suitable as cache drives. Oh, and don't cheap out on RAM ... we run 256GB.
 
hi,

OK, thanks for all the replies. I created a new list of components with the SuperMicro parts and with 6 OSD nodes instead of 3. A question that comes to mind: should we use a HW RAID1 for the Ceph journal (with two SSDs)? If the SSD breaks, then the OSDs behind it go down, right?

The other thing: I'm thinking of using 22 x normal 2.5" hard disks like the Seagate Enterprise 15K rpm / 12Gbit / 300GB, or the WD Red WD10JFCX / 6Gbit / 5,400rpm / 1TB.
Any objections?

cu denny
 
No real need to RAID the journal that way. If an OSD goes down, Ceph just recovers and re-creates the placement groups on other OSDs.

Just make sure you have sufficient free space to cover for a failed OSD (or, if you are holding several OSDs' journals on a single SSD, that you have enough space to cover for all of those OSDs on a single SSD going "out" together).
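
A rough sketch of that free-space check (every figure below is an assumed example for illustration - per-OSD size, counts, fill level - not sizing advice): estimate how full the surviving OSDs get if one journal SSD takes all of its OSDs out at once.

```python
# Rough free-space check for the scenario above; all numbers are assumptions.
osd_size_tb      = 0.5    # capacity per OSD
osds_per_node    = 6
nodes            = 6
replicas         = 3
osds_per_journal = 3      # OSDs sharing one journal SSD
target_fill      = 0.70   # how full you plan to run the cluster

raw_tb    = osd_size_tb * osds_per_node * nodes
usable_tb = raw_tb / replicas
used_raw  = raw_tb * target_fill

# Worst case: one journal SSD dies and takes all of its OSDs "out" together.
remaining_raw = raw_tb - osd_size_tb * osds_per_journal

print(f"usable with size={replicas}: {usable_tb:.1f} TB")
print(f"fill level after losing one journal SSD: {used_raw / remaining_raw:.0%}")
# Keep that last figure comfortably below Ceph's nearfull/full ratios
# (0.85 / 0.95 by default), or recovery will stall before it finishes.
```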
 
hi,

we had some discussions (a lot!) about the OSD disk drives (not journal drives):

Enterprise SSD (Seagate/Intel) vs. "consumer" SSD (Crucial / Samsung)
Enterprise HDD (Seagate enterprise SAS/15K) vs. "consumer" disks (HGST / WD / ...)

... or a mix of all of them per OSD node (nodes with pro SSDs / nodes with consumer SSDs / nodes with standard rotating disks) vs. 8 slots per disk type (but then we would use all 24 slots ... I think)

I would like to use enterprise disks, but they are much more expensive. On the other hand: we have six OSD nodes with (initially) 10 of the 24 disk slots filled. So we have enough redundancy, vertically and horizontally, to use "cheaper" disks and simply throw them away when they break.
We also discussed SSDs (smaller, but less energy/heat and high read/write) vs. standard 1TB 7,200rpm disks, given that Ceph by nature reads from and writes to them in parallel.

What we know: even with only 10 disks of 250GB each per node and six OSD nodes (minus 2 for redundancy), we have more space than we actually need. Also, our services are not that heavy on I/O, only Zabbix. At the moment, all of our production data runs from a Synology 815RS via NFS :)
Our upcoming services depend a bit more on I/O: Elasticsearch and some Java things (ActiveMQ/Tomcat/...). So we have a real mix of all kinds of services.

Any suggestions are welcome :)
 
Assuming you are going with 3-way replicas (default) then I'd highly recommend adding a 4th - and perhaps 5th - OSD node. [...] You might consider going with 5 total nodes, running MON and OSD on all 5 nodes [...]

I read a lot about this, and just when things were getting clearer, I found this post.
Now I have a lot of new questions about OSD and MON nodes...
So..

Background:

Now, I have six nodes (blades) in a Supermicro MicroBlade system, with an internal 2x10G network connection between all nodes. Every node has 32GB DDR4 ECC reg. RAM and an eight-core Intel Xeon CPU.

Each node has only 4 SATA ports :( (+1 SATA DOM port).
(I have no RAID or other controller, there are no PCIe or other ports, and there's no external storage server either.)

My plan was:

To use three of the nodes for Ceph OSDs (only) and the other 3 for Proxmox (+MON).
(I'd like to run some CentOS/Ubuntu VMs for mail, database, webserver/proxy, and other services.)

The most important thing in this config is minimal downtime - not huge storage space - but I would really like these 6 nodes to deliver fair speed too.

So, building on that post from Denny Fuchs - my questions are:
In this situation, which is more efficient?
A.) Three OSD nodes AND 3 Proxmox/MON nodes (like Ceph's best practices)

B.) Four, five, or six OSD nodes (besides MONs on the first three nodes), each with two SSDs (because of the maximum of 4 SATA ports, and I have to install the OS too..)

C.) Same as B) but with a different number of SSDs.

D.) Or install Proxmox on all nodes (6x) with another Ceph disk configuration...
I'm really lost here.. Hope you can help me..

Really really thank you for all!
 
hi,
So, building on that post from Denny Fuchs - my questions are:
In this situation, which is more efficient?

I think there is only one question: do you want a Proxmox hypervisor cluster too? If so, then the only option is 3 x Ceph, 3 x Proxmox. I would not recommend putting them in one cluster of six, because if something goes really wrong with the cluster, you have nothing. If you split them into two clusters, the chance of losing both storage and hypervisor is lower. But that is my opinion. You can put the MONs on the same hosts as Ceph.

If you don't want a Proxmox cluster, I would prefer 4 Ceph nodes and two hypervisors, but then you lose the live migration feature.

cu denny
 
...

If you don't want a Proxmox cluster, I would prefer 4 Ceph nodes and two hypervisors, but then you lose the live migration feature.

cu denny

I assume you mean Proxmox VE HA, not live migration. Live migration also works with 2 nodes.
 
hi,


I think there is only one question: do you want a Proxmox hypervisor cluster too? If so, then the only option is 3 x Ceph, 3 x Proxmox. [...]

Thank you

The 3x ceph + 3 proxmox version was my first idea.
 
1.
hi,
You can put the MONs on the same hosts as Ceph.
cu denny

I saw somewhere a Ceph install recommendation that says: do not place MONs and OSDs on the same nodes.
I think I can place the MONs on the Proxmox nodes instead, can't I?

2.
And what do you think about the problem of only 4 SATA drive slots per node?

2/1: Should I install OSDs on all 4 (Intel Datacenter) SSDs (1 OSD per SSD) and use the SATA DOM for the OS? (that's a single point of failure..)

2/2: And what is the recommended disk setup for the Proxmox cluster nodes (2 HDDs + 2 SSDs vs. 4 SSDs)?

(Of course "I can do", but I ask these for higer availability and reduce downtimes)
 
hi,

@El Tebe

The CPUs are powerful enough, and you don't have that many OSDs and nodes, so it shouldn't be a problem. It could become a problem if you also want to use MDS for CephFS ... then maybe ... but you have to test it anyway.

For 2/1: I can't say much, but if you create a pool with a replica count of 3, then one node can fail.
For 2/2: If you only need a few GB, then just use SSDs. Then you can also put the journal on the same OSD, or use Proxmox 5 with Ceph Luminous, which will be released in a few weeks, and use BlueStore, which removes the journal "problem".

cu denny
 
The CPUs are powerful enough, and you don't have that many OSDs and nodes, so it shouldn't be a problem. [...]

Thank you Denny, now the whole picture is clearer ;)
 
Nope, I really mean live migration with a healthy 3-node Proxmox cluster. I was not aware that Proxmox 5 can now do live migration without clustering, or did I miss something?

cu denny

I'm talking about a simple (non-HA) two-node cluster.
 
The CPUs are powerful enough, and you don't have that many OSDs and nodes, so it shouldn't be a problem. [...]
Hi Denny,
I rethought all of this and gathered more and more info about Ceph, so finally this is the last working "3+3 nodes" version:
- Ceph: three dedicated OSD nodes; every node has three OSDs (1 SATA SSD = 1 OSD) + 1 SSD for the OS
- three nodes for Proxmox VE (HA cluster), the Ceph MONs, and the Ceph clients.

My sources for my decisions:
 
