Hardware/Concept for Ceph Cluster

chris_lee

Hello,

we are using Proxmox with local storage on a small 3-node cluster. Now we are planning to set up a 4-node cluster with network storage (Ceph), live migration, HA and snapshot functionality. We already have some hardware lying around from dev projects, and I would like to get some ideas on how to use it in the most efficient manner.

We have the following hardware:
- redundant 10G network
- 4 nodes with dual Xeon 2650, 64 GB RAM, 2x 10G Ethernet, 2x 1 Gbit Ethernet, 6x 2.5" drive bays
- 8x Samsung SM863 Enterprise (240 GB)
- 16x Crucial MX300 (750 GB)
- 8x WD Red NAS SATA drives (1 TB)

We don't need to max out the space; performance is more important. My question is how to use the drives most efficiently with respect to performance and durability. Does it make sense to separate the metadata onto the Samsung enterprise disks, or just leave it on the Crucial disks? I was also wondering if it would make sense to add an HP Turbo Drive NVMe.

Any suggestions?

Thanks
Chris
 
Using the hardware you have:

Each server:

2x SM863
4x MX300

On each SM863, create:

1 partition for Proxmox (equal size, 80 GB)
2 journal partitions (10 GB each, so 4 journal partitions per node in total)

ZFS-mirror the first partition of each SM863 during the Proxmox installer, then use the remaining 4 journal partitions, one for each MX300 OSD.
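A rough sketch of what that could look like on the command line (device names and partition numbers are assumptions for illustration, and the exact pveceph syntax may differ between PVE versions):

Code:
# Assumed: /dev/sda and /dev/sdb are the SM863s (OS already installed as a ZFS mirror
# on their first partitions by the installer), /dev/sdc../dev/sdf are the MX300 OSDs.

# Two 10 GB journal partitions per SM863 (Ceph journal partition type GUID):
sgdisk -n 4:0:+10G -t 4:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sda
sgdisk -n 5:0:+10G -t 5:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sda
sgdisk -n 4:0:+10G -t 4:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdb
sgdisk -n 5:0:+10G -t 5:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdb

# One OSD per MX300, each with its journal on one of the SM863 partitions:
pveceph createosd /dev/sdc -journal_dev /dev/sda4
pveceph createosd /dev/sdd -journal_dev /dev/sda5
pveceph createosd /dev/sde -journal_dev /dev/sdb4
pveceph createosd /dev/sdf -journal_dev /dev/sdb5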

This will allow you to lose one SM863 and only lose 2 OSDs, while the OS continues to operate while you replace the failed SM863 and rebuild the 2 failed OSDs.

If you run a Ceph replication of 3 you will have around 4 TB usable. Make sure the CRUSH map is set to use HOST instead of OSD as the failure domain for replication, meaning you could lose a full OSD or host and still have a replication of 2.
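A quick way to check that (a sketch; "ceph-ssd" is just an example pool name) is to look at the CRUSH rule and set the pool size accordingly:

Code:
# The default replicated rule should contain "step chooseleaf firstn 0 type host":
ceph osd crush rule dump

# Replication 3 with a minimum of 2 copies for I/O:
ceph osd pool set ceph-ssd size 3
ceph osd pool set ceph-ssd min_size 2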
 
Thanks for your reply. I've read in the Ceph tutorials that it is not suggested to share journal disks with the OS, but your proposal sounds reasonable. Should I run a monitor for each node?
 
Thanks for your reply. I've read in the Ceph tutorials that it is not suggested to share journal disks with the OS, but your proposal sounds reasonable.
Hi,
I guess your OS does not need that much IO (it depends on whether you use local storage for VMs, and on the logging from the ceph-mon).
Should I run a monitor for each node?
No!
An odd number is needed, and three mons are enough.
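A minimal sketch, assuming you pick three of the four nodes for the monitors (the pveceph sub-command name may differ slightly between PVE versions):

Code:
# On each of the three chosen nodes:
pveceph createmon

# Verify that all three mons are in quorum:
ceph -s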

Udo
 
Ok, I see. Local storage is not planned on the OS disk(s). I was thinking about using NVMe for local storage, replicated via DRBD if needed.
Are there any special performance tweaks to take care of when implementing the above-mentioned setup? Can I limit the network usage so that the Proxmox cluster communication will stay stable? I have 2x 10G bonded and uplinked to 2 switches (IBM Blade Switch G8124). The 2x 1 Gbit ports are planned for the redundant outgoing VM traffic.

Thanks
Chris
 
[...]
Are there any special performance tweaks to take care of when implementing the above-mentioned setup? Can I limit the network usage so that the Proxmox cluster communication will stay stable? I have 2x 10G bonded and uplinked to 2 switches (IBM Blade Switch G8124). The 2x 1 Gbit ports are planned for the redundant outgoing VM traffic.

Thanks
Chris


I typically use Open vSwitch for my networking needs on Proxmox. See here:
https://pve.proxmox.com/wiki/Open_vSwitch

Unless you have hardware that allows you to properly apply QoS while using bonding (in which case I'd bond 2x10G for Ceph and 2x1G for Proxmox), I'd switch to poor man's QoS.

That basically means separating your nodes and links via separate switches and/or separate subnets (again depending on your networking gear); a sample config follows the list below:

1x10G for Ceph Public Network on 10.1.1.x/24
1x10G for Ceph Cluster Network on 10.2.2.x/24
1x1G for Proxmox Cluster communication on 10.3.3.x/24
1x1G for Proxmox Public Communication (this is where your VMs communicate with your clients) on 10.4.4.x/24
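A sketch of what that could look like in /etc/network/interfaces (interface names and host addresses are assumptions, adjust to your hardware):

Code:
auto eth0
iface eth0 inet static
    address 10.1.1.11
    netmask 255.255.255.0      # Ceph Public

auto eth1
iface eth1 inet static
    address 10.2.2.11
    netmask 255.255.255.0      # Ceph Cluster

auto eth2
iface eth2 inet static
    address 10.3.3.11
    netmask 255.255.255.0      # Proxmox cluster (corosync)

auto vmbr0
iface vmbr0 inet static
    address 10.4.4.11
    netmask 255.255.255.0
    bridge_ports eth3
    bridge_stp off
    bridge_fd 0                # Proxmox Public / VM traffic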
 
The switches I use (IBM Blade Switches G8124) basically understand QoS and LACP. I want to keep the network redundant, so I would have to uplink 2x 1G and 2x 10G ports to both switches, which limits the possibility of separating the network as you proposed. In the past we had bad experiences with switch failures and their effect on the cluster. Doing the networking on dedicated hardware will be more performant than doing it with openvswitch, I guess. Nevertheless OpenVSwitch is really interesting and powerful.
 
The switches I use (IBM Blade Switches G8124) basically understand QoS and LACP. I want to keep the network redundant, so I would have to uplink 2x 1G and 2x 10G ports to both switches, which limits the possibility of separating the network as you proposed. In the past we had bad experiences with switch failures and their effect on the cluster. Doing the networking on dedicated hardware will be more performant than doing it with openvswitch, I guess. Nevertheless OpenVSwitch is really interesting and powerful.
Hi,
you should be able to create bonds with openvswitch across 10G + 1G which work with the G8124 (now Lenovo) and a 1G switch (like a G8052).

Udo
 
[...]Doing the networking on dedicated hardware will be more performant than doing it with openvswitch, I guess. Nevertheless OpenVSwitch is really interesting and powerful.

Just a side note (because I feel this might have gotten lost while reading the openvswitch link I provided):
You use openvswitch inside Proxmox instead of the native Linux bridging.
You do not need to run openvswitch on dedicated hardware (i.e. as a hardware-switch replacement).

Basically what you do is assign your network interfaces to an OVS-based bridge. Then you assign so-called OVS_IntPorts to it, e.g. 10.1.1.x/24 on VLAN 12, 10.2.3.y/16 on VLAN 10 and 192.168.3.119/24 (no VLAN). These in turn pass their traffic via the bridge to the NICs you have attached to said bridge (regular or LACP'ed).
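In /etc/network/interfaces terms that could look roughly like this (a minimal sketch in the style of the wiki page linked above; NIC name, port name and address are illustrative):

Code:
auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports eth0 cephpub

allow-vmbr0 eth0
iface eth0 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr0

# Tagged IntPort, e.g. 10.1.1.x/24 on VLAN 12:
allow-vmbr0 cephpub
iface cephpub inet static
    address 10.1.1.5
    netmask 255.255.255.0
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=12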

Hi,
you should be able to create bonds with openvswitch across 10G + 1G which work with the G8124 (now Lenovo) and a 1G switch (like a G8052).

Udo
The switches I use (IBM Blade Switches G8124) basically understand QoS and LACP. I want to keep the network redundant, so I would have to uplink 2x 1G and 2x 10G ports to both switches, which limits the possibility of separating the network as you proposed.[...]

As Udo rightly pointed out, you can do that with Proxmox + openvswitch.




I'd use balance-tcp for the following reason:
balance-tcp
Balances flows among slaves based on L2, L3, and L4 protocol information such as destination MAC address, IP address, and TCP port.
Why is that interesting? Because every MON and every OSD gets assigned a separate port on the corresponding network (Ceph Public/Cluster). You'd be able to max out your 10G lines when you're doing an OSD or node backfill after you've encountered an issue, and thereby have degraded cluster performance for a shorter amount of time.
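As a quick sanity check (assuming the OVS bond is named bond0), you can inspect how flows are being hashed across the slaves and whether LACP negotiated correctly:

Code:
ovs-appctl bond/show bond0
ovs-appctl lacp/show bond0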



Why do I stress working QoS so much?
Because corosync (the Proxmox cluster) sometimes throws a fit when the network it runs on gets congested, and the Ceph cluster network (i.e. where your OSDs do their data replication) can get highly congested. When you're not backfilling and are just reading data, you can read at >=10G incoming to the VM (if your Ceph cluster config supports these high numbers).

You say you need the network to fail over on either the 1G switches or the 10G switches, i.e. being able to run on 10G with a 1G backup link.
If QoS works (and only if it works properly) you might also be able to do a LACP config of 2x10G and 1G on the same OVS bridge. That way you can maximize your throughput.

What we do at work is basically give the Proxmox cluster network absolute priority, give the Ceph cluster network the lowest priority, and then have Proxmox Public and Ceph Public at the same priority.


We also have guaranteed and burstable thresholds, but AFAIR this is due to our software-defined network (SDN) solution. We also operate large numbers of OSDs per node (60-120) with larger link capacity (2x10G dedicated to Proxmox if it runs on the same node, and then either 4x10G, 2x40G or 2x100G for Ceph, depending on the OSD count and the OSD type - as in HDD, SSD or NVMe - used on the node), so that is not something you should necessarily aim for.

Why am I mentioning this? Because that way (LACP on all links on the same OVS bridge) you can maximize your link utilization, which in turn increases performance (once you get to a certain number of hosts x OSDs), and you also get a (sort-of) failover (unless I misunderstood and you have multiple 10G switches). Basically, if your 10G switch fails you are running on 2x1G. The downside is you need watertight QoS, preferably on a VLAN or subnet basis.

I'm not very well-versed in the hardware specifics of network switches, however - that's what my company has network guys for :p - so suggestions from others are your best bet on those.
 
Thanks for pointing out the OVS usage.

Why am I mentioning this? Because that way (LACP on all links on the same OVS bridge) you can maximize your link utilization, which in turn increases performance (once you get to a certain number of hosts x OSDs), and you also get a (sort-of) failover (unless I misunderstood and you have multiple 10G switches). Basically, if your 10G switch fails you are running on 2x1G. The downside is you need watertight QoS, preferably on a VLAN or subnet basis.

I think you got me wrong. I have 2 of the 10G switches. The switches are in stacking mode (connected via 2x 10G DAC). The nodes are connected via 2x 10G (DAC) and 2x 1G (transceivers) to both switches. I planned to use the redundant 10G links for cluster stuff and the 1G links for outgoing VM traffic.
 
That should actually work no problem.

AFAIK you should be able to just LACP the 10G links in balance-tcp mode and assign them to OVS bridge vmbr0. Add OVS IntPorts for the following networks:
  • Ceph_Public (jumbo frames are your friend here)
  • Ceph_Cluster (jumbo frames are your friend here)
  • Proxmox_Cluster
Make sure you prioritize Proxmox cluster communication (by subnet or VLAN).

Then LACP the 1G links with balance-tcp, assign them to OVS bridge vmbr1, and assign the Proxmox-Public OVS-based IntPorts to said bridge.
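Put together, /etc/network/interfaces could look roughly like this (a sketch only, following the style of the PVE Open vSwitch wiki page linked above; interface names, VLAN tags and addresses are assumptions, and the QoS/prioritization itself still has to be configured on the switches):

Code:
allow-vmbr0 bond0
iface bond0 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_bonds eth2 eth3        # the 2x 10G ports
    ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast
    mtu 9000

auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 cephpub cephclu pmxclu
    mtu 9000

allow-vmbr0 cephpub
iface cephpub inet static
    address 10.1.1.11
    netmask 255.255.255.0
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=100
    mtu 9000                   # jumbo frames for Ceph Public

allow-vmbr0 cephclu
iface cephclu inet static
    address 10.2.2.11
    netmask 255.255.255.0
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=101
    mtu 9000                   # jumbo frames for Ceph Cluster

allow-vmbr0 pmxclu
iface pmxclu inet static
    address 10.3.3.11
    netmask 255.255.255.0
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=102        # corosync - prioritize this VLAN/subnet

allow-vmbr1 bond1
iface bond1 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr1
    ovs_bonds eth0 eth1        # the 2x 1G ports
    ovs_options bond_mode=balance-tcp lacp=active

auto vmbr1
allow-ovs vmbr1
iface vmbr1 inet manual
    ovs_type OVSBridge
    ovs_ports bond1            # attach the VMs' vNICs to vmbr1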

You should have 20G total pushable bandwidth, with 10G max per flow. If one switch fails your max total pushable bandwidth will be 10G. Same thing basically for your 1G switches.

The big and important thing is:
your switches need to support LACP across multiple stacked switches. If they only support LACP on a single switch, then this config will not work.
 
Sounds good. I'll try this this week and figure out if LACP is supported in combination with stacking.
 
I tried to set up the machine yesterday and stumbled over a problem. I am not familiar with ZFS, so please be patient :)
During the installation process I cannot choose a max size value or partition the drive. Do I have to prepare the partitions before installation, or do I have to shrink the pool after installation in order to get my 3 partitions out of each drive?
 
No, the disks are hooked up to the AHCI controller of the mainboard. In our old cluster we had LSI 9260 cards (we still have them), but from what I've read "real RAID cards" are not recommended, as they add another single point of failure and it is not recommended to present RAID volumes to Ceph.
 
I did manage to verify my switches: they do not support stacking, but they do support vLAG and Virtual Fabric.
 
I set up the system as proposed and did some benchmarks. Maybe someone can tell me if the results I get are reasonable or if there is something that can be tuned.

Code:
rados -p ceph-ssd bench 100 write --no-cleanup

Code:
Total writes made:      15168
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     606.008
Stddev Bandwidth:       124.917
Max bandwidth (MB/sec): 840
Min bandwidth (MB/sec): 336
Average IOPS:           151
Stddev IOPS:            31
Max IOPS:               210
Min IOPS:               84
Average Latency(s):     0.105604
Stddev Latency(s):      0.112845
Max latency(s):         1.08144
Min latency(s):         0.0233331

and

Code:
rados -p ceph-ssd bench 10 seq

Code:
Total reads made:     3712
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1477.18
Average IOPS          369
Stddev IOPS:          12
Max IOPS:             392
Min IOPS:             347
Average Latency(s):   0.0426312
Max latency(s):       0.220388
Min latency(s):       0.00337663

I pretty much used the standard Proxmox ceph settings.

Thanks in advance
 
That is with 4 nodes and 10G for the Ceph public network + 10G for the Ceph cluster network, right?

How many journal SSDs (SM863 240GB) and disks (which type?) do you have configured per node?
What pool replication size did you end up using?
 
I use 4 nodes. The Proxmox OS is on 2 SM863 (dedicated). I decided to use one NVMe drive (Intel P3600) per node for journaling and 3 OSDs per node (Crucial MX300). Currently only node 1 has an NVMe disk (the other 3 are to come); in the end every node will have an NVMe drive for journaling. For testing I am on a 10G link for Ceph (dedicated) and 10G for Proxmox (dedicated); later I will do all the network bonding magic. Replication is 2 on a pool with 1024 PGs. Replication of 3 lowers the write performance and lifts the read performance. Journal size is set to 10G.
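For reference, roughly how those settings map to the config (a sketch; "ceph-ssd" is the pool name from the benchmarks above, and the journal size is the filestore setting in ceph.conf):

Code:
# /etc/pve/ceph.conf
[osd]
     osd journal size = 10240        # 10G journal (value in MB)

# pool with 1024 placement groups and replication 2:
ceph osd pool create ceph-ssd 1024 1024 replicated
ceph osd pool set ceph-ssd size 2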
 
Then these numbers make more sense. (You are doing replication 2 with the host failure domain, right? Not the OSD failure domain?)

Three of your four nodes use the same SSD for OSD + journal. I am guessing you are getting about 150 MB/s-ish write performance out of these Crucial MX300s.
The NVMe journal in your 4th node is only having a minor impact, since the other three nodes are likely slowing it down.


I'd do a real-life benchmark:
1. Set up a VM and assign it X GB of RAM. Benchmark it with a data set of (X*2) GB. Do read + write (see the sketch below for an example).
2. Clone it to all 4 nodes. Benchmark with all 4 VMs running the benchmark in parallel.
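One way to run such a benchmark inside the VM is fio (just an example tool, not prescribed above; adjust --size to roughly twice the VM's RAM, here assuming a VM with 16 GB RAM):

Code:
fio --name=randrw --filename=/root/fio.test --size=32G --direct=1 \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=300 --time_based --group_reporting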



NVMe disk question
(I assume you chose the 400GB version and not the 800GB version, right? They have vastly different write specs.)
May I ask how you came to choose the Intel P3600? (2100 MB/s read vs 550 MB/s write)
Why did you not go for the P3700? (2700 MB/s read vs 1200 MB/s write). The markup is only 27%, but you get 118% more write performance and 236% more TBW. Win-win if you ask me.

This is especially important, as the Crucial MX300 750GB model already does 510 MB/s. So the P3600 (400GB model) will most likely not speed this up drastically (3x journals on the P3600 vs 1x on 3 separate MX300s). (It is a good read cache device, though.) Whereas the P3700 would give you an edge. If using the P3600, you might as well scrap it and buy another MX300. That would probably give you more performance overall (and be about 30% cheaper than your current solution).
 
