Hardware/Concept for Ceph Cluster

Yep, I did a replica of 2 with host as the failure domain. I double-checked and Ceph distributed the disk images of the test machines across different hosts. I did benchmarks within the VMs and get about 1.2 GB/s read and 280 MB/s write. When running them in parallel the read performance goes down while write stays more or less identical.
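A sequential test of that kind inside a VM could, for example, look like this (file path, size and queue depth are just placeholders, not the exact commands I ran):

Code:
# sequential read and write against a test file bigger than the VM's RAM
fio --name=seq-read --filename=/root/fio-test --size=16G --bs=4M --rw=read --direct=1 --ioengine=libaio --iodepth=16
fio --name=seq-write --filename=/root/fio-test --size=16G --bs=4M --rw=write --direct=1 --ioengine=libaio --iodepth=16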

Regarding the NVMe drives:
You're right, the P3700 are more attractive; it has been a matter of cost ;-)
I thought the P3600 (400G) drives would bring less latency, that this would be the game winner for journaling, and that they bring more TBW than the P3500. I have the chance to get some Fusion-io drives from a colleague. Those would be better from the write-performance point of view (and bigger = 1.2 TB). When doing dd with bs=4M the Crucials reach 250 MB/s on XFS (single drive, not in Ceph), which is definitely better than spinners.

Thought about buying another MX300. Do you mean it for journaling or as another Ceph OSD? If you mean journaling, probably the S3700 with 200G would be a better choice. They bring HET and probably better write speed. What do you think?
 
Yep, I did a replica of 2 with host as the failure domain.

Yeah, that means you can cope with one host going dark on you.

In my mind that is smarter than using OSD as the failure domain (which is only interesting when you have loads of OSDs per node).

I thought the P3600 (400G) drives would bring less latency

Yes .. and no ..

According to the Intel specs on the P3600 you are looking at
0.020-0.030 ms (milliseconds) latency on these drives.

According to storagereview.com, on the Crucial MX300s you are looking at
0.034 ms latency in their benchmark.

Now, add the latencies from host to host on top of that and this advantage gets watered down a bit.
When doing dd with bs=4M the Crucials reach 250 MB/s on XFS (single drive, not in Ceph), which is definitely better than spinners.

Remember, when doing tests:
for the P3600 as journal, you'd need to do 3 dd's at the same time (journal for 3 OSDs) - see the sketch below.
for the Crucial MX300s you'd need to do 2 dd's at a time (journal + OSD write).
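A quick-and-dirty way to fire off such parallel dd's (mount point and sizes here are placeholder assumptions; oflag=direct bypasses the page cache):

Code:
# simulate 3 journals writing to the P3600 at the same time
for i in 1 2 3; do
  dd if=/dev/zero of=/mnt/p3600/ddtest-$i bs=4M count=2048 oflag=direct &
done
wait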

The single dd value is interesting if you are thinking of using the SSDs as OSDs and the P3600 as an OSD as well.
3x 250 MB/s for those 3 MX300s means that your P3600 (400GB) is likely bottlenecking you at 2/3 of the speed (you can confirm this by running 3 dd's at the same time).
Better yet, benchmark as shown here:
https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/
with
--numjobs=3 for the P3600, and then compare it to --numjobs=1 (multiplied by 3) on the SSDs.

In any case,
to be able to compare these even remotely, choose a sample size larger than your RAM (e.g. RAM * 2).
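The fio invocation from that article, adapted to this comparison, would look roughly like this (the device path is a placeholder and the test overwrites whatever is on it):

Code:
# journal-style test: 4k sync writes, 3 parallel jobs against the NVMe
fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=3 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
# then the same with --numjobs=1 against a single MX300 and multiply the result by 3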

Thought about buying another MX300. Do you mean it for journaling or as another Ceph OSD?
I meant that instead of using the P3600 as journal for 3 SSDs, you'd use another MX300 as an OSD that is its own journal.

The problem (in my mind) really becomes that the write speed of a single P3600 (400GB) cannot cope with the write speeds of your 3 SSDs. So it basically is a bottleneck: 550 MB/s write on the P3600 vs 3x 250 MB/s on the MX300s (OSD, no journal) vs 4x 150-200 MB/s (OSD + journal).

Remember to do a 2x dd test - you are probably looking at something closer to 150-200 MB/s.

If you mean journaling, probably the S3700 with 200G would be a better choice. They bring HET and probably better write speed. What do you think?

The S3700 is worse than the P3600 (400G) as a journal device. Compare with the list here:
https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/

In my mind (and practice), I'd go with a P3600 (400G) only as a read-cache device (2100 MB/s read); for a write cache or as a journal, I'd use a P3600 (>=800G) or a P3700 (>=400GB). If these are too expensive, chances are high that increasing the number of SSDs (OSD + journal, --numjobs=2) will leave you better off.


ps: a single 10G link won't handle more than 1250 MB/s of raw incoming writes, so keep that in mind.
pps: did you use regular Linux bridges or did you go for openvswitch? Yes/no on jumbo frames?
 
Thank you for sharing your experience. You're right, I will have to do more realistic benchmarking. BTW: I have been using the benchmark proposals of Sebastien Han (the link you provided). I thought about leaving the journal on the MX300 before. I did the benchmarks with and without the NVMe drives with hardly any performance difference. Guess I will go with the P3600 (800G) and open an independent PCIe-only pool. Read performance of the MX300 is OK for web servers and mail servers, I guess. For database appliances the PCIe-only pool will probably give us more performance than doing read/write caching for the Crucials.

ps: a single 10G link won't handle more than 1250 MB/s of raw incoming writes, so keep that in mind.
pps: did you use regular Linux bridges or did you go for openvswitch? Yes/no on jumbo frames?

Bonding will bring the 2x 10G links together (still has to be done). This should be sufficient. Switching to 40G is (financially) out of scope for the moment. Yes, jumbo frames are enabled. I originally planned to do Linux bonding but, due to the lack of rate limiting, thought about using OVS. I did some first experiments with OVS and used IntPorts for breaking out IP devices for Ceph and corosync, but rate limiting still does not work as expected. Another thing that bothered me is that the bridge did not come up automatically. So some work has to be done here.
 
I borrowed some ioDrives (ioDrive2, 1.2 TB) from a colleague and had the chance to test them in my setup. I moved a 10 GB journal from the MX300 to the ioDrives and did the basic rados benchmark. The rest of the setup is unchanged (still only 1x 10G).

Code:
rados -p ceph-ssd bench 10 write --no-cleanup

Code:
Bandwidth (MB/sec):    1090.55
Stddev Bandwidth:      16.7809
Max bandwidth (MB/sec): 1104
Min bandwidth (MB/sec): 1048

Whereas the read performance is pretty much unchanged, the write performance almost doubled compared to putting the journal on the Crucial SSD. Pretty impressive.
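The read side can be checked against the objects that --no-cleanup left behind, e.g.:

Code:
rados -p ceph-ssd bench 10 seq
rados -p ceph-ssd cleanup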
 
Thank you for sharing your experience. You're right, I will have to do more realistic benchmarking. BTW: I have been using the benchmark proposals of Sebastien Han (the link you provided). I thought about leaving the journal on the MX300 before. I did the benchmarks with and without the NVMe drives with hardly any performance difference. [..]

Just so I do understand this correctly.
You performed the test as follows:

Test 1: 3 nodes with 3 SSD OSDs each acting as their own journal.
Test 2: 2 nodes with 3 SSD OSDs acting as their own journal, 1 node with 3 SSD OSDs having the journal on a P3600?

If so, then you are right, there will only be a slight performance difference, because the 2 nodes without a P3600 as journal device will slow you down.

Guess I will go with the P3600 (800G) and open an independent PCIe-only pool. Read performance of the MX300 is OK for web servers and mail servers, I guess. For database appliances the PCIe-only pool will probably give us more performance than doing read/write caching for the Crucials.

Just like it would be without having Ceph under the hood. We do this as well at work to tier our storage as needed.

You probably want to look into
crush location hook scripts that can set this up easily for you.
http://docs.ceph.com/docs/master/rados/operations/crush-map/

There are a bunch of pretty good ones out there on Git that separate HDD, SSD and NVMe disks into their own storage tiers.
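As a rough idea of what such a hook can look like (script name, config snippet and placement policy below are made-up examples, not one of those ready-made scripts) - Ceph calls the hook for every OSD and uses whatever CRUSH location it prints to stdout:

Code:
#!/bin/bash
# referenced from ceph.conf, e.g.:
#   [osd]
#   crush location hook = /usr/local/bin/ceph-crush-location
# Ceph invokes it roughly as: <hook> --cluster <name> --id <osd-id> --type osd

HOST=$(hostname -s)
ID=""
while [ $# -gt 0 ]; do
  case "$1" in
    --id) ID="$2"; shift 2 ;;
    *) shift ;;
  esac
done

# example policy: OSD ids listed in a (hypothetical) local file end up in the
# "nvme" root, everything else in the "ssd" root
if grep -qw "osd.${ID}" /etc/ceph/nvme-osds 2>/dev/null; then
  echo "root=nvme host=${HOST}-nvme"
else
  echo "root=ssd host=${HOST}-ssd"
fi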

Bonding will bring the 2x 10G links together (still has to be done). This should be sufficient. Switching to 40G is (financially) out of scope for the moment.
Remember when doing bonding:
a single flow will never be able to surpass the 10G limit (1250 MB/s), but multiple flows can share up to 20G (2500 MB/s) of bandwidth. In order to maximize the utilization of your bonded links you probably want to use balance-tcp.
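Once the bond is up, you can check how OVS actually spreads the flows and whether LACP negotiated properly (assuming the bond is called bond0):

Code:
ovs-appctl bond/show bond0
ovs-appctl lacp/show bond0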

I did some first experiments with OVS and used IntPorts for breaking out IP devices for Ceph and corosync, but rate limiting still does not work as expected. Another thing that bothered me is that the bridge did not come up automatically. So some work has to be done here.

IMHO (unless there have been some advances made for native Linux bridges) OVS uses fewer CPU cycles when doing the same workload.
The OVS bridge not coming up is most likely a misconfiguration. We use it all the time (and I do so on setups I have been contracted to do as well).

Regarding the rate limiting: to be totally honest, I never had a use for it because of our network layout. I either used poor man's QoS (separate links and switches) or had the switch or SDN do the QoS for me.
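If you do want to play with rate limiting on the OVS side, ingress policing on a port is the simplest knob I know of; something along these lines (port name and values are just an example, rate/burst are in kbps/kb):

Code:
# cap traffic received on the vlan55 internal port at roughly 3 Gbit/s
ovs-vsctl set interface vlan55 ingress_policing_rate=3000000 ingress_policing_burst=300000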
 
Just so I do understand this correctly.
You performed the test as follows:

Test 1: 3 nodes with 3 SSD OSDs each acting as their own journal.
Test 2: 2 nodes with 3 SSD OSDs acting as their own journal, 1 node with 3 SSD OSDs having the journal on a P3600?

No, not exactly.
Test 1: 4 nodes (4x 3 OSDs), journal on disk. Same result with 3 nodes with the journal on disk + 1 node with 1x P3600.
Test 2: 4 nodes (4x 3 OSDs), journal on an ioDrive2 (one ioDrive per node).

It is not surprising that there is no difference when using an external journal on only one node. The difference from using journaling on PCIe flash on all nodes is convincing. I used the ioDrives because I had the chance to test them and managed to compile the driver.
After I have fixed the networking issues I will come back to the crush map and separate the PCIe pool.
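For the PCIe pool itself, I expect something roughly like this (bucket, rule and pool names are made up; the exact pool/ruleset syntax depends on the Ceph release):

Code:
# a separate CRUSH root for the PCIe devices; the crush location hook (or a
# manual "ceph osd crush create-or-move") puts the NVMe OSDs below it
ceph osd crush add-bucket nvme root
# a rule that only selects OSDs under that root, and a pool that uses it
ceph osd crush rule create-simple nvme-rule nvme host
ceph osd pool create pcie-pool 128 128 replicated nvme-rule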

I see the advantages of the OVS solution. I am not the network guy, and doing it in software makes the setup less hardware dependent.
I know about the restriction of bonding. Having several 1250 MB/s streams is absolutely OK. I basically understand how OVS works, but I am struggling with how to implement STP/RSTP with VLANs and QoS. If I set up an inter-switch link (let's say 3x 10G in a trunk), do I need to activate/configure RSTP on the OVS side? Can I use balance-tcp if I use 2 NICs linked to 2 switches?
 
Sorry for the late reply, I was on vacation.

I see the advantages of the OVS solution. I am not the network guy, and doing it in software makes the setup less hardware dependent [..]
I basically understand how OVS works, but I am struggling with how to implement STP/RSTP with VLANs and QoS. If I set up an inter-switch link (let's say 3x 10G in a trunk), do I need to activate/configure RSTP on the OVS side?
[..]
Can I use balance-tcp if I use 2 NICs linked to 2 switches?

Disclaimer: I am not a network guy either (I just have to re-engineer our 200+ Ceph servers on a weekly basis to get the most out of our 3k+ spinners and 2k+ flash devices, and have them talk to our georedundant (among others) Proxmox clusters). But I think you do not understand what Open vSwitch actually is.

For your intents and purposes it is basically a different way of setting up networking ON your Proxmox server. Instead of using native Linux bridging you are using Open vSwitch. Both options are software. One is just more performant in doing its work (last I compared them side by side).

It also has some advanced features that you can use for SDN (software-defined networking) through the APIs, but you will most likely not be using them in your setup.

That said, I assume you have looked at the Proxmox Open vSwitch wiki?
https://pve.proxmox.com/wiki/Open_v...RSTP.29_-_1Gbps_uplink.2C_10Gbps_interconnect


I went back through the thread to try and find your exact setup (a napkin-based drawing might actually be helpful here).
I have 2x 10G bonded and uplinked to 2 switches (IBM Blade Switch G8124). The 2x 1Gbit ports are planned for the redundant outgoing VM traffic.
[..]
I have 2 of the 10G switches. The switches are in stacking mode (connected via 2x 10G DAC). The nodes are connected via 2x 10G (DAC) and 2x 1G (transceivers) to both switches. I planned to use the redundant 10G links for cluster stuff and the 1G links for outgoing VM traffic.
[..]
2x 1G and 2x 10G ports to both switches

I am assuming the following:
  • Switch1: IBM Blade Switch G8124
    • 24x 10G links
    • carving a dualport Nic into 2-8 vNics (100Mbit increments)
    • Static and LACP (IEEE 802.3ad) link aggregation, up to 16 trunk groups with up to 12 ports per trunk group --> 16 Lacp Groups with <=12 Ports per Group
    • Support for jumbo frames (up to 9,216 bytes)
    • With a simple configuration change to Easy Connect mode, the RackSwitch G8124E becomes a transparent network device that is invisible to the core, which eliminates network administration concerns of Spanning Tree Protocol configuration and interoperability and VLAN assignments and avoids any possible loops.
    • Support for IEEE 802.1p, IP ToS/DSCP, and ACL-based (MAC/IP source and destination addresses and VLANs) traffic classification and processing - Eight output Class of Service (COS) queues per port for processing qualified traffic
  • Switch2: IBM Blade Switch G8124
  • you have 4 Nodes.
    • each with 1x 10G to Switch1
    • each with 1x10G to Switch2
    • each with 1x1G to Switch1
    • each with 1x1G to Switch2
  • Switch1 and Switch2 are connected via 2x10G. They act as a transparent device (for anyone on the network they look like a single device)


Question is:
What happens when a single 10G switch fails? Does it actually split up the single virtual device into 2 separate ones, or does it fail completely?

If it does split them, I'd do the following:
  • Set up an OVS bridge on each Proxmox node for Ceph-Cluster and Ceph-Public
  • LACP-bond the 10G NICs from Proxmox1 and Proxmox2 to Switch1 with balance-tcp --> 2x 20G bonds on Switch1
  • LACP-bond the 10G NICs from Proxmox3 and Proxmox4 to Switch2 with balance-tcp --> 2x 20G bonds on Switch2
  • Cross-connect Switch1 and Switch2 with at least 3x 10G ([number of links connected to the switch] - ([number of links on the switch] / [number of switches])). You don't want to bottleneck yourself here in case you need to do rebalancing.
  • In case of a switch failure you will lose 50% of the Ceph nodes (that is why you chose Ceph in the first place). So roughly 50% performance.
  • Keep it off the client network.
- Then get 2 1G switches (Switch3 and Switch4), cross-connect them and plug them into your client network and your 1G Proxmox NICs, utilising a second, separate Open vSwitch bridge on each Proxmox node, e.g. 1x 1G from Proxmox1 to Switch3 and 1x 1G from Proxmox1 to Switch4 - unless your 1G switches do LACP, at which point use the example provided above for 10G and adapt it to your 1G switches.

- I'd run Proxmox Public on Switch3 and Switch4 via Open vSwitch Bridge0 (or use a standard Linux bridge for this if OVS-based rate limiting still does not work in Proxmox and you want to rate-limit your VMs' network speeds)
- I'd run Proxmox-Cluster, Ceph Public and Ceph Cluster on Switch1 and Switch2 via Open vSwitch Bridge1

This option gives you the best performance (and the ability to properly QoS the Proxmox-Cluster and Ceph Public/Cluster networks) in all situations - and is the least complex setup.
At least in my mind. (Bonus points: you have a lot more 10G links left over in case you want to add more nodes, or upgrade the NICs from 2x 10G to 1x 40G using 40G-to-4x10G breakout cables.)


If it does not split Switch1 and Switch2, then I'd keep them as separate switches (but still connected with 3x 10G + 1x 10G for every 10 1G links connected on said switch), use a single Open vSwitch bridge, not get the 2 additional 1G switches, and use (R)STP to make sure I do not get any loops when one of these switches fails.
That should still be performant enough, but IMHO it is a waste of 10G ports and needlessly complex (mixing the 1G public network in there).
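On the OVS side, (R)STP itself is just a flag on the bridge; something like this, assuming the bridge is called vmbr0 and your switches speak the same protocol:

Code:
ovs-vsctl set bridge vmbr0 rstp_enable=true
# or classic STP instead: ovs-vsctl set bridge vmbr0 stp_enable=true
ovs-vsctl list bridge vmbr0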

ps.: if one of your 10G switches ever goes down, both Proxmox (HA) and Ceph should be able to pick up the slack. You'll have some performance degradation for as long as it takes you to switch the 10G network links from e.g. Switch2 to Switch1 (and set up LACP on the switch).

pps.: you could do 1x 10G on Switch1 and 1x 10G on Switch2. On a 4-node system it is still the same practical performance hit; the upside is that 4 instead of 2 nodes keep working (with degraded performance) and there is no rebalancing when the second switch becomes available again. The disadvantage is that once you put more than 4 nodes in a cluster, the "all 10G NICs on a single switch" approach becomes more performant during failure, and the working nodes are not network-bottlenecked between themselves.


I hope that helps,
Q-Wulf
 
If it does split them, I'd do the following:
  • Set up an OVS bridge on each Proxmox node for Ceph-Cluster and Ceph-Public
  • LACP-bond the 10G NICs from Proxmox1 and Proxmox2 to Switch1 with balance-tcp --> 2x 20G bonds on Switch1
  • LACP-bond the 10G NICs from Proxmox3 and Proxmox4 to Switch2 with balance-tcp --> 2x 20G bonds on Switch2
  • Cross-connect Switch1 and Switch2 with at least 3x 10G ([number of links connected to the switch] - ([number of links on the switch] / [number of switches])). You don't want to bottleneck yourself here in case you need to do rebalancing.
  • In case of a switch failure you will lose 50% of the Ceph nodes (that is why you chose Ceph in the first place). So roughly 50% performance.
  • Keep it off the client network.
- Then get 2 1G switches (Switch3 and Switch4), cross-connect them and plug them into your client network and your 1G Proxmox NICs, utilising a second, separate Open vSwitch bridge on each Proxmox node, e.g. 1x 1G from Proxmox1 to Switch3 and 1x 1G from Proxmox1 to Switch4 - unless your 1G switches do LACP [..]

- I'd run Proxmox Public on Switch3 and Switch4 via Open vSwitch Bridge0 (or use a standard Linux bridge for this if OVS-based rate limiting still does not work in Proxmox and you want to rate-limit your VMs' network speeds)
- I'd run Proxmox-Cluster, Ceph Public and Ceph Cluster on Switch1 and Switch2 via Open vSwitch Bridge1

This option gives you the best performance (and the ability to properly QoS the Proxmox-Cluster and Ceph Public/Cluster networks) in all situations - and is the least complex setup.
At least in my mind. (Bonus points: you have a lot more 10G links left over in case you want to add more nodes, or upgrade the NICs from 2x 10G to 1x 40G using 40G-to-4x10G breakout cables.)

Could you be so kind as to post a sample interfaces file for this config? It looks really promising, but I'd like to walk through the config (instead of asking redundant, obvious questions :)
 
For the 10G network you are looking at this config from the proxmox wiki:

# Loopback interface
auto lo
iface lo inet loopback

# Bond eth0 and eth1 together
# these are your 10G nics on your proxmox host
allow-vmbr0 bond0
iface bond0 inet manual
ovs_bridge vmbr0
ovs_type OVSBond
ovs_bonds eth0 eth1
# Force the MTU of the physical interfaces to be jumbo-frame capable.
# This doesn't mean that any OVSIntPorts must be jumbo-capable.
# We cannot, however set up definitions for eth0 and eth1 directly due
# to what appear to be bugs in the initialization process.
pre-up ( ifconfig eth0 mtu 9000 && ifconfig eth1 mtu 9000 )
ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast
# make sure you use the highest MTU that ALL of your 10G network gear is capable of handling (9k-16k); in the case of the OP, that is up to 9,216 bytes.
mtu 9000

# Bridge for our bond and vlan virtual interfaces
auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
ovs_type OVSBridge
# NOTE: we MUST mention bond0, vlan50, and vlan55 even though each
# of them lists ovs_bridge vmbr0! Not sure why it needs this
# kind of cross-referencing but it won't work without it!
ovs_ports bond0 vlan50 vlan55 vlan60
mtu 9000

# Proxmox cluster communication vlan
allow-vmbr0 vlan50
iface vlan50 inet static
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=50
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
address 10.50.10.44
netmask 255.255.255.0
gateway 10.50.10.1
# Not sure why an MTU of 1500 was chosen in the wiki example. I'm like 99% sure we are using jumbo frames here as well at work - will need to check next week on the lab network. Maybe it was just used to illustrate that you can use multiple MTU values for VLANs, e.g. if your VLANs traverse not just jumbo-frame-capable network gear but also non-jumbo-frame-capable gear.
mtu 1500

# Ceph cluster communication vlan (jumbo frames)
allow-vmbr0 vlan55
iface vlan55 inet static
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=55
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
address 10.55.10.44
netmask 255.255.255.0
mtu 9000

# Ceph Public communication vlan (jumbo frames)
allow-vmbr0 vlan60
iface vlan60 inet static
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=60
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
address 10.60.10.44
# We use a larger subnet here at work, as we have more than 254 devices that may connect to it (Ceph servers, Ceph mons, Proxmox nodes, other nodes that store stuff on Ceph, etc.); at least one Ceph cluster I know of is running close to 230 devices.
netmask 255.255.255.0
mtu 9000

For the 1G network gear (assuming it is not jumbo-frame capable, you go down this route) you take this Open vSwitch bond approach, assuming your network gear is not LACP capable:

# Loopback interface
auto lo
iface lo inet loopback

# Bond eth3 and eth4 together
allow-vmbr1 bond1
iface bond1 inet manual
ovs_bridge vmbr1
ovs_type OVSBond
ovs_bonds eth3 eth4
ovs_options bond_mode=balance-slb vlan_mode=native-untagged

# Bridge for our bond and vlan virtual interfaces (our VMs will
# attach to this bridge)
auto vmbr1
allow-ovs vmbr1
iface vmbr1 inet manual
ovs_type OVSBridge
ovs_ports bond1 vlan1

# Virtual interface to take advantage of originally untagged traffic
allow-vmbr1 vlan1
iface vlan1 inet static
ovs_type OVSIntPort
ovs_bridge vmbr1
ovs_options vlan_mode=access
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
# This is where you access the GUI.
address 192.168.3.5
netmask 255.255.255.0
gateway 192.168.3.254

If you DO NOT want to use Openvswitch for Proxmox-Public, adapt this example:
https://pve.proxmox.com/wiki/Network_Configuration#_linux_bond

You probably also want to read this part of the wiki:
https://pve.proxmox.com/wiki/Open_vSwitch#Using_Open_vSwitch_in_Proxmox

And this, for the sake of understanding:
http://openvswitch.org/support/dist-docs/ovs-vswitchd.conf.db.5.html
balance-slb
Balances flows among slaves based on source MAC address and output VLAN, with periodic rebalancing as traffic patterns change.

active-backup
Assigns all flows to one slave, failing over to a backup slave when the active slave is disabled. This is the only bonding mode in which interfaces may be plugged into different upstream switches.

The following modes require the upstream switch to support 802.3ad with successful LACP negotiation. If LACP negotiation fails and other-config:lacp-fallback-ab is true, then active-backup mode is used:

balance-tcp
Balances flows among slaves based on L2, L3, and L4 protocol information such as destination MAC address, IP address, and TCP port.
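That lacp-fallback-ab option is worth setting on the balance-tcp bond, so a failed LACP negotiation degrades to active-backup instead of leaving you without connectivity; in the 10G interfaces example above that would be an extra other_config key on the bond (an addition of mine, untested in that exact setup):

Code:
ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast other_config:lacp-fallback-ab=true
# or at runtime:
# ovs-vsctl set port bond0 other_config:lacp-fallback-ab=true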





Now, as far as VMs go:
  • If they need public access, stick 'em on vmbr1
  • If you need inter-VM communication, stick 'em on vmbr0 on their own subnet/VLAN combo

Hope that helps.
 