Sorry for the late reply, I was on vacation.
I see the advantages of the OVS solution. I am not the network guy, and doing it in software makes the setup less hardware dependent [..]
I basically understand how OVS works, but I am struggling with how to implement STP/RSTP with VLAN and QoS. If I set up an inter-switch link (let's say 3x 10G to a trunk), do I need to activate/configure RSTP on the OVS side?
[..]
Can I use Balance-TCP if I use 2 NICs linked to 2 switches?
Disclaimer: I am not a network guy either (I just have to re-engineer our 200+ Ceph servers on a weekly basis, to get the most out of our 3k+ spinners and 2k+ flash drives, and have them talk to our geo-redundant (among others) Proxmox clusters). But I think you do not understand what Open vSwitch actually is.
For your intents and purposes it is basically a different way of setting up networking ON your Proxmox server. Instead of using native Linux bridging you are using Open vSwitch. Both options are software; one is just more performant at doing its work (last I compared them side by side).
It also has some advanced features that you can use for SDN (software-defined networking) through its APIs, but you will most likely not be using them in your setup.
That said, I assume you have looked at the Proxmox Open vSwitch wiki?
https://pve.proxmox.com/wiki/Open_v...RSTP.29_-_1Gbps_uplink.2C_10Gbps_interconnect
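To answer the balance-tcp question directly: an OVS bond in balance-tcp mode needs LACP, so your 2 NICs must terminate on something that behaves like one logical switch (e.g. your stacked G8124 pair). As a rough sketch, assuming the Proxmox OVS integration and made-up interface names eno1/eno2 (adapt to your own), /etc/network/interfaces could look like:

```
# sketch only - interface/bridge names are assumptions
auto bond0
iface bond0 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr1
    ovs_bonds eno1 eno2
    # balance-tcp requires an LACP (802.3ad) capable peer,
    # i.e. both links going to one stack/MLAG, not two independent switches
    ovs_options bond_mode=balance-tcp lacp=active

auto vmbr1
iface vmbr1 inet manual
    ovs_type OVSBridge
    ovs_ports bond0
```

If the two switches are truly independent (no stacking/MLAG), balance-slb or active-backup are the usual fallback bond modes.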
I went back through the thread to try and find your exact setup (a napkin-based drawing might actually be helpful here).
I have 2x 10G bonded and uplinked to 2 switches (IBM Blade Switch G8124). The 2x 1Gbit ports are planned for the redundant outgoing VM traffic.
[..]
I have 2 of the 10G switches. The switches are in stacking mode (connected via 2x 10G DAC). The nodes are connected via 2x 10G (DAC) and 2x 1G (transceivers) to both switches. I planned to use the redundant 10G links for cluster stuff and the 1G links for outgoing VM traffic.
[..]
2x 1G and 2x 10G ports to both switches
I am assuming the following:
- Switch1: IBM Blade Switch G8124
- 24x 10G links
- carving a dual-port NIC into 2-8 vNICs (100 Mbit increments)
- Static and LACP (IEEE 802.3ad) link aggregation, up to 16 trunk groups with up to 12 ports per trunk group --> 16 LACP groups with <=12 ports per group
- Support for jumbo frames (up to 9,216 bytes)
- With a simple configuration change to Easy Connect mode, the RackSwitch G8124E becomes a transparent network device that is invisible to the core, which eliminates Spanning Tree Protocol configuration/interoperability concerns and VLAN assignments, and avoids any possible loops.
- Support for IEEE 802.1p, IP ToS/DSCP, and ACL-based (MAC/IP source and destination addresses and VLANs) traffic classification and processing
- Eight output Class of Service (CoS) queues per port for processing qualified traffic
- Switch2: IBM Blade Switch G8124
- you have 4 nodes,
- each with 1x 10G to Switch1
- each with 1x 10G to Switch2
- each with 1x 1G to Switch1
- each with 1x 1G to Switch2
- Switch1 and Switch2 are connected via 2x10G. They act as a transparent device (for anyone on the network they look like a single device)
The question is:
What happens when a single 10G switch fails? Does it actually split the single virtual device into 2 separate ones, or does it fail completely?
If it does split them, I'd do the following:
- Set up an OVS bridge on each Proxmox node for Ceph-Cluster and Ceph-Public
- LACP-bond the 10G NICs from Proxmox1 and Proxmox2 to Switch1 with balance-tcp --> 2x 20G bonds on Switch1
- LACP-bond the 10G NICs from Proxmox3 and Proxmox4 to Switch2 with balance-tcp --> 2x 20G bonds on Switch2
- Cross-connect Switch1 and Switch2 with at least 3x 10G ([number of links connected to the switch] - ([number of links on the switch] / [number of switches])). You don't want to bottleneck yourself here in case you need to do rebalancing
- In case of a switch failure you will lose 50% of the Ceph nodes (that is why you chose Ceph in the first place), so roughly 50% performance.
- keep it off the client-network
- Then get 2 1G switches (Switch3 and Switch4), cross-connect them and plug them into your client network and your 1G Proxmox NICs, utilising a second, separate Open vSwitch bridge on each Proxmox node, e.g. 1x 1G from Proxmox1 to Switch3 and 1x 1G from Proxmox1 to Switch4 - unless your 1G switches do LACP, at which point use the example provided above for 10G and adapt it to your 1G switches.
- I'd run Proxmox Public on Switch3 and Switch4 via Open vSwitch Bridge0 (or use a standard Linux bridge for this if OVS-based rate-limiting still does not work in Proxmox and you want to rate-limit your VMs' network speeds)
- I'd run Proxmox-Cluster, Ceph Public and Ceph Cluster on Switch1 and Switch2 via an Open vSwitch based Bridge1
This option gives you the best performance (and the ability to properly QoS the Proxmox-Cluster and Ceph Public/Cluster networks) in all situations - and is the least complex setup.
At least in my mind. (Bonus points: you have a lot more 10G links left over in case you want to add more nodes, or upgrade the NICs from 2x 10G to 1x 40G using 40G-to-4x10G breakout cables.)
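For the 1G side specifically, if Switch3/Switch4 turn out not to do LACP, an OVS active-backup bond gives you the redundancy without needing STP or any switch-side config. Again just a sketch with assumed names (eno3/eno4, vmbr0):

```
# sketch only - interface/bridge names are assumptions
auto bond1
iface bond1 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_bonds eno3 eno4           # 1x 1G to Switch3, 1x 1G to Switch4
    ovs_options bond_mode=active-backup   # no LACP needed on the switches

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond1
```

Only one 1G link carries traffic at a time, but failover is automatic when a switch dies.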
If it does not split Switch1 and Switch2, then I'd keep them as separate switches (but still connected with 3x 10G, plus 1x 10G for every 10 1G links connected on said switch), use a single Open vSwitch bridge, not get the 2 additional 1G switches, and use (R)STP to make sure I do not get any loops when one of these switches fails.
Should still be performant enough, but IMHO a waste of 10G ports and needlessly complex (mixing the 1G public network in there).
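If you go that (R)STP route: recent OVS versions can speak RSTP per bridge, so you don't have to fall back to plain STP. Something along these lines on each node (the bridge name vmbr1 is an assumption, and RSTP also has to be enabled on the physical switches):

```
# sketch only - vmbr1 is an assumed bridge name
ovs-vsctl set Bridge vmbr1 rstp_enable=true
# give the OVS bridge a high (= bad) priority so a physical switch stays root
ovs-vsctl set Bridge vmbr1 other_config:rstp-priority=61440
```

Lower priority values win the root election, so 61440 keeps your Proxmox nodes from ever becoming the root bridge.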
PS: if one of your 10G switches ever goes down, both Proxmox (HA) and Ceph should be able to pick up the slack. You'll have some performance degradation for as long as it takes you to move the 10G network links from e.g. Switch2 to Switch1 (and set up LACP on the switch).
PPS: you could do 1x 10G on Switch1 and 1x 10G on Switch2. On a 4-node system it is still the same practical performance hit. The upside is that 4 instead of 2 nodes keep working (with degraded performance) and there is no re-balancing when the second switch becomes available again. The disadvantage is that once you put more than 4 nodes in a cluster, the "all 10G NICs on a single switch" approach becomes more performant during failure, and the working nodes are not network-bottlenecked between themselves.
I hope that helps,
Q-Wulf