PVE HA cluster redundant networking with OpenVSwitch and RSTP

austrianbavarian

New Member
Feb 22, 2019
3
0
1
36
Dear Proxmox staff and forum members,

Being new to PVE and high-availability systems in general, I'd like to discuss a 3-node cluster setup that achieves network redundancy by means of OpenVSwitch using RSTP and 2 switches,
and I beg your pardon for possible beginner's mistakes.

Each node is configured with 3 OVS bridges: vmbr0 for all Ceph networking, vmbr1 for PVE inter-node communication and management, and vmbr2 for VLAN-tagged guest traffic.

Each of these OVS bridges consists of 2 Ethernet interfaces, one connected to switch1 (nominal switch), the other connected to switch2 (backup switch).

The RSTP-enabled switches allow "grouping" of ports by untagged VLANs ("port-based VLAN"), labelled "u-VLAN" in the attached network sketch.
PVE_redundant_networking_draft.JPG


I'm aware that RSTP is not VLAN-aware, but MSTP cannot be used, as OpenVSwitch does not support MSTP.

Network redundancy works for every node's vmbrX when one switch fails entirely: when I pull one switch's power supply, the port states on every node's interface pairs can be seen changing from forwarding to discarding and vice versa. So far, so good.

When only a single switch port fails (or a NIC on a node, or a link in general), those state changes can again be observed on the affected node's interfaces. The result is that only that one node's interface swaps over to the backup switch, while the other nodes remain connected to the nominal switch.
Thus, in practice, that node is no longer connected to the rest of the cluster.
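
To make the failure mode concrete, here is a minimal Python sketch of the behaviour described above. It is not an RSTP implementation: it assumes that each node forwards on exactly one uplink (switch1 while that link is up, otherwise switch2) and never relays traffic between its two uplinks. The node names pve1 to pve3, the function names, and the inter_switch_link flag are made up for illustration.

```python
SWITCHES = ("switch1", "switch2")  # nominal switch first, backup second


def active_uplink(node, failed_links):
    """Return the switch a node forwards on, or None if both links are down.

    Assumption: a node prefers switch1 and only moves to switch2 when its
    own link to switch1 has failed.
    """
    for switch in SWITCHES:
        if (node, switch) not in failed_links:
            return switch
    return None


def segments(nodes, failed_links, inter_switch_link):
    """Group nodes that can still reach each other at layer 2."""
    groups = {}
    for node in nodes:
        switch = active_uplink(node, failed_links)
        if switch is None:
            key = ("down", node)   # node has no working uplink at all
        elif inter_switch_link:
            key = ("shared",)      # both switches form one segment
        else:
            key = (switch,)        # one separate segment per switch
        groups.setdefault(key, []).append(node)
    return sorted(groups.values())


nodes = ["pve1", "pve2", "pve3"]
one_link_down = {("pve1", "switch1")}

# Without a link between the switches, pve1 ends up alone on switch2.
print(segments(nodes, one_link_down, inter_switch_link=False))
# With a link between the switches, all three nodes stay together.
print(segments(nodes, one_link_down, inter_switch_link=True))
```

Under these assumptions a single failed link splits the cluster only when the two switches have no path to each other, which matches the observation in this post.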

Does anybody have a suggestion for how I could address this problem?
Or is my approach to redundancy a hopeless/problematic one?
Could my goal of redundancy for single-link defects as well be achieved by using bonds (in active-backup mode) without a direct connection between the hardware switches (stacking)?

Thank you!
 

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
5,545
371
103

austrianbavarian

Hi wolfgang, thanks for your reply!

I tried to rethink the scenario with your recommendation of using active-backup bonding mode, but I'm still not sure whether this solves the problem of a single host being cut off from the others:

If I'm correct, the preferred way of configuring bonding in active-backup mode would be to add ARP targets to the bonds; as I understand it, in HA setups it makes sense to add at least 2 ARP targets.
In this case, I could only think of the addresses of the other 2 nodes; what would happen if one of these nodes failed completely?
The other 2 nodes' bonds would switch back and forth from one interface to the other (at a frequency determined, among other things, by the bond options updelay/downdelay and arp_interval) until both ARP targets become available again, which I guess would pose a threat to network stability.
Adding IP targets "behind" the switches from the nodes' perspective also does not guarantee that all nodes would use the same switch.
Is my line of thought correct so far?
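
Whether one dead ARP target makes a bond flap depends on how many targets must answer before a link counts as up. As far as I know, the Linux bonding driver documents an arp_all_targets option for this, with the values "any" (the default) and "all", and it applies to active-backup mode with ARP validation enabled. The Python sketch below only models that decision rule; it is not the driver's code, and the function name and the addresses 10.0.0.2 and 10.0.0.3 are hypothetical.

```python
def link_considered_up(replies, arp_all_targets="any"):
    """Decide whether the ARP monitor would treat a bond member as up.

    replies maps each ARP target address to True if it answered within
    the monitoring interval. The rule modelled here is an assumption
    based on the documented meaning of "any" and "all".
    """
    if not replies:
        return False  # no targets configured: nothing can confirm the link
    if arp_all_targets == "all":
        return all(replies.values())
    return any(replies.values())


# Hypothetical targets: the two peer nodes, one of which is powered off.
replies = {"10.0.0.2": True, "10.0.0.3": False}

print(link_considered_up(replies, "any"))  # True: one answer is enough
print(link_considered_up(replies, "all"))  # False: would trigger a failover
```

If this reading is right, the back-and-forth switching described above would only be expected when every target is required to answer, so it is worth checking which setting a given bond actually uses.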

wolfgang said:
With 3 nodes you can consider creating a full mesh; this reduces latency, which is essential for Ceph.
See https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
That's an excellent suggestion, as it will also reduce the amount of wiring required and the number of other components that could fail!
 

wolfgang

austrianbavarian said:
Adding IP targets "behind" the switches from the nodes' perspective also does not guarantee that all nodes would use the same switch.
Is my line of thought correct so far?
The switching is done by the bond. Technically, both links are up and both are registered with the bond's MAC address, which is that of the primary NIC.
When the bond switches to the other NIC, it announces this to the switch.
And yes, the bond only listens on the active NIC.
To solve your problem you have to connect switch1 to switch2 in any case.
 

austrianbavarian

I've now connected the two switches, while staying with STP for the moment (as soon as I have some time, I will also test the bond approach).
I configured the switches to use MSTP on their bridges, leaving RSTP enabled on the OVS bridges. It works like a charm for now: R/MSTP convergence happens so fast that, almost immediately after deactivating a single interface on a switch (or pulling a cable from the switch), the traffic is switched to the 10G "trunk" connection between the switches; and it still works, of course, if a switch fails completely.

Many thanks,
Gerald
 

Eitan

New Member
Feb 4, 2020
6
0
1
50
I know this is a bit old, but just out of curiosity (I am working on a similar setup): why not have the two switches connected in stack mode and gain the benefit of link aggregation AND redundancy?

For example, pve1 has port0 and port1 configured for load balancing (LACP or whatever the switch supports), and these ports would be connected to port0 on sw1 and port0 on sw2. pve2 is the same, with port0 and port1 connected to port1 on sw1 and port1 on sw2, and so forth. This way, if any piece of hardware fails, traffic keeps flowing, and when everything is working correctly you get double the bandwidth :D
 

wolfgang

Eitan said:
Why not have the two switches connected in stack mode and gain the benefit of link aggregation AND redundancy?
This works, but there are downsides.
With MLAG:
1.) Switches are more expensive if they are capable of MLAG.
2.) MLAG consumes resources, so latency will normally increase.
3.) Updating the switches gets more complicated.
4.) An MLAG implementation only works if the switches are the same brand and, in most cases, the same series with the same firmware.

With stacked switches:
1.) Latency will increase.
2.) Stacked switches are normally not capable of carrying their specified throughput between the members.

The cheaper the switches are, the more these points apply, and vice versa, of course.
 

Eitan

First time I have heard about this, I will look into it more.

wolfgang said:
With stacked switches:
1.) Latency will increase.
I can understand why latency would theoretically increase, but is this something that is even measurable?

This is what I came up with when performing a quick test:

My Windows box -> standalone (Cisco SG series) switch -> router:

This is hrPING v5.07.1148 by cFos Software GmbH -- http://www.cfos.de

Source address is 12.34.56.78; using ICMP echo-request, ID=ac1e
Pinging 1.2.3.4 [1.2.3.4 ]
with 32 bytes data (60 bytes IP):

From 1.2.3.4 : bytes=60 seq=0001 TTL=64 ID=9778 time=0.357ms
From 1.2.3.4 : bytes=60 seq=0002 TTL=64 ID=9779 time=0.226ms
From 1.2.3.4 : bytes=60 seq=0003 TTL=64 ID=977a time=0.271ms
From 1.2.3.4 : bytes=60 seq=0004 TTL=64 ID=977b time=0.247ms
From 1.2.3.4 : bytes=60 seq=0005 TTL=64 ID=977c time=0.266ms
From 1.2.3.4 : bytes=60 seq=0006 TTL=64 ID=977d time=0.271ms
From 1.2.3.4 : bytes=60 seq=0007 TTL=64 ID=977e time=0.192ms
From 1.2.3.4 : bytes=60 seq=0008 TTL=64 ID=977f time=0.278ms
From 1.2.3.4 : bytes=60 seq=0009 TTL=64 ID=9780 time=0.285ms
From 1.2.3.4 : bytes=60 seq=000a TTL=64 ID=9781 time=0.262ms
From 1.2.3.4 : bytes=60 seq=000b TTL=64 ID=9782 time=0.238ms
From 1.2.3.4 : bytes=60 seq=000c TTL=64 ID=9783 time=0.277ms
From 1.2.3.4 : bytes=60 seq=000d TTL=64 ID=9784 time=0.281ms
From 1.2.3.4 : bytes=60 seq=000e TTL=64 ID=9785 time=0.271ms
From 1.2.3.4 : bytes=60 seq=000f TTL=64 ID=9786 time=0.316ms
From 1.2.3.4 : bytes=60 seq=0010 TTL=64 ID=9787 time=0.239ms
From 1.2.3.4 : bytes=60 seq=0011 TTL=64 ID=9788 time=0.290ms
From 1.2.3.4 : bytes=60 seq=0012 TTL=64 ID=9789 time=0.227ms
From 1.2.3.4 : bytes=60 seq=0013 TTL=64 ID=978a time=0.274ms
From 1.2.3.4 : bytes=60 seq=0014 TTL=64 ID=978b time=0.324ms
From 1.2.3.4 : bytes=60 seq=0015 TTL=64 ID=978c time=0.235ms
[Aborting...]

Packets: sent=21, rcvd=21, error=0, lost=0 (0.0% loss) in 10.000432 sec
RTTs in ms: min/avg/max/dev: 0.192 / 0.267 / 0.357 / 0.035
Bandwidth in kbytes/sec: sent=0.125, rcvd=0.125


My Windows box -> standalone (Cisco SG series) switch -> switch stack (Cisco SGX series) -> server:

This is hrPING v5.07.1148 by cFos Software GmbH -- http://www.cfos.de

Source address is 12.34.56.78; using ICMP echo-request, ID=e01d
Pinging 4.3.2.1[4.3.2.1]
with 32 bytes data (60 bytes IP):

From 4.3.2.1: bytes=60 seq=0001 TTL=64 ID=d08d time=0.466ms
From 4.3.2.1: bytes=60 seq=0002 TTL=64 ID=d08e time=0.299ms
From 4.3.2.1: bytes=60 seq=0003 TTL=64 ID=d08f time=0.329ms
From 4.3.2.1: bytes=60 seq=0004 TTL=64 ID=d090 time=0.318ms
From 4.3.2.1: bytes=60 seq=0005 TTL=64 ID=d091 time=0.316ms
From 4.3.2.1: bytes=60 seq=0006 TTL=64 ID=d092 time=0.296ms
From 4.3.2.1: bytes=60 seq=0007 TTL=64 ID=d093 time=0.288ms
From 4.3.2.1: bytes=60 seq=0008 TTL=64 ID=d094 time=0.289ms
From 4.3.2.1: bytes=60 seq=0009 TTL=64 ID=d095 time=0.296ms
From 4.3.2.1: bytes=60 seq=000a TTL=64 ID=d096 time=0.274ms
From 4.3.2.1: bytes=60 seq=000b TTL=64 ID=d097 time=0.310ms
From 4.3.2.1: bytes=60 seq=000c TTL=64 ID=d098 time=0.348ms
From 4.3.2.1: bytes=60 seq=000d TTL=64 ID=d099 time=0.312ms
From 4.3.2.1: bytes=60 seq=000e TTL=64 ID=d09a time=0.316ms
From 4.3.2.1: bytes=60 seq=000f TTL=64 ID=d09b time=0.304ms
From 4.3.2.1: bytes=60 seq=0010 TTL=64 ID=d09c time=0.310ms
From 4.3.2.1: bytes=60 seq=0011 TTL=64 ID=d09d time=0.293ms
From 4.3.2.1: bytes=60 seq=0012 TTL=64 ID=d09e time=0.276ms
From 4.3.2.1: bytes=60 seq=0013 TTL=64 ID=d09f time=0.293ms
From 4.3.2.1: bytes=60 seq=0014 TTL=64 ID=d0a0 time=0.312ms
[Aborting...]

Packets: sent=20, rcvd=20, error=0, lost=0 (0.0% loss) in 9.500856 sec
RTTs in ms: min/avg/max/dev: 0.274 / 0.312 / 0.466 / 0.039
Bandwidth in kbytes/sec: sent=0.126, rcvd=0.126

So to me it looks like a 0.1 millisecond difference and that is 99% related to having another switch in between.
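
For reference, the gap between the two runs can be worked out directly from the summary lines above. The short Python sketch below only subtracts the reported min/avg/max values; the variable and function names are illustrative. Note that the two runs ping different targets (a router and a server) and contain only 21 and 20 samples, so this is a rough comparison rather than a controlled measurement.

```python
# Summary values in milliseconds, copied from the two hrPING runs above.
direct = {"min": 0.192, "avg": 0.267, "max": 0.357}
via_stack = {"min": 0.274, "avg": 0.312, "max": 0.466}


def rtt_delta(baseline, other):
    """Return the per-statistic difference (other minus baseline) in ms."""
    return {key: round(other[key] - baseline[key], 3) for key in baseline}


print(rtt_delta(direct, via_stack))
# {'min': 0.082, 'avg': 0.045, 'max': 0.109}
```

The average differs by about 0.045 ms, which is only slightly more than the reported deviation of each run (0.035 ms and 0.039 ms), while the minimum and maximum differ by roughly 0.08 to 0.11 ms.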

wolfgang said:
2.) Stacked switches are normally not capable of carrying their specified throughput between the members.
What exactly do you mean by this?
 

wolfgang

Eitan said:
What exactly do you mean by this?
For example, the Cisco SG550X-24 has 24 ports at 1 Gbit/s plus 4 at 10 Gbit/s and a switching capacity of 84.8 Gbps.
It uses only two SFP+ ports for stacking. As far as I understand the manual, they use plain IP on these ports (hybrid use).
So the switches have a 20 Gbit/s connection between them.
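
The point can be put into numbers with a small worked example. The Python sketch below compares the host-facing bandwidth of one stack member with its stack link, using the port counts given above. It assumes that the two stacking ports are taken from the four 10 Gbit/s ports, that bandwidth is counted per direction, and that in the worst case all host traffic has to cross to the other member; the function name is made up for illustration.

```python
def stack_oversubscription(gbit_ports, ten_gbit_ports, stack_ports):
    """Return host-facing bandwidth divided by stack-link bandwidth.

    Assumption: stack ports are 10 Gbit/s ports that are no longer
    available for hosts once they are used for stacking.
    """
    host_facing = gbit_ports * 1 + (ten_gbit_ports - stack_ports) * 10
    stack_link = stack_ports * 10
    return host_facing / stack_link


# 24 x 1 Gbit/s + 2 x 10 Gbit/s for hosts, 2 x 10 Gbit/s for the stack.
print(stack_oversubscription(24, 4, 2))  # 2.2
```

Under these assumptions, hosts on one member could offer up to 44 Gbit/s towards a 20 Gbit/s stack link, a ratio of 2.2 to 1. How often real traffic comes close to that worst case is a separate question.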

Eitan said:
So to me it looks like a 0.1 millisecond difference and that is 99% related to having another switch in between.
It is tricky to test this.
1.) What is the load on this switch?
2.) Do you stay on one switch, or must you go over both switches?
 

Eitan

wolfgang said:
Here the Cisco SG550X-24 has 24 ports at 1 Gbit/s plus 4 at 10 Gbit/s and a switching capacity of 84.8 Gbps.
It uses only two SFP+ ports for stacking. As far as I understand the manual, they use plain IP on these ports (hybrid use).
So the switches have a 20 Gbit/s connection between them.
Still not sure where you're going with this.
Are you trying to make the argument that there is some disadvantage or limitation to stacked switches as opposed to MLAG or dual standalone switches? If so, how realistic would the disadvantage be? Is it significant enough to make a case for considering better solutions?

wolfgang said:
It is tricky to test this.
1.) What is the load on this switch?
2.) Do you stay on one switch, or must you go over both switches?
Load on the switch... I would probably be exaggerating if I said 10%... I think maybe closer to 5%...
After logging in, the CPU seems to be dancing around between 10 and 50 percent.

I don't understand the second question; what do you mean by going over both switches?
 

wolfgang

Do not misunderstand me: your setup is fine and will work.
But no setup is perfect, and you asked about downsides.

Eitan said:
I don't understand the second question; what do you mean by going over both switches?
Client -> switch A -> switch B -> ping target
vs.
Client -> switch A -> ping target

The point is that you must also look at the worst-case scenario.
This has nothing to do with whether it is realistic or not.
But with it, you can make a decision about what you will do.
 

Eitan

My topology looks like this:

1580893682778.png

Does this answer your question?

Do you think that this is the most optimized solution that achieves performance, redundancy and cost savings?

If not, what would you suggest?

P.S. This exact layout is not final; I will be making some upgrades and layout changes in the near future to make the structure more logical and organized, but this is the environment that I am testing. There are a whole bunch of other clients connected to switch3 and some other servers connected to switch stack 1.
 
