PVE HA cluster redundant networking with OpenVSwitch and RSTP

austrianbavarian · Feb 22, 2019

Dear Proxmox staff and forum members,

Being new to PVE and high-availability systems in general, I'd like to discuss a 3-node cluster setup that provides network redundancy by means of Open vSwitch using RSTP and 2 switches.
Please pardon any beginner's mistakes.

Each node is configured with 3 OVS bridges: vmbr0 for all Ceph networking, vmbr1 for PVE inter-node communication and management, and vmbr2 for VLAN-tagged guest traffic.

Each of these OVS bridges has 2 Ethernet interfaces, one connected to switch1 (nominal switch), the other connected to switch2 (backup switch).

The RSTP-enabled switches allow "grouping" of ports by untagged VLANs ("port-based VLAN"), labelled as "u-VLAN" in the attached network sketch.

I'm aware that RSTP is not VLAN-aware, but MSTP cannot be used, as Open vSwitch does not support MSTP.

The network redundancy works for all of the nodes' vmbrX bridges when one switch fails entirely: when I pull one switch's power supply, the port state changes from forwarding to discarding (and vice versa) can be nicely observed on every node's interface pairs. So far, so good.

When only a single switch port fails (or a NIC on a node, or a link in general), those state changes can again be observed, but only on the affected node's interfaces. This leads to a situation in which only that node's interface swaps over to the backup switch, while the other nodes remain connected to the nominal switch.
Thus, in practice, one node is no longer connected to the rest of the cluster.

Does anybody have a suggestion how I could address this problem?
Or is my approach to redundancy a hopeless or problematic one?
Could my goal of redundancy for single link defects as well be achieved by using bonds in active-backup mode, without a direct connection between the hardware switches (stacking)?

Thank you!

wolfgang · Feb 25, 2019

Hi

austrianbavarian said:
Does anybody have a suggestion how I could address this problem?

Use a bond in active-backup mode.
Your setup achieves the same, but needs more resources and contains more parts that have to be debugged if a problem occurs.

vmbr -> bond -> 2 x nics
Also, Open vSwitch is not needed in this scenario.

With 3 nodes you can consider creating a full mesh; this reduces the latency, which is essential for Ceph.
see https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

austrianbavarian · Feb 25, 2019

Hi wolfgang, thanks for your reply!

I tried to rethink the scenario with your recommendation of using active-backup bonding mode, but I'm still not sure whether this solves the problem of a single host leaving the other ones alone:

If I'm correct, the preferred way of configuring bonding in active-backup mode would be to add ARP targets to the bonds; as I understand it, in HA setups it makes sense to add at least 2 ARP targets.
In this case, I could only think of the addresses of the other 2 nodes; what would happen if one of these nodes failed completely?
The other 2 nodes' bonds would switch back and forth (at a frequency determined, among other things, by the bond options updelay/downdelay and arp_interval) from one interface to the other until both ARP targets become available again, which I guess would pose a threat to network stability.
Adding IP targets "behind" the switches (from the nodes' perspective) also does not guarantee that all nodes would use the same switch.
Is my line of thought correct so far?

wolfgang said:
With 3 nodes you can consider creating a full mesh; this reduces the latency, which is essential for Ceph.
see https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

That's an excellent suggestion, as it will also reduce the required amount of wiring and the number of other components that could fail!

wolfgang · Feb 25, 2019

austrianbavarian said:
Adding IP targets "behind" the switches (from the nodes' perspective) also does not guarantee that all nodes would use the same switch.
Is my line of thought correct so far?

The switching is done by the bond. Technically, both links are up and both are registered with the MAC of the bond, which is the one of the primary NIC.
When the bond switches to the other NIC, it announces this to the switch.
And yes, the bond only listens on the active NIC.
To solve your problem, you have to connect switch1 with switch2 in any case.
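
Why the switch interconnect matters can be illustrated with a small model. The following Python sketch is only an illustration under simplifying assumptions: each node uses exactly one active uplink (switch1 if that link is healthy, otherwise switch2), frames arriving on an inactive NIC are ignored, and the names pve1 to pve3, switch1 and switch2 are placeholders taken from this thread. It does not model RSTP convergence or any real bonding driver behaviour.

```python
from collections import deque

NODES = ["pve1", "pve2", "pve3"]
SWITCHES = ["switch1", "switch2"]  # switch1 = nominal, switch2 = backup


def active_uplinks(failed_links):
    """Pick one active uplink per node: the first switch whose link is healthy."""
    uplinks = []
    for node in NODES:
        for switch in SWITCHES:
            if (node, switch) not in failed_links:
                uplinks.append((node, switch))
                break
    return uplinks


def reachable_nodes(start, failed_links, interlink):
    """Return the sorted list of nodes that `start` can reach (including itself)."""
    links = active_uplinks(failed_links)
    if interlink:
        links.append(("switch1", "switch2"))

    # Build an undirected adjacency map from the active links.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Breadth-first search over the active topology.
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(device for device in seen if device in NODES)


single_link_failure = {("pve1", "switch1")}
print(reachable_nodes("pve1", single_link_failure, interlink=False))
print(reachable_nodes("pve1", single_link_failure, interlink=True))
```

In this model, a single failed link leaves pve1 alone on switch2 when the switches are not connected, while the same failure is harmless once switch1 and switch2 are linked. A complete failure of switch1 moves all nodes to switch2 together, which matches the behaviour described in the first post.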

austrianbavarian · Feb 27, 2019

I've now connected the two switches, while staying with STP for the moment (as soon as I have some time, I will also test the bond approach).
I configured the switches to use MSTP on their bridges, leaving RSTP enabled on the OVS bridges. It works like a charm for now: R/M-STP convergence happens so fast that, almost immediately after deactivating a single interface on a switch (or pulling a cable from the switch), the traffic gets switched to the 10G "trunk" connection between the switches; and it still works, of course, if a switch fails completely.

Many thanks,
Gerald

Eitan · Feb 4, 2020

I know this is a bit old, but just out of curiosity (I am working on a similar setup): why not have the two switches connected in stack mode and gain the benefit of link aggregation AND redundancy?

For example, pve1 has port0 and port1 configured for load balancing (LACP or whatever the switch supports), and these ports would be connected to port0 on sw1 and port0 on sw2. pve2 is the same and has port0 and port1 connected to port1 on sw1 and port1 on sw2, and so forth. This way, if any piece of hardware fails, traffic keeps flowing, and when everything is working correctly you get double the bandwidth.

wolfgang · Feb 5, 2020

Eitan said:
Why not have the two switches connected in stack mode and gain the benefit of link aggregation AND redundancy?

This works, but:
With MLAG:
1.) Switches are more expensive if they are capable of MLAG.
2.) MLAG consumes resources, so the latency will normally increase.
3.) Updates of the switches get more complicated.
4.) The MLAG implementation only works if the switches are the same brand and mostly the same series with the same firmware.

With stacked switches:
1.) The latency will increase.
2.) Stacked switches are normally not able to carry the specified throughput between the members.

The cheaper the switches are, the more these points apply, and of course vice versa.

Eitan · Feb 5, 2020

wolfgang said:
MLAG

First time I have heard about this, I will look into it more.

wolfgang said:
With Stacked Switches.
1.) the latency will increase.

I can understand why theoretically latency would increase, but is this something that is even measurable?

This is what I came up with when performing a quick test:

My Windows box -> standalone (Cisco SG series) switch -> router:

This is hrPING v5.07.1148 by cFos Software GmbH -- http://www.cfos.de

Source address is 12.34.56.78; using ICMP echo-request, ID=ac1e
Pinging 1.2.3.4 [1.2.3.4 ]
with 32 bytes data (60 bytes IP):

From 1.2.3.4 : bytes=60 seq=0001 TTL=64 ID=9778 time=0.357ms
From 1.2.3.4 : bytes=60 seq=0002 TTL=64 ID=9779 time=0.226ms
From 1.2.3.4 : bytes=60 seq=0003 TTL=64 ID=977a time=0.271ms
From 1.2.3.4 : bytes=60 seq=0004 TTL=64 ID=977b time=0.247ms
From 1.2.3.4 : bytes=60 seq=0005 TTL=64 ID=977c time=0.266ms
From 1.2.3.4 : bytes=60 seq=0006 TTL=64 ID=977d time=0.271ms
From 1.2.3.4 : bytes=60 seq=0007 TTL=64 ID=977e time=0.192ms
From 1.2.3.4 : bytes=60 seq=0008 TTL=64 ID=977f time=0.278ms
From 1.2.3.4 : bytes=60 seq=0009 TTL=64 ID=9780 time=0.285ms
From 1.2.3.4 : bytes=60 seq=000a TTL=64 ID=9781 time=0.262ms
From 1.2.3.4 : bytes=60 seq=000b TTL=64 ID=9782 time=0.238ms
From 1.2.3.4 : bytes=60 seq=000c TTL=64 ID=9783 time=0.277ms
From 1.2.3.4 : bytes=60 seq=000d TTL=64 ID=9784 time=0.281ms
From 1.2.3.4 : bytes=60 seq=000e TTL=64 ID=9785 time=0.271ms
From 1.2.3.4 : bytes=60 seq=000f TTL=64 ID=9786 time=0.316ms
From 1.2.3.4 : bytes=60 seq=0010 TTL=64 ID=9787 time=0.239ms
From 1.2.3.4 : bytes=60 seq=0011 TTL=64 ID=9788 time=0.290ms
From 1.2.3.4 : bytes=60 seq=0012 TTL=64 ID=9789 time=0.227ms
From 1.2.3.4 : bytes=60 seq=0013 TTL=64 ID=978a time=0.274ms
From 1.2.3.4 : bytes=60 seq=0014 TTL=64 ID=978b time=0.324ms
From 1.2.3.4 : bytes=60 seq=0015 TTL=64 ID=978c time=0.235ms
[Aborting...]

Packets: sent=21, rcvd=21, error=0, lost=0 (0.0% loss) in 10.000432 sec
RTTs in ms: min/avg/max/dev: 0.192 / 0.267 / 0.357 / 0.035
Bandwidth in kbytes/sec: sent=0.125, rcvd=0.125

My Windows box -> standalone (Cisco SG series) switch -> switch stack (Cisco SGX series) -> server:

This is hrPING v5.07.1148 by cFos Software GmbH -- http://www.cfos.de

Source address is 12.34.56.78; using ICMP echo-request, ID=e01d
Pinging 4.3.2.1[4.3.2.1]
with 32 bytes data (60 bytes IP):

From 4.3.2.1: bytes=60 seq=0001 TTL=64 ID=d08d time=0.466ms
From 4.3.2.1: bytes=60 seq=0002 TTL=64 ID=d08e time=0.299ms
From 4.3.2.1: bytes=60 seq=0003 TTL=64 ID=d08f time=0.329ms
From 4.3.2.1: bytes=60 seq=0004 TTL=64 ID=d090 time=0.318ms
From 4.3.2.1: bytes=60 seq=0005 TTL=64 ID=d091 time=0.316ms
From 4.3.2.1: bytes=60 seq=0006 TTL=64 ID=d092 time=0.296ms
From 4.3.2.1: bytes=60 seq=0007 TTL=64 ID=d093 time=0.288ms
From 4.3.2.1: bytes=60 seq=0008 TTL=64 ID=d094 time=0.289ms
From 4.3.2.1: bytes=60 seq=0009 TTL=64 ID=d095 time=0.296ms
From 4.3.2.1: bytes=60 seq=000a TTL=64 ID=d096 time=0.274ms
From 4.3.2.1: bytes=60 seq=000b TTL=64 ID=d097 time=0.310ms
From 4.3.2.1: bytes=60 seq=000c TTL=64 ID=d098 time=0.348ms
From 4.3.2.1: bytes=60 seq=000d TTL=64 ID=d099 time=0.312ms
From 4.3.2.1: bytes=60 seq=000e TTL=64 ID=d09a time=0.316ms
From 4.3.2.1: bytes=60 seq=000f TTL=64 ID=d09b time=0.304ms
From 4.3.2.1: bytes=60 seq=0010 TTL=64 ID=d09c time=0.310ms
From 4.3.2.1: bytes=60 seq=0011 TTL=64 ID=d09d time=0.293ms
From 4.3.2.1: bytes=60 seq=0012 TTL=64 ID=d09e time=0.276ms
From 4.3.2.1: bytes=60 seq=0013 TTL=64 ID=d09f time=0.293ms
From 4.3.2.1: bytes=60 seq=0014 TTL=64 ID=d0a0 time=0.312ms
[Aborting...]

Packets: sent=20, rcvd=20, error=0, lost=0 (0.0% loss) in 9.500856 sec
RTTs in ms: min/avg/max/dev: 0.274 / 0.312 / 0.466 / 0.039
Bandwidth in kbytes/sec: sent=0.126, rcvd=0.126

So to me it looks like a 0.1 millisecond difference and that is 99% related to having another switch in between.
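
For reference, the two hrPING summary lines above can be compared directly. This is just a small Python sketch that parses the two "RTTs in ms" lines copied from the output in this post and subtracts them; the variable and function names are made up for the example.

```python
import re

# Summary lines copied verbatim from the two hrPING runs above.
SUMMARY_DIRECT = "RTTs in ms: min/avg/max/dev: 0.192 / 0.267 / 0.357 / 0.035"
SUMMARY_STACK = "RTTs in ms: min/avg/max/dev: 0.274 / 0.312 / 0.466 / 0.039"


def parse_rtts(line):
    """Extract the min/avg/max/dev values (in ms) from an hrPING summary line."""
    values = [float(v) for v in re.findall(r"\d+\.\d+", line)]
    return dict(zip(("min", "avg", "max", "dev"), values))


direct = parse_rtts(SUMMARY_DIRECT)
stack = parse_rtts(SUMMARY_STACK)

# Difference between the path through the stack and the direct path.
delta = {key: round(stack[key] - direct[key], 3) for key in ("min", "avg", "max")}
for key, value in delta.items():
    print(f"{key}: +{value:.3f} ms")
```

With these numbers the average differs by about 0.045 ms, and the minimum and maximum by roughly 0.08 to 0.11 ms. Both runs contain only about 20 samples each, and the difference in the averages is close to the reported deviation of about 0.035 to 0.039 ms, so this is a rough indication rather than a measurement.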

wolfgang said:
2.) Stacked switches are normally not able to carry the specified throughput between the members.

What exactly do you mean by this?

wolfgang · Feb 5, 2020

Eitan said:
What exactly do you mean by this?

For example, the Cisco SG550X-24 has 24 ports with 1 Gbit/s plus 4 ports with 10 Gbit/s and a switching capacity of 84.8 Gbps.
It uses only two SFP+ ports for stacking. As far as I understand the manual, they use plain IP on these ports (hybrid use).
So the switches have a 20 Gbit/s connection between them.
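
The arithmetic behind this point can be written down as a short Python sketch. It uses only the port counts and speeds quoted above and makes one worst-case assumption that is not from the manual: all remaining ports of one stack member send at line rate to the other member at the same time. The constant names are chosen for this example only.

```python
# Figures quoted in this post for one stack member.
ACCESS_PORTS = 24          # 1 Gbit/s ports
ACCESS_SPEED_GBIT = 1
TEN_GIG_PORTS = 4          # 10 Gbit/s ports in total
TEN_GIG_SPEED_GBIT = 10
STACK_PORTS = 2            # 10 Gbit/s ports used for stacking

# Bandwidth available between the two stack members (per direction).
stack_gbit = STACK_PORTS * TEN_GIG_SPEED_GBIT

# Worst-case assumption: every non-stacking port sends to the other member.
offered_gbit = (ACCESS_PORTS * ACCESS_SPEED_GBIT
                + (TEN_GIG_PORTS - STACK_PORTS) * TEN_GIG_SPEED_GBIT)

oversubscription = offered_gbit / stack_gbit

print(f"stack link: {stack_gbit} Gbit/s")
print(f"worst-case offered load: {offered_gbit} Gbit/s")
print(f"oversubscription: {oversubscription:.1f}:1")
```

Under that assumption, up to 44 Gbit/s could be offered to a 20 Gbit/s stack link, that is about 2.2:1. Whether real traffic ever gets close to this depends entirely on the workload.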

Eitan said:
So to me it looks like a 0.1 millisecond difference and that is 99% related to having another switch in between.

It is tricky to test this.
1.) What is the load on this switch?
2.) Do you stay on one switch, or do you have to go over both switches?

Eitan · Feb 5, 2020

wolfgang said:
For example, the Cisco SG550X-24 has 24 ports with 1 Gbit/s plus 4 ports with 10 Gbit/s and a switching capacity of 84.8 Gbps.
It uses only two SFP+ ports for stacking. As far as I understand the manual, they use plain IP on these ports (hybrid use).
So the switches have a 20 Gbit/s connection between them.

Still not sure where you're going with this?
Are you trying to make the argument that there is somehow a disadvantage or limitation to stacked switches as opposed to MLAG or dual standalone switches? If so, how realistic would the disadvantage be? Is it significant enough to make a case for considering better solutions?

wolfgang said:
It is tricky to test this.
1.) What is the load on this switch?
2.) Do you stay on one switch, or do you have to go over both switches?

Load on the switch... I would probably be exaggerating if I said 10%...I think maybe closer to 5%...
After logging in, CPU seems to be dancing around between 10-50 percent.

I don't understand the second question; what do you mean by going over both switches?

wolfgang · Feb 5, 2020

Do not misunderstand me, your setup is fine and will work.
But no setup is perfect, and you asked about downsides.

Eitan said:
I don't understand the second question; what do you mean by going over both switches?

Client -> switch A -> switch B -> ping Target
vs
Client -> switch 1 -> ping Target

The point is that you must also look at the worst-case scenario.
This has nothing to do with whether it is realistic or not.
But with this, you can decide what you will do.

Eitan · Feb 5, 2020

My topology looks like this:

Does this answer your question?

Do you think that this is the most optimized solution that achieves performance, redundancy and cost savings?

If not, what would you suggest?

P.S. this exact layout is not final, I will be making some upgrades and layout changes in the near future to make the structure more logical and organized, but this is the environment that I am testing. There are a whole bunch of other clients connected to switch3 and some other servers connected to switch stack 1.
