HA Bonding for best use of 4 NICs (2x 10G, 2x 1G)

eduncan911

Mar 12, 2021
I have 3 machines I am trying to cluster into an HA group. This is not a mission-critical setup, just a homelab, but it does run a lot of my personal stuff, so a bit of downtime for repairs is fine. It's mostly a way to learn and use HA in somewhat of a production environment.

I'm on the fence about a few different network configurations. I've read the docs' suggestion to use bonded pairs, across switches, especially for Corosync. Then I found the updated 6.x docs explaining that Corosync can now use a second link for its own redundancy, negating the need for bonding.

So, I have the following NICs in each server:

2x 10 Gbps
2x 1 Gbps

My idea was to set up a bonded pair across the 10G links, spanning two stacked switches, and use it for Ceph sync and application access (neither Ceph nor the application data would ever saturate a single 10G link, much less 20G). The idea is to have redundancy in case I reboot the switches or the 4-year-old pulls a plug (yeah, that has happened). Ceph and app networking would be on two different VLANs that I could throttle if need be (with the Ceph cluster air-gapped).

Same concept on the 1G links for Proxmox management, Corosync, and CLRNET downloads over different VLANs (some of them air-gapped).

So I wouldn't need Corosync's own redundant-link setup and would still gain the benefits of redundancy.

Am I on the right track here? I am about to start testing VLAN-aware bridges, or experimenting with Open vSwitch.
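For what it's worth, here is roughly what I have in mind for /etc/network/interfaces on each node. The interface names, VLAN IDs, and addresses below are placeholders I haven't tested yet, so treat it as a sketch rather than a working config:

    auto bond10
    iface bond10 inet manual
            bond-slaves enp65s0f0 enp65s0f1
            bond-mode 802.3ad
            bond-xmit-hash-policy layer3+4
            bond-miimon 100
            # 2x 10G, LACP across the two stacked switches (needs MLAG/stacking support)

    auto vmbr1
    iface vmbr1 inet manual
            bridge-ports bond10
            bridge-stp off
            bridge-fd 0
            bridge-vlan-aware yes
            bridge-vids 2-4094
            # Ceph and application VLANs ride on this bridge

    auto vmbr1.20
    iface vmbr1.20 inet static
            address 10.10.20.11/24
            # host address on the Ceph VLAN (no gateway, air-gapped)

    auto bond1
    iface bond1 inet manual
            bond-slaves eno1 eno2
            bond-mode active-backup
            bond-miimon 100
            # 2x 1G for management, Corosync, and download VLANs

    auto vmbr0
    iface vmbr0 inet static
            address 192.168.1.11/24
            gateway 192.168.1.1
            bridge-ports bond1
            bridge-stp off
            bridge-fd 0
            bridge-vlan-aware yes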
 
When using Corosync with other services on the same physical interface, be aware that if one of these other services is using up all the bandwidth, you will have some problems, especially if you use the PVE HA stack.

If one of these other services starts using up all the bandwidth, the latency for corosync packets will go up. If the latency is too high, corosync will consider that link unusable and mark it down. If no other corosync links are configured and the situation keeps going for too long (1 or 2 minutes), the PVE HA stack will kick in. First, the node that has lost the corosync connection will fence itself (hard reset) to make sure the HA guests are off. One minute later, the remaining nodes will start the HA guests.

Now, if all nodes are affected by that high latency, they will all lose the corosync connection to the majority of the cluster and fence themselves. What you see from the outside is that your whole cluster just reset.

We recommend having at least one physical interface dedicated to corosync. Configuring additional corosync links on the remaining networks might save you should the dedicated corosync link ever have a problem.
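To illustrate (node name and addresses are made up), a second link in /etc/pve/corosync.conf looks roughly like this, with link 0 on the dedicated corosync network and link 1 on another network as a fallback. There is one such node entry per cluster node:

    nodelist {
      node {
        name: pve1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.10.50.11   # dedicated corosync network (link 0)
        ring1_addr: 10.10.30.11   # fallback link on another network (link 1)
      }
    }

    totem {
      cluster_name: homelab
      config_version: 2
      interface {
        linknumber: 0
      }
      interface {
        linknumber: 1
      }
      ip_version: ipv4
      secauth: on
      version: 2
    }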
 
Here's what I did with the same NICs:

- 802.3ad bond of the two 10G NICs (bond10), then an active-backup bond of bond10 and one 1G NIC (bond0)
- VLAN-aware vmbr0 on bond0
- VLANs for Ceph public, Ceph cluster, and management on vmbr0 for the host
- Corosync primary link on one 1G NIC, secondary on the Ceph cluster VLAN

This way, either of my two switches can fail and I still get the maximum failover and redundancy out of the hardware.
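Roughly, in /etc/network/interfaces that looks like the sketch below (interface names and addresses are placeholders, so don't copy it verbatim). The interesting part is bond0, an active-backup bond whose members are the LACP bond and a single 1G NIC:

    auto bond10
    iface bond10 inet manual
            bond-slaves ens1f0 ens1f1
            bond-mode 802.3ad
            bond-xmit-hash-policy layer3+4
            bond-miimon 100
            # LACP over the two 10G ports

    auto bond0
    iface bond0 inet manual
            bond-slaves bond10 eno1
            bond-mode active-backup
            bond-primary bond10
            bond-miimon 100
            # stays on the 10G bond, falls back to the 1G port only if the whole bond fails

    auto vmbr0
    iface vmbr0 inet manual
            bridge-ports bond0
            bridge-stp off
            bridge-fd 0
            bridge-vlan-aware yes
            bridge-vids 2-4094
            # Ceph public, Ceph cluster and management are VLANs on this bridge

    auto eno2
    iface eno2 inet static
            address 10.10.50.12/24
            # remaining 1G NIC as the primary (link 0) corosync network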
 