Mesh SDN network for Ceph

Sep 7, 2025
Question: We are moving our three-node Proxmox cluster to new hardware, and this time around I have the opportunity to fit two dual-port ConnectX-6 Dx cards to each node. I have a drawer full of them because of weirdness around pricing: ordered alongside a second NIC, the dual-port ConnectX-6 Dx OCP cards were much cheaper, effectively £50 each, so I took them and promptly removed them since they will never be used on those servers. The Proxmox nodes only came with Broadcom 25 Gbps OCP cards because the lead time on the ConnectX-6 Lx was ridiculous; no weird pricing this time around.

Anyway, my thinking is that this would provide redundancy to the networking links, but I have no idea if the SDN will make use of it. Does anyone know if this will work, or am I just smoking some good stuff?
 
Interesting, my presumption has always been that you needed a switch for LACP; I didn't realize you could do it between two servers. I'll need to order up some more DAC cables.
 
Nice hardware setup — with 2× dual-port ConnectX-6 Dx per node you have a lot of flexibility.

Before you order the DAC cables, one thing worth knowing: LACP (bonding mode 4) is designed for host-to-switch connections. The kernel bonding documentation lists "a switch that supports IEEE 802.3ad Dynamic link aggregation" as a prerequisite (bonding.rst, mode 4 prerequisites). For direct server-to-server DAC links, you'd want a different approach.

The Full Mesh wiki page is a great reference. Its examples assume 2 mesh ports per node (1 link per node-pair). With your 4 ports, you could dedicate 2 links per node-pair if you have separate NICs for management/corosync.
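Concretely, 2 links per node-pair works out to six DAC cables. One possible cabling plan (the NIC/port labels are just placeholders) that spreads each node-pair across both physical NICs, so losing an entire card still leaves one direct link to every peer:

```
node1 nicA-p1 <-> node2 nicA-p1
node1 nicB-p1 <-> node2 nicB-p1
node1 nicA-p2 <-> node3 nicA-p1
node1 nicB-p2 <-> node3 nicB-p1
node2 nicA-p2 <-> node3 nicA-p2
node2 nicB-p2 <-> node3 nicB-p2
```

Each node uses all 4 ports, and no node-pair depends on a single NIC.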

With 2 links per pair, the approach I'd suggest is the "Routed with Fallback" (FRR/OpenFabric) option from that wiki page — and the Fabrics feature automates exactly this. FRR treats each point-to-point link as a separate routing adjacency, so with 2 links per peer you get:
  • Bandwidth — ECMP (equal-cost multi-path) distributes traffic across both links to the same peer
  • Link redundancy — if one link fails, traffic shifts to the surviving link
  • Node redundancy — if a node goes down entirely, traffic reroutes through the third node

No bonding needed — FRR handles it at L3. FRR supports up to 64-way ECMP, and since OpenFabric is a link-state protocol (based on IS-IS), its SPF computation naturally finds all equal-cost paths. With 2 equal-cost links to the same peer, both should be installed as next-hops in the kernel routing table.
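For reference, the FRR config this produces looks roughly like the sketch below (the fabric name, NET, and interface names are illustrative assumptions on my part; don't hand-edit the file if you manage it through the Fabrics GUI):

```
router openfabric ceph
 net 49.0001.1111.1111.1111.00
exit
!
! two interfaces toward each peer -> two adjacencies -> two equal-cost paths
interface ens2f0np0
 ip router openfabric ceph
exit
!
interface ens5f0np0
 ip router openfabric ceph
exit
```

Once the fabric is up, `vtysh -c 'show openfabric topology'` shows the adjacencies and computed paths, and `ip route` should show the peer routes with one nexthop per link.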

One thing to be aware of: the kernel's default ECMP hash (`fib_multipath_hash_policy=0`) only considers L3 (src/dst IP). Since Ceph traffic between two nodes uses the same IP pair, all of it would land on one link — you'd have redundancy but not bandwidth aggregation. To spread Ceph's per-OSD connections across both links, set the L4 hash policy (Fabrics doesn't configure this automatically):

Bash:
sysctl -w net.ipv4.fib_multipath_hash_policy=1

This includes TCP/UDP ports in the hash, so different OSD connections to the same peer get distributed across both links. Might want to add to `/etc/sysctl.d/` to persist across reboots.
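A minimal way to persist it (the filename `90-ceph-ecmp.conf` is an arbitrary choice of mine):

```shell
# Persist the L4 ECMP hash policy across reboots
echo 'net.ipv4.fib_multipath_hash_policy = 1' > /etc/sysctl.d/90-ceph-ecmp.conf
```

It then takes effect on the next boot, or immediately with `sysctl --system`.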

You can set up Fabrics under Datacenter → SDN → Fabrics in the GUI. Choose OpenFabric, assign your ConnectX-6 interfaces, and it configures FRR on all nodes automatically.