Mesh SDN network for Ceph

Sep 7, 2025
Question: We are moving our three-node Proxmox cluster to new hardware, and this time around I have the opportunity to fit two dual-port ConnectX-6 Dx cards to each node. I have a drawer full of them because of some weirdness around pricing: when ordered with a second NIC in the server, the dual-port ConnectX-6 Dx OCP cards were much cheaper, effectively £50 each, so I bought them and promptly removed them, since they would never be used in those servers. The Proxmox nodes only came with OCP cards at all because those are Broadcom 25 Gbps; the lead time on the ConnectX-6 Lx was ridiculous, so no weird pricing this time around.

Anyway, my thinking is that this would provide redundancy to the networking links, but I have no idea if the SDN will make use of it. Does anyone know if this will work, or am I just smoking some good stuff?
 
Interesting. My presumption has always been that you needed a switch for LACP; I didn't realize you could do it between two servers. I'll need to order up some more DAC cables.
 
Nice hardware setup — with 2× dual-port ConnectX-6 Dx per node you have a lot of flexibility.

Before you order the DAC cables, one thing worth knowing: LACP (bonding mode 4) is designed for host-to-switch connections. The kernel bonding documentation lists "a switch that supports IEEE 802.3ad Dynamic link aggregation" as a prerequisite (bonding.rst, mode 4 prerequisites). For direct server-to-server DAC links, you'd want a different approach.

The Full Mesh wiki page is a great reference. Its examples assume 2 mesh ports per node (1 link per node-pair). With your 4 ports, you could dedicate 2 links per node-pair if you have separate NICs for management/corosync.

With 2 links per pair, the approach I'd suggest is the "Routed with Fallback" (FRR/OpenFabric) option from that wiki page — and the Fabrics feature automates exactly this. FRR treats each point-to-point link as a separate routing adjacency, so with 2 links per peer you get:
  • Bandwidth — ECMP (equal-cost multi-path) distributes traffic across both links to the same peer
  • Link redundancy — if one link fails, traffic shifts to the surviving link
  • Node redundancy — if a node goes down entirely, traffic reroutes through the third node

No bonding needed — FRR handles it at L3. FRR supports up to 64-way ECMP, and since OpenFabric is a link-state protocol (based on IS-IS), its SPF computation naturally finds all equal-cost paths. With 2 equal-cost links to the same peer, both should be installed as next-hops in the kernel routing table.
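For reference, the FRR configuration that Fabrics generates looks roughly like the sketch below. The fabric name, NET, and interface names here are illustrative placeholders, not what Fabrics will actually emit: the loopback carries the node's router address as a passive interface, and each point-to-point mesh port is simply enabled for OpenFabric.

Code:
! /etc/frr/frr.conf (sketch only; names and NET are made up)
router openfabric cephfabric
 net 49.0000.0000.0001.00
!
interface lo
 ip router openfabric cephfabric
 openfabric passive
!
! one stanza per mesh port (here: two ports towards each peer)
interface enp65s0f0np0
 ip router openfabric cephfabric
!
interface enp65s0f1np1
 ip router openfabric cephfabric
!

Once it's up, `vtysh -c "show openfabric neighbor"` should list an adjacency per link, and the kernel routing table should show the peer's loopback address with two next-hops.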

One thing to be aware of: the kernel's default ECMP hash (`fib_multipath_hash_policy=0`) only considers L3 (src/dst IP). Since Ceph traffic between two nodes uses the same IP pair, all of it would land on one link — you'd have redundancy but not bandwidth aggregation. To spread Ceph's per-OSD connections across both links, set the L4 hash policy (Fabrics doesn't configure this automatically):

Bash:
sysctl -w net.ipv4.fib_multipath_hash_policy=1

This includes TCP/UDP ports in the hash, so different OSD connections to the same peer get distributed across both links. Might want to add to `/etc/sysctl.d/` to persist across reboots.
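A minimal persistence sketch using a sysctl drop-in file (the filename is arbitrary); the IPv6 knob is only relevant if your fabric addresses are IPv6:

Code:
# /etc/sysctl.d/90-ecmp-hash.conf
net.ipv4.fib_multipath_hash_policy = 1
# only needed if the fabric/Ceph network runs over IPv6
net.ipv6.fib_multipath_hash_policy = 1

Apply immediately with `sysctl --system`, or it will take effect on the next reboot.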

You can set up Fabrics under Datacenter → SDN → Fabrics in the GUI. Choose OpenFabric, assign your ConnectX-6 interfaces, and it configures FRR on all nodes automatically.
 
A bit of a delayed update, because getting firmware updates from the vendor took some time so that all the cards could be on the same level. I am happy to report that it works really well.

However, I am wondering if I could do the same for the Corosync network? The reason being that I don't have any good options for 1 Gbps networking. This being an HPC environment, the only 1 Gbps ports are on the switches dedicated to the management network, and those are now getting on a bit. While it's not ideal if we lose a management switch, being unable to access the BMCs of the servers in a particular rack isn't exactly the end of the world.

There also seems to be a reluctance in the documentation to use the DCB features of my switches, which would have been my go-to option; what's this congestion you speak of? Not everyone is using SOHO or consumer-grade network switches, or is incompetent at configuring them. To be honest, a switch with 802.1Q and 802.1p would do the job of making congestion a non-issue, and those features are available in most SOHO managed switches these days.

My thinking is to use the spare 1 Gbps ports on the servers for either two separate mesh networks or a single mesh with redundant links. Is that viable, or should I investigate other options, like ignoring the requirement for a dedicated network entirely and just using the fact that I have DCB-capable switches and can configure them correctly?