Multipath Ceph storage network

Jan 13, 2025
We currently have a 4-node cluster (soon to be 5 nodes) using Ceph as the shared storage solution, made up of 8x 4TB NVMe cards in each node, so 32 OSDs in total. Everything is running as expected, and HA failover/migrations happen at expected speeds (better than expected, to be honest).

Each node has 4 onboard NICs (eno1/2 are 10 Gbps, eno3/4 are 1 Gbps), along with a single PCIe Mellanox card offering dual 100G ports.

So far we have one of the 100G ports on each node linked to a single port on an FS switch capable of 6.4 Tbps of switching capacity, configured on a non-routable VLAN. We have just purchased another identical switch and configured 4 ports on the same VLAN, and we are hoping to configure the Ceph storage network in a multipath A/B channel setup, as you would when connecting to a SAN.

Is this possible? I can't believe that such a mature hypervisor doesn't have the option to provide a failover path for the shared storage solution, so I'm hoping I've just missed it.
 
This is Ethernet and not Fibre-Channel.

You cannot have two separate interfaces in the same VLAN.

Create a stack out of the two switches (or MLAG or whatever the vendor calls it) and then a link aggregation group for each Ceph node with one port from each switch.

On the Ceph nodes create a bonding interface out of the two 100G interfaces.

Then you have two paths from each node.
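For reference, such a bond could look like this in /etc/network/interfaces on each PVE node. This is only a sketch: the interface names, VLAN tag 100, and the 10.10.10.0/24 subnet are assumptions, not taken from the thread.

```
# /etc/network/interfaces (excerpt) -- names/addresses are examples
auto bond0
iface bond0 inet manual
    bond-slaves enp65s0f0 enp65s0f1   # the two Mellanox 100G ports
    bond-mode 802.3ad                 # LACP; requires MLAG across the switch pair
    bond-xmit-hash-policy layer3+4
    bond-miimon 100

auto bond0.100
iface bond0.100 inet static
    address 10.10.10.11/24            # Ceph network on the non-routable VLAN
```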
 
Thanks for your reply, @gurubert.

I'm aware this is not Fibre Channel; I was hoping there was a way of assigning both NICs to Ceph storage traffic in an active/passive configuration.

The switches support MLAG, so I can set this up and create the link aggregation as you suggest. With this configuration I should be able to pull the power on one of the two switches without downtime, bar a few lost packets, I assume?

Separate question (mainly curiosity): what happens if the Ceph storage network only has one link and that switch goes down, how does Ceph behave? I'm assuming it can't rebalance because it has no network to send the rebalancing data over, so does Ceph stay in a paused state until the link is back up, or does some corruption happen?
 
assigning both nics to ceph storage traffic in an active/passive configuration.
Use MLAG with both FS switches and configure an LACP 802.3ad LAG both in PVE and in the switches, and you will get both links in an active/active setup with failover/failback times typically under 400ms. Remember to add the LAG to the Ceph VLAN.

storage network only has one link and that switch goes down, how does ceph behave?
You will lose Ceph quorum because no monitor will reach any other monitor, then all OSDs will halt all I/O. Obviously all VMs will hang until I/O is restored. The chance of data corruption is almost zero (I really think it's zero, but you know, you can't be 100% sure), but you may lose some in-flight writes in some VMs, which may need a chkdsk/fsck, and some files might get lost/damaged.

A Proxmox host would try to reboot itself if it lost corosync connectivity.
Only if OP has only one corosync link running in that same Ceph network, which he doesn't... do you, @leedys90 ? ;)
 
Thank you all for your answers, and @VictorSTS for the information about how Ceph behaves if the cluster/storage network fails entirely, it's good to know.

The monitors run on the public_network, which is separate from the cluster_network. So if I understand correctly, if the switch currently connected only to the cluster network fails, the hosts won't shut down because the monitor network will still be active. Is that correct?

I do currently have only one NIC configured for the public/corosync network. Would you recommend using 2 NICs in a bond in the same suggested config as the cluster_network (MLAG + LACP 802.3ad LAG)?

configure LACP 802.3ad LAG both in PVE
How is this done in PVE? Does the "Linux bond" interface in PVE use 802.3ad?

Another question if I may: given the high bandwidth available on the 100G switches, should I be using a separate network for the migration network? (Currently the migration network is configured to use the cluster_network interface.)
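For what it's worth, PVE lets you pin the migration network in /etc/pve/datacenter.cfg. A sketch, assuming the cluster_network subnet is 10.10.10.0/24 (a made-up example):

```
# /etc/pve/datacenter.cfg (excerpt) -- subnet is an example
migration: secure,network=10.10.10.0/24
```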
 
The monitors run on the public_network, which is separate from the cluster_network, so if I understand correctly if the switch currently connected to only the cluster network fails, the hosts won't shut down because the monitor network will still be active. Is that correct?
That depends on your corosync configuration and which interfaces are used for corosync and Ceph. Corosync networks are a completely different thing from Ceph networks. In fact, you may even have quorum in PVE and not in Ceph, and vice versa. Also remember that the Ceph public network is used for all client traffic (MONs, OSDs, and client access to data) and the Ceph cluster network is used for replication traffic only.
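That split is what the public_network and cluster_network options in ceph.conf control; a minimal excerpt (the subnets are examples, not from this thread):

```
# /etc/pve/ceph.conf (excerpt) -- subnets are examples
[global]
    public_network  = 10.10.20.0/24   # MONs + all client-facing traffic
    cluster_network = 10.10.10.0/24   # OSD replication traffic only
```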

You must use at least 2 corosync links if you plan on using HA, and at least one of them should be dedicated or, at the very least, must not be shared with your storage network, so that network saturation cannot disrupt corosync traffic; although highly improbable with 100G networks, it may still happen.
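The two links are declared per node in corosync.conf; a sketch for one node (node name and addresses are examples):

```
# /etc/pve/corosync.conf (excerpt) -- names/addresses are examples
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11   # dedicated corosync network (link 0, preferred)
    ring1_addr: 10.10.20.11    # second link on another network as fallback
  }
}
```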

How is this done in PVE?, does the "Linux bond" interface in PVE use 802.3ad ?
Linux bond, type LACP 802.3ad. It's all exposed in the webUI.
 