Second fabric does not fail over

HaZu

Member
May 6, 2022
Hi, we have a weird issue configuring a hyper-converged cluster using Fabrics:

Blue: Cluster, green: CephStor
[screenshot: network diagram of the three nodes]

After creating two Fabrics, one for Ceph and one for Cluster:
[screenshot: the two configured Fabrics]

The CephStor Fabric is working fine. If I deactivate vmbr2 (bond of ens1f1np1 and ens2f1np1), I can still ping from shadlpven00 through sshadlpven01 to sshadlpven02.
[screenshot: ping from shadlpven00 to sshadlpven02 still succeeding via sshadlpven01]
Not the fastest, but it looks okay so far.

However, in the Cluster Fabric, if I deactivate vmbr4 (vmbr4 = ens3f2) and try to ping sshadlpven02 (172.16.241.3) from shadlpven00 through sshadlpven01, the target host is unreachable.
[screenshot: ping to 172.16.241.3 reporting the destination host as unreachable]
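For the record, this is roughly how I take a leg offline and re-test the path (a sketch; the GUI "deactivate" should amount to the same as taking the link down):

Code:
# on shadlpven00: take the Cluster-fabric leg offline
ip link set vmbr4 down
# test the remaining path via sshadlpven01
ping -c 3 172.16.241.3
# check which next hop the kernel picked
ip route get 172.16.241.3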
Now comes the fun part:
If I start a ping from sshadlpven01 to sshadlpven02 (172.16.241.3), the ping from shadlpven00 to sshadlpven02 starts working! :oops:

[screenshot: both pings running at the same time]
It stops again soon after I stop the ping from sshadlpven01 (upper window in the screenshot below), though not immediately.

[screenshot: ping from shadlpven00 failing again after the ping from sshadlpven01 is stopped]

No idea what's wrong with it. We tried all sorts of variations: interface + bond, interface + bond + vmbr on top. All of them work fine until we take a port / bond / vmbr offline and the rerouting happens. In that case the traffic is correctly rerouted to the other node, but from there it is not forwarded. As I said, no idea why. The configuration looks fine to me. CephStor works flawlessly; only the second fabric is affected.
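For anyone debugging something similar, this is roughly what I look at on the middle node (sshadlpven01) while the broken ping is running; interface names and addresses match my config below, so treat it as a sketch:

Code:
# does OpenFabric still see the far node after the failover?
vtysh -c 'show openfabric topology'
# which route does the kernel actually use towards the target?
ip route get 172.16.241.3
# is there a resolved neighbour entry for it?
ip neigh show 172.16.241.3
# what goes out on the wire (ARP requests included)?
tcpdump -eni vmbr3 arp or icmp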

Code:
root@shadlpven00:~# cat /etc/frr/frr.conf
frr version 10.3.1
frr defaults datacenter
hostname shadlpven00
log syslog informational
service integrated-vtysh-config
!
router openfabric CephStor
 net 49.0001.1720.1624.0009.00
exit
!
router openfabric Cluster
 net 49.0001.1720.1624.0009.00
exit
!
interface dummy_CephStor
 ip router openfabric CephStor
 openfabric passive
exit
!
interface dummy_Cluster
 ip router openfabric Cluster
 openfabric passive
exit
!
interface vmbr1
 ip router openfabric CephStor
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
interface vmbr2
 ip router openfabric CephStor
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
interface vmbr3
 ip router openfabric Cluster
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
interface vmbr4
 ip router openfabric Cluster
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
access-list pve_openfabric_CephStor_ips permit 172.16.240.8/29
!
access-list pve_openfabric_Cluster_ips permit 172.16.241.0/24
!
route-map pve_openfabric permit 100
 match ip address pve_openfabric_CephStor_ips
 set src 172.16.240.9
exit
!
route-map pve_openfabric permit 110
 match ip address pve_openfabric_Cluster_ips
 set src 172.16.241.1
exit
!
ip protocol openfabric route-map pve_openfabric
!
!
line vty
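For reference, the route-map at the bottom is what selects the per-fabric source address; its state and the routes it applies to can be checked with FRR's standard show commands (a sketch):

Code:
vtysh -c 'show route-map pve_openfabric'
vtysh -c 'show ip route 172.16.241.0/24'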
 

I raised a ticket and am allowed to share the solution:

"The problem is that the ARP lookup failed because it takes the IP from the other fabric as source IP. That means with tcpdump you would have to see arp requests with whois 172.16.241.3 tell 172.16.240.12 (or another ip). We can solve the problem by distributing the IP address of the dummy interface to all interfaces. So arp always knows which source address must be used. I'll prepare a patch soon that will fix this, but in the meantime you can put the following config in /etc/network/interfaces (deliberately /etc/network/interfaces and not /etc/network/interfaces.d/sdn, so that this config is not overwritten when applying):

Code:
node shadlpven00:

auto vmbr1
iface vmbr1
    address 172.16.240.9/32
    ip-forward 1

auto vmbr2
iface vmbr2
    address 172.16.240.9/32
    ip-forward 1

auto vmbr3
iface vmbr3
    address 172.16.241.1/32
    ip-forward 1

auto vmbr4
iface vmbr4
    address 172.16.241.1/32
    ip-forward 1

Then repeat the same thing on all nodes: copy the fabric interfaces from /etc/network/interfaces.d/sdn to /etc/network/interfaces and put the dummy_X address on top.

Then reload with ifreload -a and check with ip a whether all vmbrX interfaces have the correct address."
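For completeness, verifying the workaround looked roughly like this (a sketch; addresses as on shadlpven00):

Code:
ifreload -a
# every fabric bridge should now carry its fabric's dummy address
ip -br addr show | grep vmbr
# ARP requests on the Cluster fabric should now use 172.16.241.1 as source
tcpdump -eni vmbr3 arp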

I must say, Proxmox support is awesome. The issue was solved the same day, and the contact was very pleasant.
 