Second fabric does not fail over

HaZu

Member
May 6, 2022
Hi, we have a weird issue configuring a hyper-converged cluster using Fabrics:

Blue: Cluster, green: CephStor
[screenshot: network diagram of the three nodes]

After creating two Fabrics, one for Ceph and one for Cluster:
[screenshot: the two configured Fabrics]

The CephStor Fabric is working fine. If I deactivate vmbr2 (bond of ens1f1np1 and ens2f1np1), I can still ping from shadlpven00 through sshadlpven01 to sshadlpven02.
[screenshot: ping from shadlpven00 to sshadlpven02 still succeeding via sshadlpven01]
Not the fastest, but it looks okay so far.

However, in the Cluster Fabric, if I deactivate vmbr4 (vmbr4 = ens3f2) and try to ping sshadlpven02 (172.16.241.3) from shadlpven00 through sshadlpven01, the target host is unreachable.
[screenshot: ping to 172.16.241.3 reporting the destination host as unreachable]
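For the record, this is roughly how I take a leg offline and re-test the path (a sketch; the GUI "deactivate" should amount to the same as taking the link down):

Code:
# on shadlpven00: take the Cluster-fabric leg offline
ip link set vmbr4 down
# test the remaining path via sshadlpven01
ping -c 3 172.16.241.3
# check which next hop the kernel picked
ip route get 172.16.241.3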
Now comes the fun part:
If I start a ping from sshadlpven01 to sshadlpven02 (172.16.241.3), the ping from shadlpven00 to sshadlpven02 starts working! :oops:

[screenshot: both pings running at the same time]
It stops again soon after I stop the ping from sshadlpven01 (upper window in the screenshot below), though not immediately.

[screenshot: ping from shadlpven00 failing again after the ping from sshadlpven01 is stopped]

No idea what's wrong with it. We tried all sorts of variations: interface + bond, interface + bond + vmbr on top. All of them work fine until we take a port / bond / vmbr offline and the rerouting happens. In that case the traffic is correctly rerouted to the other node, but from there it is not forwarded. As I said, no idea why. The configuration looks fine to me. CephStor works flawlessly; only the second fabric is affected.
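For anyone debugging something similar, this is roughly what I look at on the middle node (sshadlpven01) while the broken ping is running; interface names and addresses match my config below, so treat it as a sketch:

Code:
# does OpenFabric still see the far node after the failover?
vtysh -c 'show openfabric topology'
# which route does the kernel actually use towards the target?
ip route get 172.16.241.3
# is there a resolved neighbour entry for it?
ip neigh show 172.16.241.3
# what goes out on the wire (ARP requests included)?
tcpdump -eni vmbr3 arp or icmp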

Code:
root@shadlpven00:~# cat /etc/frr/frr.conf
frr version 10.3.1
frr defaults datacenter
hostname shadlpven00
log syslog informational
service integrated-vtysh-config
!
router openfabric CephStor
 net 49.0001.1720.1624.0009.00
exit
!
router openfabric Cluster
 net 49.0001.1720.1624.0009.00
exit
!
interface dummy_CephStor
 ip router openfabric CephStor
 openfabric passive
exit
!
interface dummy_Cluster
 ip router openfabric Cluster
 openfabric passive
exit
!
interface vmbr1
 ip router openfabric CephStor
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
interface vmbr2
 ip router openfabric CephStor
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
interface vmbr3
 ip router openfabric Cluster
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
interface vmbr4
 ip router openfabric Cluster
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
access-list pve_openfabric_CephStor_ips permit 172.16.240.8/29
!
access-list pve_openfabric_Cluster_ips permit 172.16.241.0/24
!
route-map pve_openfabric permit 100
 match ip address pve_openfabric_CephStor_ips
 set src 172.16.240.9
exit
!
route-map pve_openfabric permit 110
 match ip address pve_openfabric_Cluster_ips
 set src 172.16.241.1
exit
!
ip protocol openfabric route-map pve_openfabric
!
!
line vty
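For reference, the route-map at the bottom is what selects the per-fabric source address; its state and the routes it applies to can be checked with FRR's standard show commands (a sketch):

Code:
vtysh -c 'show route-map pve_openfabric'
vtysh -c 'show ip route 172.16.241.0/24'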
 

I raised a ticket and am allowed to share the solution:

"The problem is that the ARP lookup failed because it takes the IP from the other fabric as source IP. That means with tcpdump you would have to see arp requests with whois 172.16.241.3 tell 172.16.240.12 (or another ip). We can solve the problem by distributing the IP address of the dummy interface to all interfaces. So arp always knows which source address must be used. I'll prepare a patch soon that will fix this, but in the meantime you can put the following config in /etc/network/interfaces (deliberately /etc/network/interfaces and not /etc/network/interfaces.d/sdn, so that this config is not overwritten when applying):

Code:
node shadlpven00:

auto vmbr1
iface vmbr1
    address 172.16.240.9/32
    ip-forward 1

auto vmbr2
iface vmbr2
    address 172.16.240.9/32
    ip-forward 1

auto vmbr3
iface vmbr3
    address 172.16.241.1/32
    ip-forward 1

auto vmbr4
iface vmbr4
    address 172.16.241.1/32
    ip-forward 1

Then repeat the same thing on all nodes: copy the fabric interfaces from /etc/network/interfaces.d/sdn to /etc/network/interfaces and put the dummy_X address on top.

Then reload with ifreload -a and check with ip a whether all vmbrX interfaces have the correct address."
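For completeness, verifying the workaround looked roughly like this (a sketch; addresses as on shadlpven00):

Code:
ifreload -a
# every fabric bridge should now carry its fabric's dummy address
ip -br addr show | grep vmbr
# ARP requests on the Cluster fabric should now use 172.16.241.1 as source
tcpdump -eni vmbr3 arp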

I must say, Proxmox support is awesome. The issue was solved the same day, and the contact was very pleasant.
 