OpenFabric: Could not find two T0 routers log spam after migration? (And VM access question)

telvenes

Hello everyone,

I successfully migrated my cluster last week from a manually configured BGP full mesh (using FRR) to the new Proxmox 9 SDN openfabric full mesh.

Initial Setup Note: My first problem was that the /etc/frr/daemons file was not updated correctly by the Proxmox SDN installation, which caused fabricd to fail to start. I solved this by completely purging frr (apt purge frr), deleting all files under /etc/frr, and then reinstalling frr. This fixed the initial setup issues, and the SDN apply process then worked correctly.
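For anyone who hits the same thing, the purge-and-reinstall amounted to roughly the following (a sketch of what I ran, not an official procedure; be aware it wipes all FRR configuration on the node):

```shell
# WARNING: removes frr and ALL of its configuration on this node
apt purge -y frr
rm -rf /etc/frr          # clear any leftover files the purge missed
apt install -y frr       # reinstall; the next SDN apply regenerates /etc/frr
```

After the reinstall, re-applying the SDN configuration from the GUI regenerated /etc/frr/daemons and /etc/frr/frr.conf for me.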

Problem 1: Log Spam

After the setup, my cluster is working well, and all nodes are peering correctly. However, the system logs on all nodes are being spammed with the following message every few minutes:
Code:
Nov 10 10:31:07 pve04 fabricd[1962]: [QBAZ6-3YZR3] OpenFabric: Could not find two T0 routers
Nov 10 10:36:03 pve04 fabricd[1962]: [QBAZ6-3YZR3] OpenFabric: Could not find two T0 routers
Nov 10 10:37:57 pve04 fabricd[1962]: [QBAZ6-3YZR3] OpenFabric: Could not find two T0 routers
Question 1: Everything seems to be working perfectly. What does this warning mean? Is it a problem in a full mesh topology, or can it be safely ignored?

Problem 2: VM Networking

Question 2: Now that the openfabric SDN is configured via the GUI, is there a simple way to connect a VM directly to this fabric?

I am running K3s nodes inside VMs, and they use ceph-csi for storage. I would love to give these K3s nodes direct access to the openfabric network (which is my Ceph network) to improve storage performance.

What is the recommended way to "bridge" a VM's vNIC to the openfabric zone?

Thank you for your help!
 
I noticed that I get about double the amount of this spam. I have both a Full Mesh Ceph and a Full Mesh Cluster network in my case, so it seems to generate this event once per fabric you create. See:

vtysh -c "show openfabric topology"

Code:
Area ceph:
IS-IS paths to level-2 routers that speak IP
 Vertex         Type         Metric  Next-Hop  Interface  Parent
 ----------------------------------------------------------------
 node-a
 10.15.15.1/32  IP internal  0                            node-a(4)
 node-b         TE-IS        10      node-b    mlx0       node-a(4)
 node-c         TE-IS        10      node-c    mlx1       node-a(4)
 10.15.15.2/32  IP TE        20      node-b    mlx0       node-b(4)
 10.15.15.3/32  IP TE        20      node-c    mlx1       node-c(4)

Code:
Area cluster:
IS-IS paths to level-2 routers that speak IP
 Vertex         Type         Metric  Next-Hop  Interface  Parent
 ----------------------------------------------------------------
 node-a
 10.14.14.1/32  IP internal  0                            node-a(4)
 node-b         TE-IS        10      node-b    cls0       node-a(4)
 node-c         TE-IS        10      node-c    cls1       node-a(4)
 10.14.14.2/32  IP TE        20      node-b    cls0       node-b(4)
 10.14.14.3/32  IP TE        20      node-c    cls1       node-c(4)

So far I've been running this cluster in production at a factory and haven't had any cause for concern, but I could find very little online in terms of actual troubleshooting for this message.

I can't answer your question about bridging a NIC to the openfabric network, as in my case I don't want my VMs on the Ceph or Cluster networks, sorry.
 
Hi!
Initial Setup Note: My first problem was that the /etc/frr/daemons file was not updated correctly by the Proxmox SDN installation, which caused fabricd to fail to start. I solved this by completely purging frr (apt purge frr), deleting all files under /etc/frr, and then reinstalling frr. This fixed the initial setup issues, and the SDN apply process then worked correctly.
Hmm, this should not happen. Did you get any error in the console or the syslog? Was the "fabricd" entry just set to "no" in the /etc/frr/daemons file? Did you perhaps have an older version of frr? Or did you update frr after applying?

Regarding the log spam: openfabric is derived from IS-IS, but with some special optimizations for tiered networks with spines and leaves. OpenFabric tries to be smart and find the T0 routers (the spines, or top-of-rack routers) to reduce flooding, etc. In a full-mesh network you also have east-west traffic, so openfabric can't fit the topology into a spine-leaf model and can't find the T0 routers. Everything still works, though, so you can safely ignore this message. I'll see if we can downgrade this message from info to debug.
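If you want to double-check that the warning is cosmetic, you can verify the adjacencies directly. fabricd mirrors isisd's show commands (the exact output depends on your fabric and interface names):

```shell
# list OpenFabric adjacencies; each peer should show state "Up"
vtysh -c "show openfabric neighbor"
# show the computed shortest-path topology per area
vtysh -c "show openfabric topology"
```

As long as every node appears as a neighbor in the Up state and the topology lists routes to all peers, the fabric is healthy regardless of the T0 warning.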
Regarding connecting a VM to a fabric: I would just create a simple zone (or a bridge) and NAT the traffic from the VMs to the fabric. Currently you can't configure the SNAT IP for a zone, so you would need to create a simple zone and a vnet (without enabling SNAT) and then, for example, use this nftables script to NAT:

Code:
#!/sbin/nft -f
flush ruleset

table inet nat {
        chain postrouting {
                type nat hook postrouting priority srcnat; policy accept;
                # Source-NAT traffic arriving from the simple zone's bridge
                # to the host's openfabric IP
                iifname "simple_zone" snat ip to <openfabric_ip_of_the_host>
        }
}

table inet filter {
        chain forward {
                type filter hook forward priority filter; policy accept;
                # Allow traffic from the simple zone out to the fabric,
                # and only related/established traffic back in
                iifname "simple_zone" accept
                oifname "simple_zone" ct state related,established accept
        }
}
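To make such a ruleset persistent, one option is to install it as /etc/nftables.conf and enable Debian's nftables service. The filename nat-to-fabric.nft below is just an example, and keep in mind that "flush ruleset" clears any rules other tools (e.g. the Proxmox firewall) install, so this only fits hosts where that is acceptable:

```shell
nft -c -f nat-to-fabric.nft                   # -c: syntax-check without applying
install -m 0644 nat-to-fabric.nft /etc/nftables.conf
systemctl enable --now nftables               # apply now and on every boot
```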
 