3-node cluster, SDN fabric shows 'not ok' for one node

ns33

Apr 4, 2024
Currently running a 3-node cluster on PVE 9.0.15. High-level network architecture on each node:
2x non-bonded 1G for corosync
2x 25G bonded in LACP for Ceph (public and Ceph cluster)
2x 10G bonded in LACP for VM traffic (SDN). There is no IP assigned to this bond.
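
For reference, the SDN bond is defined roughly like this in /etc/network/interfaces (NIC names here are placeholders, not the actual device names):

Code:
auto bond0
iface bond0 inet manual
    bond-slaves enp65s0f0 enp65s0f1
    bond-miimon 100
    bond-mode 802.3ad
    # note: no address stanza -- the bond is intentionally IP-less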

Off of the 10G bond, I have a VLAN for web UI access.

Currently everything is running off of a Cisco network stack of multiple 9300 and 9300X switches. This is a dev environment, so the network layout isn't ideal. On the switch side, the bonded ports are configured as trunk ports in a port channel with the LACP rate set to fast.
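
The switch-side config for each node's bonded pair looks roughly like this (port and channel-group numbers are illustrative):

Code:
interface range TenGigabitEthernet1/0/1 - 2
 channel-group 10 mode active
 lacp rate fast
!
interface Port-channel10
 switchport mode trunk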

Anyway, here's how I created the fabric: using OpenFabric, I set an unused IPv4 prefix, kept the hello/CSNP intervals at their defaults, then added each node to the configured fabric, selecting the IP-less bond containing the 10G NICs and setting an IP in the 'Create Node' window for the fabric.

After applying, the first node shows the fabric as 'not ok', and the system log for that node continually prints 'OpenFabric: Needed to resync LSPDB using CSNP'.

I'm at a loss here on what would cause this log message and what would cause the fabric to show as 'not ok'. Where should I start looking for the source of the problem?
 
Some more information:

Doing an 'ip route' on each node, only nodes 2 and 3 list an entry for openfabric.

If I go to the web UI and select the fabric under each node in the server view (Main is the name of the fabric; bond0 is what I selected when adding the node):

Code:
Node1 Fabric:
    Routes: Empty
    Neighbors: Node2
    Interfaces: dummy_Main
                bond0

Node2 Fabric:
    Routes: Lists the IP for Node3
    Neighbors: Node3
    Interfaces: bond0
                dummy_Main

Node3 Fabric:
    Routes: Lists the IP for Node2
    Neighbors: Node2
    Interfaces: bond0
                dummy_Main
 
The status 'not ok' currently indicates whether any routes have been learned via the fabric, so most likely that is the reason for the status being 'not ok'. Judging from the routes, it seems like there are issues with the connection to node 1.

Also, is there a reason why you're using a bond over simply adding both interfaces as one to the fabric?
 
Other than the potential for increased throughput, there isn't a specific reason.

Not sure why I didn't test this earlier, but nodes 2/3 are unable to ping Node1 using the IP defined for the fabric node. If I remove node 2 or 3, the remaining one can then ping Node1. As soon as I add the removed node back, I lose the ability to ping the newly added node.

Is this an issue on the switch hardware side then?
 
Are all of the nodes connected to the same switch? Is there a particular reason why you'd want to run a dynamic routing protocol over a single switch, rather than just statically configuring the bonds?

edit: sorry, misinterpreted your response.

I think it would be worthwhile trying just a single, simple interface and see if it works then. If you're already using a dynamic routing protocol, you shouldn't need a bond. If you're using multiple interfaces, then ECMP should take care of redundancy / load balancing. Depending on the traffic patterns you might want to use layer4 hashing though, for better distribution of connection flows.
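
As a sketch of what I mean by layer4 hashing (exact stanzas depend on your setup): on a bond it's the transmit hash policy, while for ECMP it's the kernel's multipath hash policy:

Code:
# bond variant: hash on IPs + L4 ports so distinct TCP flows
# can land on different member links
iface bond0 inet manual
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

# ECMP variant: include L4 ports in the kernel's multipath hash
sysctl -w net.ipv4.fib_multipath_hash_policy=1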

Also: are those point-to-point connections, and are you running OpenFabric on the switches as well? Or what does your topology look like exactly?
 
I went ahead and removed the bond and added each of the individual interfaces to the fabric, and now they all show as 'ok'.

I don't totally understand why it works with individual interfaces but not a bond. I'm just a software dev; networking is not my strong suit at all. Definitely open to an explanation of why, or at least a pointer to sources I can dive into to understand it.

To answer your question, though: to the best of my understanding, they aren't point-to-point and the switches aren't running OpenFabric. They're L3 Cisco switches running IOS. All network interfaces on each node go to one of the switches in the stack.
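
For completeness, the working per-node setup boils down to FRR fabricd config along these lines (interface names and the NET are illustrative, not my real values):

Code:
interface enp65s0f0
 ip router openfabric Main
!
interface enp65s0f1
 ip router openfabric Main
!
router openfabric Main
 net 49.0000.0000.0001.00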

Thank you!