Stumped by a routing issue

morik_proxmox

Dear experts,
I'm a bit puzzled and don't know how to move forward. Hence, requesting your assistance.

Desire: Use dedicated NICs for dedicated functions (e.g. one NIC for storage, a second NIC as backup to the first via vPC on the switches, a third NIC for certain VMs, a fourth NIC for certain other VMs, etc.)

Problem: Employing multiple bridges crashes the whole network (STP loop detected).

Setup:
  • 6 NICs on a standalone PVE8.2 (w/ an intention to convert to cluster)
    • Four are Broadcom (BCM57504) 25G. Each port has NPAR=4, which presents 4 virtual ports to the hypervisor (PVE). Meaning, PVE sees a total of 4*4=16 BRCM ports. SR-IOV is disabled.
    • Two are NVIDIA 100G. SR-IOV is enabled, but the VF count is set to 0 in the firmware.
  • VLANs involved:
    • 140: Management
    • 100: Server
    • 120: Storage (NFS access here)
    • 121: Storage backup (NFS access desired here)
  • To simplify the setup:
    • No VMs installed yet
    • The NPAR'ed ports of each physical port are bonded together with balance-xor; each bond is assigned to its own bridge, and each bridge has an IP
    • All six physical ports are configured as access ports on the (Cisco) switch with the following configuration (where X is one of the VLANs above):
Switch port configuration
Code:
switchport mode access
switchport access vlan X
mtu 9216
no flowcontrol receive on
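
(As a quick sanity check for the jumbo-frame settings, the path MTU can be verified end-to-end; a sketch, assuming the NFS server at 192.168.120.7 from the scenarios below. 8972 = 9000 minus 28 bytes of IP/ICMP overhead.)

Code:
ping -M do -s 8972 -c 3 192.168.120.7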

Sample `lshw -class net` output for BRCM NIC
Code:
  *-network:1
       description: Ethernet interface
       product: BCM57504 NetXtreme-E RDMA Partition
       vendor: Broadcom Inc. and subsidiaries
       physical id: 0.1
       bus info: pci@0000:19:00.1
       logical name: eno8403np1
       version: 11
       serial: d0:8e:79:f2:14:af
       width: 64 bits
       clock: 33MHz
       capabilities: pm vpd msix pciexpress bus_master cap_list rom ethernet physical fibre autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=bnxt_en driverversion=6.8.12-2-pve duplex=full firmware=229.2.62.0/pkg 22.92.07.50 latency=0 link=yes multicast=yes port=fibre slave=yes
       resources: iomemory:21ff0-21fef iomemory:21ff0-21fef iomemory:21ff0-21fef irq:17 memory:21ffff0e0000-21ffff0effff memory:21fffd000000-21fffdffffff memory:21ffff170000-21ffff177fff memory:a6340000-a637ffff

NIC drivers for both types are properly loaded
Code:
lsmod | grep -e nvidia -e bnxt
nvidia_vgpu_vfio      114688  10
nvidia              54284288  3
mdev                   24576  1 nvidia_vgpu_vfio
kvm                  1339392  2 nvidia_vgpu_vfio,kvm_intel
vfio_pci_core          86016  2 nvidia_vgpu_vfio,vfio_pci
irqbypass              12288  3 vfio_pci_core,nvidia_vgpu_vfio,kvm
vfio                   65536  4 vfio_pci_core,nvidia_vgpu_vfio,vfio_iommu_type1,vfio_pci
bnxt_en               380928  0
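
(Per-interface driver binding can be double-checked with ethtool as well; the BRCM ports should report bnxt_en and the NVIDIA 100G ports an mlx5 driver:)

Code:
ethtool -i eno8403np1
ethtool -i ens2f0np0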

Sample dmesg output for the BRCM ports
Code:
[    5.637754] bnxt_en 0000:19:00.0 eth0: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0f0000, node addr d0:8e:79:f2:14:ae
[    5.637767] bnxt_en 0000:19:00.0: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[    5.638226] bnxt_en 0000:19:00.1 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[    5.783770] bnxt_en 0000:19:00.1 eth1: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0e0000, node addr d0:8e:79:f2:14:af
[    5.783786] bnxt_en 0000:19:00.1: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[    5.784264] bnxt_en 0000:19:00.2 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[    5.866088] Backport based on https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git 51d9270
[    5.866090] compat.git: https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git
[    5.921285] bnxt_en 0000:19:00.2 eth2: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0d0000, node addr d0:8e:79:f2:14:b0
[    5.921296] bnxt_en 0000:19:00.2: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[    5.921582] bnxt_en 0000:19:00.3 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[    5.946404] bnxt_en 0000:19:00.3 eth3: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0c0000, node addr d0:8e:79:f2:14:b1
[    5.946411] bnxt_en 0000:19:00.3: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[    5.946593] bnxt_en 0000:19:00.4 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[    5.970527] bnxt_en 0000:19:00.4 eth4: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0b0000, node addr d0:8e:79:f2:14:b2
[    5.970533] bnxt_en 0000:19:00.4: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[    5.970705] bnxt_en 0000:19:00.5 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[    5.993783] bnxt_en 0000:19:00.5 eth5: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0a0000, node addr d0:8e:79:f2:14:b3
[    5.993788] bnxt_en 0000:19:00.5: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)

Code:
dmesg | grep -e nvidia  | head -n 20
[    5.866088] Backport based on https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git 51d9270
[    5.866090] compat.git: https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git
[    9.043196] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[   10.003012] audit: type=1400 audit(1729815717.795:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1558 comm="apparmor_parser"
[   10.003025] audit: type=1400 audit(1729815717.795:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1558 comm="apparmor_parser"
[  195.238507] nvidia 0000:8a:00.4: enabling device (0000 -> 0002)
[  195.238842] nvidia 0000:8a:00.5: enabling device (0000 -> 0002)
[  195.238987] nvidia 0000:8a:00.6: enabling device (0000 -> 0002)
[  195.239348] nvidia 0000:8a:00.7: enabling device (0000 -> 0002)
[  195.239475] nvidia 0000:8a:01.0: enabling device (0000 -> 0002)
[  195.239581] nvidia 0000:8a:01.1: enabling device (0000 -> 0002)
[  195.239768] nvidia 0000:8a:01.2: enabling device (0000 -> 0002)
[  195.239929] nvidia 0000:8a:01.3: enabling device (0000 -> 0002)
[  195.240082] nvidia 0000:8a:01.4: enabling device (0000 -> 0002)
[  195.240269] nvidia 0000:8a:01.5: enabling device (0000 -> 0002)
[  195.240384] nvidia 0000:8a:01.6: enabling device (0000 -> 0002)
[  195.240527] nvidia 0000:8a:01.7: enabling device (0000 -> 0002)
[  195.240652] nvidia 0000:8a:02.0: enabling device (0000 -> 0002)
[  195.240819] nvidia 0000:8a:02.1: enabling device (0000 -> 0002)
[  195.240941] nvidia 0000:8a:02.2: enabling device (0000 -> 0002)

/etc/network/interfaces
Code:
auto lo
iface lo inet loopback

auto eno8304np0
iface eno8304np0 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
#Embedded NIC#1 BRCM#1 (Left#4 from front) NPar#2; Switch-side VLAN 140 (Supervisor)

auto eno8305np0
iface eno8305np0 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
#Embedded NIC#1 BRCM#1 (Left#4 from front) NPar#3; Switch-side VLAN 140 (Supervisor)

auto eno8306np0
iface eno8306np0 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
#Embedded NIC#1 BRCM#1 (Left#4 from front) NPar#4; Switch-side VLAN 140 (Supervisor)

auto ens2f0np0
iface ens2f0np0 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on

auto ens2f1np1
iface ens2f1np1 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on

iface eno8303np0 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on

auto eno8503np2
iface eno8503np2 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on

auto eno8603np3
iface eno8603np3 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on

auto eno8404np1
iface eno8404np1 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on

auto eno8504np2
iface eno8504np2 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on

auto eno8604np3
iface eno8604np3 inet manual
    mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on

auto eno8403np1
iface eno8403np1 inet manual
    mtu 9000

auto eno8405np1
iface eno8405np1 inet manual
    mtu 9000

auto eno8406np1
iface eno8406np1 inet manual
    mtu 9000

auto eno8505np2
iface eno8505np2 inet manual
    mtu 9000

auto eno8506np2
iface eno8506np2 inet manual
    mtu 9000

auto eno8605np3
iface eno8605np3 inet manual
    mtu 9000

auto eno8606np3
iface eno8606np3 inet manual
    mtu 9000

auto bond0
iface bond0 inet manual
    bond-slaves eno8304np0 eno8305np0 eno8306np0
    bond-miimon 100
    bond-mode balance-xor
    bond-xmit-hash-policy layer2+3
    mtu 9000
#Bond - 25G Management VLAN (Npar=3)

auto bond1
iface bond1 inet manual
    bond-slaves eno8403np1 eno8404np1 eno8405np1 eno8406np1
    bond-miimon 100
    bond-mode balance-xor
    mtu 9000
#Bond - 25G Storage (Npar=4)

auto bond2
iface bond2 inet manual
    bond-slaves eno8503np2 eno8504np2 eno8505np2 eno8506np2
    bond-miimon 100
    bond-mode balance-xor
    bond-xmit-hash-policy layer2+3
    mtu 9000
#Bond - 25G Storage Backup (Npar=4)

auto bond3
iface bond3 inet manual
    bond-slaves eno8603np3 eno8604np3 eno8605np3 eno8606np3
    bond-miimon 100
    bond-mode balance-xor
    bond-xmit-hash-policy layer2+3
    mtu 9000
#Bond - 25G Server (Npar=4, use for cluster in future)

auto vmbr0
iface vmbr0 inet static
    address 192.168.140.7/24
    gateway 192.168.140.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    mtu 9000
#Bridge - Management VLAN(Npar=3, 1 not usable)

auto vmbr1
iface vmbr1 inet static
    address 192.168.120.40/24
    #gateway 192.168.120.1
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    mtu 9000
#Bridge - Storage (25G Npar=4)

auto vmbr2
iface vmbr2 inet static
    address 192.168.121.40/24
    #gateway 192.168.121.1
    bridge-ports bond2
    bridge-stp off
    bridge-fd 0
    mtu 9000
#Bridge - Storage Backup (25G Npar=4)

auto vmbr3
iface vmbr3 inet static
    address 192.168.100.40/24
    #gateway 192.168.100.1
    bridge-ports bond3
    bridge-stp off
    bridge-fd 0
    mtu 9000
#Bridge - Server VLAN (non-SRIOV mobility)

iface vmbr4 inet static
    address 192.168.100.41/24
    bridge-ports ens2f0np0
    bridge-stp off
    bridge-fd 0
    mtu 9000
#gateway 192.168.100.1
#Bridge - Server VLAN (SRIOV) #1

iface vmbr5 inet static
    address 192.168.100.42/24
    bridge-ports ens2f1np1
    bridge-stp off
    bridge-fd 0
    mtu 9000
#gateway 192.168.100.1
#Bridge - Server VLAN (SRIOV) #2

source /etc/network/interfaces.d/*

pveversion
Code:
pve-manager/8.2.7/3e0176e6bb2ade3b (running kernel: 6.8.12-2-pve)

Scenario#1 (OK)
With only a default gateway (.140.1), life is good. But this doesn't serve the purpose, because the default gateway on vmbr0 is then always used to reach the NFS server.

Code:
ip r
 default via 192.168.140.1 dev vmbr0 proto kernel onlink
 192.168.140.0/24 dev vmbr0 proto kernel scope link src 192.168.140.7
ip route get 192.168.120.7
 192.168.120.7 via 192.168.140.1 dev vmbr0 src 192.168.140.7 uid 0

Scenario#2 (NOT OK)
There should only be one default gateway, and the PVE GUI doesn't allow adding a second one. (A gateway entry can be added manually to `/etc/network/interfaces`, but setting that aside.) As the intention is for PVE to use one of the NICs (vmbr1) for its own, non-VM purpose of NFS mounts, I attempt to add a route to the routing table:

Code:
ip route add 192.168.120.7/32 dev vmbr1
ip r
 default via 192.168.140.1 dev vmbr0 proto kernel onlink
 192.168.140.0/24 dev vmbr0 proto kernel scope link src 192.168.140.7
 192.168.120.7/32 dev vmbr1 proto kernel scope link
ip route get 192.168.120.7
 192.168.120.7 dev vmbr1 src 192.168.120.40 uid 0
I am able to ping the NFS server for about a minute, after which STP blocking kicks in.
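
(For what it's worth, the port state can be watched while this happens; a diagnostic sketch, assuming iproute2's `bridge` tool on the PVE side and CLI access on the switch:)

Code:
# Watch the STP state of the bridge port while pinging; it should stay "forwarding"
watch -n1 'bridge link show dev bond1'
# On the Cisco side, check whether the port went into blocking or err-disabled, e.g.:
#   show spanning-tree interface <port> detail
#   show interface status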

Scenario#3 (NOT OK)
The PVE GUI doesn't allow adding a second gateway, but I manually add one to `/etc/network/interfaces`, followed by `ifreload -a`:

Code:
ip r
 default via 192.168.140.1 dev vmbr0 proto kernel onlink
 192.168.100.0/24 dev vmbr3 proto kernel scope link src 192.168.100.40
 192.168.120.0/24 dev vmbr1 proto kernel scope link src 192.168.120.40
 192.168.121.0/24 dev vmbr2 proto kernel scope link src 192.168.121.40
 192.168.140.0/24 dev vmbr0 proto kernel scope link src 192.168.140.7
Things appear fine for a minute, then everything on the network becomes inaccessible in both directions.


Other scenarios I've attempted yield more or less the same outcome (either the node can't reach anything, or the rest of the network falls over), including assigning the IP address to the bonded interface instead of the bridge. I don't understand the essential difference between doing this vs. interface -> bond -> vmbr (IP assigned here).
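
For reference, this is roughly the shape of the "IP on the bond" variant (a sketch; same slaves and mode as bond1 above, with vmbr1 removed):

Code:
auto bond1
iface bond1 inet static
    address 192.168.120.40/24
    bond-slaves eno8403np1 eno8404np1 eno8405np1 eno8406np1
    bond-miimon 100
    bond-mode balance-xor
    mtu 9000
#IP directly on the bond; no bridge, so no VM can attach to it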
 
Where are you configuring the VLANs exactly? Have you done this on the switch? Otherwise you'd need to add the respective VLAN tags in the network configuration, like so:

Code:
auto <interface>.20
iface <interface>.20 inet manual
  ....

Adding a route for 192.168.120.7/32 is not necessary - because you configured vmbr1 like this:

Code:
auto vmbr1
iface vmbr1 inet static
    address 192.168.120.40/24

So, if the network configuration succeeds, a route for 192.168.120.0/24 via vmbr1 will get automatically created.
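
You can quickly verify this once the bridge is up (using your addresses from above):

Code:
ip route show 192.168.120.0/24
# expected: 192.168.120.0/24 dev vmbr1 proto kernel scope link src 192.168.120.40
ip route get 192.168.120.7
# expected: 192.168.120.7 dev vmbr1 src 192.168.120.40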

Are you sure you will benefit from NPAR? I hadn't heard of it, but a quick Google search suggests it's essentially SR-IOV for systems that don't support SR-IOV. Am I correct in this assumption? If you're using the NIC only on the host anyway, I don't think it makes sense to split the NIC into four and then bond the parts back together; just use the NIC as-is. Also, for debugging it makes sense to simplify the setup as much as possible.

Other scenarios I've attempted yield more or less the same outcome (either the node can't reach anything, or the rest of the network falls over), including assigning the IP address to the bonded interface instead of the bridge. I don't understand the essential difference between doing this vs. interface -> bond -> vmbr (IP assigned here).

Bridges are generally for when you want to use the network connection with your VMs. So, for instance, if you only access the storage network via the NICs and no VM needs to use the network then you do not need to create a bridge on it, and just configure everything directly on the interface.

So, as summary the network configuration for one NIC would then look something like this:

Code:
auto <ifname>.120
iface <ifname>.120 inet static
  address <your_ip>/<prefix>
  mtu 9000

Do this for every VLAN and you should be good to go.
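
Concretely, for your storage VLAN this might look like the following (illustrative only; it assumes the switch port is reconfigured as a trunk carrying VLAN 120, and borrows one of your interface names):

Code:
auto eno8403np1.120
iface eno8403np1.120 inet static
    address 192.168.120.40/24
    mtu 9000
#Storage VLAN 120 on a tagged sub-interface; no bond, no bridge needed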
 
Stefan,
Thank you so much for taking the time to look into my post. My responses can be found inline below.

Where are you configuring the VLANs exactly? Have you done this done on the switch? Otherwise you'd need to add the respective VLAN Tags in the network configuration, like so:

Code:
auto <interface>.20
iface <interface>.20 inet manual
  ....
Correct. The switch-side configuration was done on Cisco NX-OS using their CLI. The links between host and switch are access-only; no trunks yet. Therefore, tagging on the PVE side is not warranted just yet.
Adding a route for 192.168.120.7/32 is not necessary - because you configured vmbr1 like this:

Code:
auto vmbr1
iface vmbr1 inet static
    address 192.168.120.40/24

So, if the network configuration succeeds, a route for 192.168.120.0/24 via vmbr1 will get automatically created.
Configuration succeeds, but the route is NOT created automatically. It does get added if I manually add a gateway entry to the PVE interface configuration, but then the default gateway changes from the one defined on vmbr0 (192.168.140.1) to vmbr1 (192.168.120.1), as confirmed by `ip route get 192.168.120.7` in Scenario#1 and the subsequent errors in Scenario#2.
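
For the next round of debugging I can run something like this after `ifreload -a` (a sketch; assumes ifupdown2's tooling, which PVE ships):

Code:
ip -br addr show vmbr1       # is the address actually assigned and the bridge UP?
ip -br link show bond1       # the bond must be up for the kernel to add the connected route
ifquery --check vmbr1        # compare running state against the configuration
journalctl -b -u networking  # any errors from ifreload?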
Are you sure you will benefit from NPAR? I haven't heard of it, but a quick google search shows that it's SR-IOV for systems that don't support SR-IOV? Am I correct in this assumption? If you're using the NIC only on the host anyway I don't think it makes sense to split the NIC into 4 and then bond them together. Just use the NIC as-is. Also, for debugging it makes sense to simplify the setup as much as possible.
Your read is indeed correct. NPAR is intended for systems which do not support SR-IOV; PVE does support SR-IOV. The benefit of NPAR is extremely fast in-hardware packet processing at the NIC layer itself (because the OS/hypervisor treats each partition as an ordinary NIC).
You are also correct that presently I am bonding all NPAR partitions of a given physical port back together, which seemingly defeats the purpose of splitting the port up at the firmware layer in the first place. My rationale is two-fold:
  1. Presently, the host (a Dell) needs a port for its lifecycle management (its LCC program), which needs a working internet connection. LCC is installed on a BOSS card and is not visible to PVE. If not for NPAR, I'd have to dedicate a big fat 25G port, or one of the two 100G ports, just for an occasional firmware-update / systems-management function. NPAR lets me present one partition as a NIC to LCC without dedicating the whole physical port to it, so PVE can still use three of the four NPAR partitions of that physical port. Enabling NPAR is a device-level feature (as evident in niccli + sliff), not a port-level feature.
  2. In the future, as I begin to add VMs to the host, VMs with administration/management roles (e.g. DNS servers and such) can be assigned a dedicated 'virtual NIC' from a port's NPAR set. Because PVE supports SR-IOV, in principle the same function could be achieved with SR-IOV as well. But due to (1) above, I can't get away without NPAR on all ports (if enabled). Also, SR-IOV prevents VM migration in a cluster, whereas no such limitation will exist with NPAR.
Bridges are generally for when you want to use the network connection with your VMs. So, for instance, if you only access the storage network via the NICs and no VM needs to use the network then you do not need to create a bridge on it, and just configure everything directly on the interface.
Question: For storage, I'm coming to PVE from ESXi / vSAN. There, I had iSCSI with multipathing set up as active/backup across two physical port pairs, each pair in its own VLAN (120 and 121 respectively). Meaning, bonding of two interfaces (virtual or physical) would be needed at minimum for NFS, and four for iSCSI multipath. The latter seems a bit difficult to set up on PVE (though I haven't yet attempted it), so I wanted to start with NFS. In such a case, will a bonded interface with an IP suffice, or is a bridge still required? E.g.
option#1:
Code:
auto bondX
iface bondX inet static
    address <ip1/mask>
    bond-slaves <int1> <int2> <int3>
    bond-miimon 100
    bond-mode active-backup
    bond-primary <int1>
    mtu 9000

or
option#2:
Code:
auto bondX
iface bondX inet manual
    bond-slaves <int1> <int2> <int3>
    bond-miimon 100
    bond-mode active-backup
    bond-primary <int1>
    mtu 9000

auto vmbrY
iface vmbrY inet static
    address <ip1/mask>
    bridge-ports bondX
    <<< no gateway here >>>

So, as summary the network configuration for one NIC would then look something like this:

Code:
auto <ifname>.120
iface <ifname>.120 inet static
  address <your_ip>/<prefix>
  mtu 9000

Do this for every VLAN and you should be good to go.
A very good point indeed; I thought about it as well. A per-interface IP is rather cumbersome. Once I get this basic connectivity sorted, my thinking was the following (see the sketch after this list):
  • convert the ports towards the switch from access to trunk
  • each NPAR virtual port can then carry its own tag, e.g. Port#1 NPAR#1 can be vlan=140, NPAR#2 vlan=120
  • bond two NPAR'ed ports belonging to different physical ports but with the same VLAN tag, for port redundancy / LACP
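
Roughly like this (a sketch of the plan only; bond names and slave choices are illustrative, and it assumes the switch ports become a trunked LACP port-channel):

Code:
auto bond120
iface bond120 inet manual
    bond-slaves eno8403np1 eno8503np2
    bond-miimon 100
    bond-mode 802.3ad
    mtu 9000
#LACP across NPAR partitions of two different physical ports

auto bond120.120
iface bond120.120 inet static
    address 192.168.120.40/24
    mtu 9000
#Storage VLAN tagged on top of the bond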

Nonetheless, I think the key lies in your statement:
So, if the network configuration succeeds, a route for 192.168.120.0/24 via vmbr1 will get automatically created.
This is the expectation in Scenario#1, but it isn't happening. Nothing abnormal in the logs that I could find/trace. What do you reckon the next step could be?

Best regards
M
 
