Dear experts,
I'm a bit puzzled and don't know how to move forward, so I'm requesting your assistance.
Desire: Use dedicated NICs for dedicated functions (e.g. 1st NIC for storage, 2nd NIC as a backup to the 1st via vPC on the switches, 3rd NIC for certain VMs, 4th NIC for certain other VMs, etc.)
Problem: Employing multiple bridges crashes the whole network (STP loop detected).
Setup:
- 6 NICs on a standalone PVE 8.2 host (with the intention of converting to a cluster later)
- Four of them are Broadcom (BCM57504) 25G. Each port has NPAR=4, which presents 4 virtual ports to the hypervisor, so PVE sees a total of 4*4=16 BRCM ports. SR-IOV is disabled.
- Two of them are NVIDIA 100G. SR-IOV is enabled, but the VF count is set to 0 in the firmware.
- VLANs involved:
- 140: Management
- 100: Server
- 120: Storage (NFS access here)
- 121: Storage backup (NFS access desired here)
- To simplify the setup:
- No VMs installed yet
- The NPAR'ed ports of each physical port are bonded together with balance-xor; each bond is then assigned to its own bridge, and each bridge has an IP
- All 6 ports are configured as access ports on the (Cisco) switch with the following configuration (where X = one of the above VLANs):
Code:
switchport mode access
switchport access vlan X
mtu 9216
no flowcontrol receive on
Sample `lshw -class net` output for BRCM NIC
Code:
*-network:1
description: Ethernet interface
product: BCM57504 NetXtreme-E RDMA Partition
vendor: Broadcom Inc. and subsidiaries
physical id: 0.1
bus info: pci@0000:19:00.1
logical name: eno8403np1
version: 11
serial: d0:8e:79:f2:14:af
width: 64 bits
clock: 33MHz
capabilities: pm vpd msix pciexpress bus_master cap_list rom ethernet physical fibre autonegotiation
configuration: autonegotiation=on broadcast=yes driver=bnxt_en driverversion=6.8.12-2-pve duplex=full firmware=229.2.62.0/pkg 22.92.07.50 latency=0 link=yes multicast=yes port=fibre slave=yes
resources: iomemory:21ff0-21fef iomemory:21ff0-21fef iomemory:21ff0-21fef irq:17 memory:21ffff0e0000-21ffff0effff memory:21fffd000000-21fffdffffff memory:21ffff170000-21ffff177fff memory:a6340000-a637ffff
NIC drivers for both types are properly loaded
Code:
lsmod | grep -e nvidia -e bnxt
nvidia_vgpu_vfio 114688 10
nvidia 54284288 3
mdev 24576 1 nvidia_vgpu_vfio
kvm 1339392 2 nvidia_vgpu_vfio,kvm_intel
vfio_pci_core 86016 2 nvidia_vgpu_vfio,vfio_pci
irqbypass 12288 3 vfio_pci_core,nvidia_vgpu_vfio,kvm
vfio 65536 4 vfio_pci_core,nvidia_vgpu_vfio,vfio_iommu_type1,vfio_pci
bnxt_en 380928 0
Code:
[ 5.637754] bnxt_en 0000:19:00.0 eth0: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0f0000, node addr d0:8e:79:f2:14:ae
[ 5.637767] bnxt_en 0000:19:00.0: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[ 5.638226] bnxt_en 0000:19:00.1 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[ 5.783770] bnxt_en 0000:19:00.1 eth1: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0e0000, node addr d0:8e:79:f2:14:af
[ 5.783786] bnxt_en 0000:19:00.1: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[ 5.784264] bnxt_en 0000:19:00.2 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[ 5.866088] Backport based on https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git 51d9270
[ 5.866090] compat.git: https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git
[ 5.921285] bnxt_en 0000:19:00.2 eth2: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0d0000, node addr d0:8e:79:f2:14:b0
[ 5.921296] bnxt_en 0000:19:00.2: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[ 5.921582] bnxt_en 0000:19:00.3 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[ 5.946404] bnxt_en 0000:19:00.3 eth3: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0c0000, node addr d0:8e:79:f2:14:b1
[ 5.946411] bnxt_en 0000:19:00.3: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[ 5.946593] bnxt_en 0000:19:00.4 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[ 5.970527] bnxt_en 0000:19:00.4 eth4: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0b0000, node addr d0:8e:79:f2:14:b2
[ 5.970533] bnxt_en 0000:19:00.4: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
[ 5.970705] bnxt_en 0000:19:00.5 (unnamed net_device) (uninitialized): Device requests max timeout of 60 seconds, may trigger hung task watchdog
[ 5.993783] bnxt_en 0000:19:00.5 eth5: Broadcom BCM57504 NetXtreme-E Ethernet Partition found at mem 21ffff0a0000, node addr d0:8e:79:f2:14:b3
[ 5.993788] bnxt_en 0000:19:00.5: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
Code:
dmesg | grep -e nvidia | head -n 20
[ 5.866088] Backport based on https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git 51d9270
[ 5.866090] compat.git: https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git
[ 9.043196] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[ 10.003012] audit: type=1400 audit(1729815717.795:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1558 comm="apparmor_parser"
[ 10.003025] audit: type=1400 audit(1729815717.795:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1558 comm="apparmor_parser"
[ 195.238507] nvidia 0000:8a:00.4: enabling device (0000 -> 0002)
[ 195.238842] nvidia 0000:8a:00.5: enabling device (0000 -> 0002)
[ 195.238987] nvidia 0000:8a:00.6: enabling device (0000 -> 0002)
[ 195.239348] nvidia 0000:8a:00.7: enabling device (0000 -> 0002)
[ 195.239475] nvidia 0000:8a:01.0: enabling device (0000 -> 0002)
[ 195.239581] nvidia 0000:8a:01.1: enabling device (0000 -> 0002)
[ 195.239768] nvidia 0000:8a:01.2: enabling device (0000 -> 0002)
[ 195.239929] nvidia 0000:8a:01.3: enabling device (0000 -> 0002)
[ 195.240082] nvidia 0000:8a:01.4: enabling device (0000 -> 0002)
[ 195.240269] nvidia 0000:8a:01.5: enabling device (0000 -> 0002)
[ 195.240384] nvidia 0000:8a:01.6: enabling device (0000 -> 0002)
[ 195.240527] nvidia 0000:8a:01.7: enabling device (0000 -> 0002)
[ 195.240652] nvidia 0000:8a:02.0: enabling device (0000 -> 0002)
[ 195.240819] nvidia 0000:8a:02.1: enabling device (0000 -> 0002)
[ 195.240941] nvidia 0000:8a:02.2: enabling device (0000 -> 0002)
/etc/network/interfaces
Code:
auto lo
iface lo inet loopback
auto eno8304np0
iface eno8304np0 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
#Embedded NIC#1 BRCM#1 (Left#4 from front) NPar#2; Switch-side VLAN 140 (Supervisor)
auto eno8305np0
iface eno8305np0 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
#Embedded NIC#1 BRCM#1 (Left#4 from front) NPar#3; Switch-side VLAN 140 (Supervisor)
auto eno8306np0
iface eno8306np0 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
#Embedded NIC#1 BRCM#1 (Left#4 from front) NPar#4; Switch-side VLAN 140 (Supervisor)
auto ens2f0np0
iface ens2f0np0 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
auto ens2f1np1
iface ens2f1np1 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
iface eno8303np0 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
auto eno8503np2
iface eno8503np2 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
auto eno8603np3
iface eno8603np3 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
auto eno8404np1
iface eno8404np1 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
auto eno8504np2
iface eno8504np2 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
auto eno8604np3
iface eno8604np3 inet manual
mtu 9000
#post-up ethtool $IFACE rx 2047 tx 2047 tx flow-control on rx flow-control on
auto eno8403np1
iface eno8403np1 inet manual
mtu 9000
auto eno8405np1
iface eno8405np1 inet manual
mtu 9000
auto eno8406np1
iface eno8406np1 inet manual
mtu 9000
auto eno8505np2
iface eno8505np2 inet manual
mtu 9000
auto eno8506np2
iface eno8506np2 inet manual
mtu 9000
auto eno8605np3
iface eno8605np3 inet manual
mtu 9000
auto eno8606np3
iface eno8606np3 inet manual
mtu 9000
auto bond0
iface bond0 inet manual
bond-slaves eno8304np0 eno8305np0 eno8306np0
bond-miimon 100
bond-mode balance-xor
bond-xmit-hash-policy layer2+3
mtu 9000
#Bond - 25G Management VLAN (Npar=3)
auto bond1
iface bond1 inet manual
bond-slaves eno8403np1 eno8404np1 eno8405np1 eno8406np1
bond-miimon 100
bond-mode balance-xor
mtu 9000
#Bond - 25G Storage (Npar=4)
auto bond2
iface bond2 inet manual
bond-slaves eno8503np2 eno8504np2 eno8505np2 eno8506np2
bond-miimon 100
bond-mode balance-xor
bond-xmit-hash-policy layer2+3
mtu 9000
#Bond - 25G Storage Backup (Npar=4)
auto bond3
iface bond3 inet manual
bond-slaves eno8603np3 eno8604np3 eno8605np3 eno8606np3
bond-miimon 100
bond-mode balance-xor
bond-xmit-hash-policy layer2+3
mtu 9000
#Bond - 25G Server (Npar=4, use for cluster in future)
auto vmbr0
iface vmbr0 inet static
address 192.168.140.7/24
gateway 192.168.140.1
bridge-ports bond0
bridge-stp off
bridge-fd 0
mtu 9000
#Bridge - Management VLAN(Npar=3, 1 not usable)
auto vmbr1
iface vmbr1 inet static
address 192.168.120.40/24
#gateway 192.168.120.1
bridge-ports bond1
bridge-stp off
bridge-fd 0
mtu 9000
#Bridge - Storage (25G Npar=4)
auto vmbr2
iface vmbr2 inet static
address 192.168.121.40/24
#gateway 192.168.121.1
bridge-ports bond2
bridge-stp off
bridge-fd 0
mtu 9000
#Bridge - Storage Backup (25G Npar=4)
auto vmbr3
iface vmbr3 inet static
address 192.168.100.40/24
#gateway 192.168.100.1
bridge-ports bond3
bridge-stp off
bridge-fd 0
mtu 9000
#Bridge - Server VLAN (non-SRIOV mobility)
iface vmbr4 inet static
address 192.168.100.41/24
bridge-ports ens2f0np0
bridge-stp off
bridge-fd 0
mtu 9000
#gateway 192.168.100.1
#Bridge - Server VLAN (SRIOV) #1
iface vmbr5 inet static
address 192.168.100.42/24
bridge-ports ens2f1np1
bridge-stp off
bridge-fd 0
mtu 9000
#gateway 192.168.100.1
#Bridge - Server VLAN (SRIOV) #2
source /etc/network/interfaces.d/*
pveversion
Code:
pve-manager/8.2.7/3e0176e6bb2ade3b (running kernel: 6.8.12-2-pve)
Scenario#1 (OK)
With only a default gateway (.140.1), life is good. But this doesn't serve the purpose, because the default gateway on vmbr0 is then always used to reach the NFS server.
Code:
ip r
default via 192.168.140.1 dev vmbr0 proto kernel onlink
192.168.140.0/24 dev vmbr0 proto kernel scope link src 192.168.140.7
ip route get 192.168.120.7
192.168.120.7 via 192.168.140.1 dev vmbr0 src 192.168.140.7 uid 0
Scenario#2 (NOT OK)
There should only be one default gateway, and the PVE GUI doesn't allow adding a second one. A gateway entry could be added manually to `/etc/network/interfaces`, but setting that aside: since the intention is for PVE to use one of the NICs (vmbr1) for its own (non-VM) purpose of NFS mounts, I attempt to add a host route to the routing table:
Code:
ip route add 192.168.120.7/32 dev vmbr1
ip r
default via 192.168.140.1 dev vmbr0 proto kernel onlink
192.168.140.0/24 dev vmbr0 proto kernel scope link src 192.168.140.7
192.168.120.7/32 dev vmbr1 proto kernel scope link
ip route get 192.168.120.7
192.168.120.7 dev vmbr1 src 192.168.120.40 uid 0
I am able to ping the NFS server for a minute, after which STP blocking occurs.
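Setting the GUI aside, I believe the same host route could also be persisted under the vmbr1 stanza in `/etc/network/interfaces` with an ifupdown2 post-up hook, along these lines (a sketch, not what I currently have configured):
Code:
auto vmbr1
iface vmbr1 inet static
address 192.168.120.40/24
bridge-ports bond1
bridge-stp off
bridge-fd 0
mtu 9000
post-up ip route add 192.168.120.7/32 dev vmbr1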
Scenario#3 (NOT OK)
The PVE GUI doesn't allow adding a second gateway, but I manually add one to `/etc/network/interfaces`, followed by `ifreload -a`.
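For illustration, the kind of change I mean is declaring a gateway on one of the other bridges, e.g. uncommenting it under vmbr1 (example values; not necessarily the exact line I used):
Code:
auto vmbr1
iface vmbr1 inet static
address 192.168.120.40/24
gateway 192.168.120.1
bridge-ports bond1
bridge-stp off
bridge-fd 0
mtu 9000
The routing table after `ifreload -a`: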
Code:
ip r
default via 192.168.140.1 dev vmbr0 proto kernel onlink
192.168.100.0/24 dev vmbr3 proto kernel scope link src 192.168.100.40
192.168.120.0/24 dev vmbr1 proto kernel scope link src 192.168.120.40
192.168.121.0/24 dev vmbr2 proto kernel scope link src 192.168.121.40
192.168.140.0/24 dev vmbr0 proto kernel scope link src 192.168.140.7
Things appear fine for a minute, and then everything on the network becomes inaccessible from / to everywhere else.
Other scenarios I've attempted yield more or less the same result (either the node can't reach anything else, or the rest of the network breaks down), including assigning the IP address to the bonded interface instead of the bridge. I don't understand the essential difference between doing this vs interface -> bond -> vmbr (IP assigned here).
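For reference, what I mean by putting the IP on the bond instead of the bridge is roughly this (a sketch of that variant using bond1, not my exact config; vmbr1 dropped and the address moved onto the bond):
Code:
auto bond1
iface bond1 inet static
address 192.168.120.40/24
bond-slaves eno8403np1 eno8404np1 eno8405np1 eno8406np1
bond-miimon 100
bond-mode balance-xor
mtu 9000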