[TUTORIAL] Broadcom NICs down after PVE 8.2 (Kernel 6.8)

jsterr

We had issues with some Broadcom NICs going down after the update to kernel 6.8.

Workaround: the NICs come back up after a "service networking restart".
FIX: Update the Broadcom firmware to the latest version and blacklist their "beautiful" InfiniBand driver (bnxt_re).

This will update ALL YOUR Broadcom network cards to their latest firmware (live, but a reboot is needed afterwards):

apt install unzip
cat << 'EOF' > bcm-nic-update.sh
#!/bin/bash
# Download and unpack the bnxtnvm firmware tool
wget https://www.thomas-krenn.com/redx/tools/mb_download.php/ct.YuuHGw/mid.y9b3b4ba2bf7ab3b8/bnxtnvm.zip
unzip bnxtnvm.zip
chmod +x bnxtnvm
# Flash every interface that bnxtnvm lists
for i in $(./bnxtnvm listdev | grep 'Device Interface Name' | awk '{print $5}')
do
./bnxtnvm -dev="$i" install -online -y
done
EOF

chmod +x bcm-nic-update.sh
./bcm-nic-update.sh
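
Before flashing, it may help to preview which interfaces the loop would touch. This is a minimal sketch, not part of the original snippet: the function name list_bnxt_ifaces is made up, and it assumes the listing uses the "Device Interface Name : <iface>" layout shown elsewhere in this thread.

```shell
#!/bin/bash
# Sketch: print the interface names found in bnxtnvm-style output read from stdin.
# list_bnxt_ifaces is a hypothetical helper name, not part of bnxtnvm.
list_bnxt_ifaces() {
    # Split on the colon separator and trim the spaces around the value.
    awk -F':' '/Device Interface Name/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)
        print $2
    }'
}

# Example (assumes ./bnxtnvm from the script above is in the current directory):
#   ./bnxtnvm listdev | list_bnxt_ifaces
```

Nothing is written to the NIC here; it only shows the names the update loop would iterate over.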

This is a snippet from the standards we use in our Thomas-Krenn PVE Ceph deployments. There's also a snippet for blacklisting the InfiniBand driver:

Code:
echo "blacklist bnxt_re" >> /etc/modprobe.d/blacklist-bnxt_re.conf
update-initramfs -u

The firmware update needs a reboot to become active!
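
One caveat with the blacklist snippet: because it appends with ">>", running it twice leaves two identical lines in the file. A minimal sketch of an idempotent variant follows; blacklist_module and its optional directory argument are hypothetical names added so it can be tried outside /etc/modprobe.d.

```shell
#!/bin/bash
# Sketch: add a "blacklist <module>" line only if it is not already present.
# The second argument is a hypothetical override for testing;
# the default is the real location, /etc/modprobe.d.
blacklist_module() {
    local module="$1"
    local conf_dir="${2:-/etc/modprobe.d}"
    local conf_file="$conf_dir/blacklist-$module.conf"

    # -x: whole-line match, -F: fixed string, -s: stay quiet if the file is missing.
    if grep -qsxF "blacklist $module" "$conf_file"; then
        echo "already blacklisted: $module"
    else
        echo "blacklist $module" >> "$conf_file"
        echo "blacklisted: $module"
    fi
}

# Example (as root), followed by the initramfs rebuild from the post:
#   blacklist_module bnxt_re && update-initramfs -u
```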
 
Hello, with a Supermicro H12SSL-NT-O motherboard, a fresh install of Proxmox 8.1 and then an update to 8.2 (kernel 6.5 to 6.8), there is no way to update the firmware, as for some reason this particular NIC model does not support the online update:
Code:
Device Interface Name       : eno1np0
MACAddress                  : 7c:c2:55:xx:xx:xx
Base MACAddress             : 7C:C2:55:XX:XX:XX
Device Serial Number        : N/A
Chip Number                 : BCM57416
Part Number                 : BCM57416
Description                 : Supermicro 10GBASE-T Ethernet Controller
PCI Vendor Id               : 14e4
PCI Device Id               : 16d8
PCI Subsys Vendor Id        : 15d9
PCI Subsys Device Id        : 16d8
PCI Device Name             : 0000:45:00.0
Adapter Rev                 : 01
Active Package version      : 226.1.107.1
Package version on NVM      : 226.1.107.1
Firmware version            : 226.0.145.0
Active NVM config version   : 0.0.0
NVM config version          : 0.0.0
HCRM Profile ID             : 1
HCRM Profile Version        : 1.0.4
Firmware Reset Counter      : 0
Error Recovery Counter      : 0
Crash Dump Timestamp        : N/A
Reboot Required             : No
Secure Boot                 : Enabled
Secure Firmware Update      : Enabled
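
For anyone comparing several servers, single fields can be pulled out of output like the above. This is only a sketch under assumptions: devinfo_field is a made-up helper, devinfo.txt is a placeholder for wherever you saved the output, and the "Key : Value" layout is taken from the listing in this post.

```shell
#!/bin/bash
# Sketch: print the value of one "Key : Value" field from device info text on stdin.
# devinfo_field is a hypothetical helper name.
devinfo_field() {
    local key="$1"
    awk -v key="$key" '
        {
            pos = index($0, ":")          # the first colon separates key and value
            if (pos == 0) next
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; exit }
        }'
}

# Example (devinfo.txt is a placeholder for the saved output):
#   devinfo_field 'Active Package version' < devinfo.txt
#   devinfo_field 'Reboot Required' < devinfo.txt
```

Splitting on the first colon only keeps values such as the MAC address or the PCI device name intact.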

There is also no way to install the official driver, as:
1. You will break a leg long before you get to the download URL for the package
2. It will not install, as it wants to install linux-headers (even with modifications applied to the playbooks and apt patched to install pve-headers when asked for linux-headers)
3. Funnily enough, it actually fails to install

So, for now, I guess people who have Broadcom devices in their system and are unable to use the solution proposed by @jsterr should just stay away from the reboot button on their servers.

Only ONE of my servers actually managed to boot back up while having
Code:
echo "blacklist bnxt_re" >> /etc/modprobe.d/blacklist-bnxt_re.conf
update-initramfs -u
applied.

Sadly, I have another PCIe NIC with a Broadcom adapter coming in (cheaped out, should've bought Intel)...
 
Seems to only apply to onboard Broadcom chips - had no big issues so far on dedicated or mezzanine (Broadcom-branded) NICs.
 
Thanks big-time for the write-up!

One thing I still want to add is the option to simply disable InfiniBand support in the NIC (through Broadcom's niccli utility) - as I expect very few users to actually use InfiniBand on those NICs (and those who do would not be happy with blocking the InfiniBand drivers ;).

I described it briefly in the 6.8 release thread:

https://forum.proxmox.com/threads/o...le-on-test-no-subscription.144557/post-652507
 
Is the bnxtnvm device_info output from an onboard or a PCIe NIC? My H12SSL-CT onboard BCM57416 does not use the bnxt_re driver, only bnxt_en. I haven't updated to 8.2 yet, but I will next week.

I tried updating the firmware with the online option and the .pkg of the PCIe variant (P210TP, BCM957416A4160C) and got an invalid package error. I suspect the onboard chip isn't meant to be flashed, since Supermicro doesn't provide any info regarding this either.

Edit: My onboard P210tep/BCM57416 works perfectly fine after the reboot. I guess this problem does not apply when bnxt_re doesn't exist.
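
Whether bnxt_re is in play on a given box can be checked from the list of loaded modules. A minimal sketch, assuming the standard /proc/modules layout where the first column is the module name; module_loaded and its optional file argument are hypothetical.

```shell
#!/bin/bash
# Sketch: report whether a kernel module is currently loaded.
# The second argument is a hypothetical override for testing;
# by default the real /proc/modules is read.
module_loaded() {
    local module="$1"
    local modules_file="${2:-/proc/modules}"
    # The first column of /proc/modules is the module name.
    awk -v m="$module" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Example:
#   module_loaded bnxt_en && echo "bnxt_en is loaded"
#   module_loaded bnxt_re || echo "bnxt_re is not loaded"
```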
 
Sadly, i have another PCIe NIC coming in with Broadcom adapter (cheaped out, should've bought Intel)...
I guess you can update the firmware on that one using Broadcom utilities.

Supermicro on-board LANs use Supermicro custom firmware, and Supermicro rarely thinks it is necessary to update it.
So rarely that we only buy boards with Intel on-board LANs to minimize the need for firmware updates. The old-school I210 never needs them.
 
Are you able to grab the firmware that it downloads and post it here? As mentioned, the Supermicro doesn't allow online updates (it used to a while ago) and the one that Supermicro just sent me was from 2021.

This page [https://support.hpe.com/connect/s/s...0c6376f0dc447bf8485dccf7a&tab=revisionHistory] has a good list of the rev history but I'm hesitant to force flash one of these because the Broadcom site lets you select Broadcom or HP when downloading various utils/firmware, implying that HP may be totally different. Broadcom doesn't have the firmware for the BCM57416 on their site unfortunately.

Disabling RoCE/RDMA via niccli combined with the driver blacklist is working for now, but I'm wondering if the latest firmware could fix this instead. The latest release on the HP site (228.1.111.0) from 04-19-24 shows "This product fixed RoCE read_context HWRM fails for CQ". Hard to say if that's actually related.
 
Success!

(disclaimer: I'm running Fedora 40 [6.8.8-300.fc40.x86_64] on a Supermicro H11SSW-NT but had the same issues as this and other posts here so I ended up here)

I did a hex dump of my original firmware and saw that at the end there was mention of "p210tep", which is very close to the p210tp firmware on Broadcom's site (direct link: https://docs.broadcom.com/docs/BCM957416A4160C_FW_229.1.123.0). Checking the file from that link also shows "p210tep" at the end, so I felt pretty comfortable in trying this:

niccli -i 1 install -rescue -force BCM957416A4160C.pkg (this will flash both NICs)
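
The identifier check described above (looking for "p210tep" in both the original firmware and the downloaded package) can be sketched without a hex editor. This is an assumption-laden sketch: file_mentions is a made-up helper, the file names in the example are placeholders, and a matching string is only a hint, not proof that a package is safe to flash.

```shell
#!/bin/bash
# Sketch: check whether a (binary) file contains a given identifier string.
# file_mentions is a hypothetical helper name.
# grep flags: -a treat binary as text, -q quiet, -i ignore case, -F literal match.
file_mentions() {
    local file="$1" marker="$2"
    grep -aqiF -- "$marker" "$file"
}

# Example (original_fw.bin is a placeholder for your own firmware dump):
#   file_mentions original_fw.bin p210tep \
#       && file_mentions BCM957416A4160C.pkg p210tep \
#       && echo "both files mention p210tep"
```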

I don't need to blacklist bnxt_re anymore, but I still do need to disable RDMA or I get this error:

UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'

This is mentioned in other places on this forum and appears to be a driver problem. I don't need RDMA anyway, so no big deal.

(disable RDMA for easy reference):

niccli -i 1 nvm -setoption support_rdma -scope 0 -value 0
niccli -i 2 nvm -setoption support_rdma -scope 0 -value 0
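
With more ports, the same command can be generated per device index and reviewed before anything is run. A small sketch that only prints the command from this post for each index; rdma_disable_cmds is a hypothetical name and the indices are whatever niccli reports on your system.

```shell
#!/bin/bash
# Sketch: print the RDMA-disable command from this post for each given niccli index.
# Nothing is executed here; rdma_disable_cmds is a hypothetical helper name.
rdma_disable_cmds() {
    local idx
    for idx in "$@"; do
        printf 'niccli -i %s nvm -setoption support_rdma -scope 0 -value 0\n' "$idx"
    done
}

# Example:
#   rdma_disable_cmds 1 2          # review the commands
#   rdma_disable_cmds 1 2 | sh     # run them (as root)
```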

All Broadcom firmware is listed here: https://www.broadcom.com/support/do...itching,+and+PHYs&pn=&pa=&po=Broadcom&dk=&pl=

lshw:

Code:
  *-network:0
       description: Ethernet interface
       product: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
       vendor: Broadcom Inc. and subsidiaries
       physical id: 0
       bus info: pci@0000:05:00.0
       logical name: eno1np0
       version: 01
       serial: 00:25:90:5f:99:ec
       size: 10Gbit/s
       capacity: 10Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pm vpd msix pciexpress bus_master cap_list rom ethernet physical tp 1000bt-fd 10000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=bnxt_en driverversion=6.8.8-300.fc40.x86_64 duplex=full firmware=229.0.141.0/pkg 229.1.123.0 latency=0 link=yes multicast=yes port=twisted pair slave=yes speed=10Gbit/s
       resources: irq:250 memory:edf10000-edf1ffff memory:ede00000-edefffff memory:edf22000-edf23fff memory:ef980000-ef9fffff
  *-network:1
       description: Ethernet interface
       product: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
       vendor: Broadcom Inc. and subsidiaries
       physical id: 0.1
       bus info: pci@0000:05:00.1
       logical name: eno2np1
       version: 01
       serial: 00:25:90:5f:99:ec
       size: 10Gbit/s
       capacity: 10Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pm vpd msix pciexpress bus_master cap_list rom ethernet physical tp 1000bt-fd 10000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=bnxt_en driverversion=6.8.8-300.fc40.x86_64 duplex=full firmware=229.0.141.0/pkg 229.1.123.0 latency=0 link=yes multicast=yes port=twisted pair slave=yes speed=10Gbit/s
       resources: irq:267 memory:edf00000-edf0ffff memory:edd00000-eddfffff memory:edf20000-edf21fff memory:ef900000-ef97ffff
 
I'm not sure why, but how in the heck do you get the niccli utility? I am having the worst time trying to install it.
 
I'm currently setting up two new servers with Supermicro H13SSL-NT motherboards, which use the BCM57416 chip.
The newest available Firmware on the Thomas Krenn download page was:
Code:
Active Package version      : 225.1.95.0
Package version on NVM      : 225.1.95.0
Firmware version            : 225.0.132.0

My experience with this version (RDMA is disabled in both cases and no running VMs):
bnxt_re blacklisted: No problems with boot/shutdown, network connection is available right after boot. BUT the network connection is unstable. After 2-24 hours the network goes down and can be fixed using "service networking restart".
bnxt_re not blacklisted: Multiple minutes of delay when booting/shutting down the server, network connection is available right after boot. The connection seems to be stable. At least it was stable on both servers for 48 hours.
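
Until a real fix exists, the "service networking restart" workaround could be automated. This is a rough sketch and rests on an assumption that may not hold for every failure mode in this thread: that the outage shows up as an operstate other than "up" in sysfs. The function names and the optional sysfs root argument are hypothetical.

```shell
#!/bin/bash
# Sketch: restart networking when an interface is no longer reported as "up".
# The second argument of link_is_up is a hypothetical override for testing;
# the kernel exposes the link state in /sys/class/net/<iface>/operstate.
link_is_up() {
    local iface="$1"
    local sysfs_root="${2:-/sys}"
    local state
    state=$(cat "$sysfs_root/class/net/$iface/operstate" 2>/dev/null) || return 1
    [ "$state" = "up" ]
}

check_and_restart() {
    local iface="$1"
    if ! link_is_up "$iface"; then
        echo "$iface is down, restarting networking"
        service networking restart
    fi
}

# Example (as root, e.g. from a cron job; eno1np0 is the name used in this thread):
#   check_and_restart eno1np0
```

If the link stays "up" while traffic stops, a ping-based check would be needed instead.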

I asked the Thomas Krenn Support Team for a newer Firmware Version, they contacted Supermicro and sent me:
Code:
Active Package version      : 226.1.107.1
Package version on NVM      : 226.1.107.1
Firmware version            : 226.0.145.0

My experience with this version (RDMA is disabled in both cases and no running VMs):
bnxt_re blacklisted: Same behaviour
bnxt_re not blacklisted: Didn't test this configuration yet
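
Since several package versions are floating around in this thread, a quick way to compare two of them can help when checking a fleet. A minimal sketch, assuming GNU coreutils (for "sort -V"); version_lt is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: succeed if version $1 is strictly older than version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example with package versions quoted in this thread:
#   version_lt 226.1.107.1 229.1.123.0 && echo "older than 229.1.123.0"
```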
 
In our weekly Wednesday-morning call, @sbohn and I went through the changelog of the Broadcom driver in the Linux kernel between kernel versions 6.5 and 6.8:
PS: we had a similar issue back with kernel 5.13, caused by the Mellanox change "net/mlx5: Refactor module EEPROM query" - see https://www.thomas-krenn.com/de/wiki/Mellanox_ConnectX-4/5/6_Bitfehler_ab_Linux_Kernel_5.13

@jsterr, what do you think?
 
@wefinet I did a test on one of our systems with a fresh install of Debian 12 and zabbly Kernels: https://github.com/zabbly/linux
Using custom kernels is new to me. So I thought this is the easiest way for me to test different kernels. But I'm ready to test something different, if that helps.

NIC Firmware version 226, RDMA disabled, bnxt_re module enabled using the following kernels:
linux-image-6.8.2-zabbly+
linux-image-6.8.4-zabbly+
linux-image-6.8.9-zabbly+
=> No functional issues with any of the kernels. Though I don't know if the connection is stable, because I didn't run the system for more than a few minutes.
Messages like "infiniband bnxt_re0: Couldn't start port" and "shift exponent 64 is too large for 64-bit type 'long unsigned int'" appear in kernel logs at boot time, but that does not seem to have any impact.

Everything the same, but RDMA enabled:
=> "infiniband bnxt_re0: Couldn't start port" and "shift exponent 64 is too large for 64-bit type 'long unsigned int'" are still appearing in kernel logs.
Shutting down the system is delayed by about 100 seconds, as the system gets stuck for a while at this stage:
[Screenshot: Screenshot_20240515_130506.png]
The only functional issue seems to be the shutdown-delay. Though I don't know if the connection is stable, because I didn't run the system for more than a few minutes.

So a newer 6.8 Kernel does not seem to help.

Then I tested linux-image-6.5.10-zabbly+ and linux-image-6.7.9-zabbly+:
The "shift exponent 64 is too large for 64-bit type 'long unsigned int'" messages are still appearing.
The "infiniband bnxt_re0: Couldn't start port" messages and shutdown delay (in case of RDMA enabled) are gone.
RDMA enabled/disabled makes no difference.
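
To repeat this comparison across kernels, the two messages can be counted in the kernel log. A minimal sketch: scan_bnxt_log is a made-up helper, and the patterns are just the message texts quoted in this post.

```shell
#!/bin/bash
# Sketch: count the two messages quoted in this post in kernel log text on stdin.
# scan_bnxt_log is a hypothetical helper name.
scan_bnxt_log() {
    # "Couldn.t" uses a dot for the apostrophe to keep the awk program single-quoted.
    awk '
        /Couldn.t start port/            { port++ }
        /shift exponent 64 is too large/ { ubsan++ }
        END {
            printf "couldnt_start_port=%d shift_out_of_bounds=%d\n", port, ubsan
        }'
}

# Example (as root):
#   dmesg | scan_bnxt_log
#   journalctl -k -b | scan_bnxt_log
```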

####

@jsterr Did you test a NIC with a newer Firmware than 226 without blacklisting the bnxt_re module?
n0xlf's post suggests that firmware 229 could fix the issue.
But I won't force-flash firmware onto my brand-new motherboards, and I don't have a dedicated card on hand that is officially compatible with this firmware.
 
Hi @msc I only tested with addon cards. Latest version for me is:
NetXtreme-E Controller #1 will be updated to firmware version v229.1.123.0

I personally always disabled the bnxt_re module (even before 6.8) because I had messages in dmesg about it, that I wanted to get removed.

Code:
root@PMX8:~# lshw -c network -businfo
Bus info          Device         Class          Description
===========================================================
pci@0000:01:00.0  enp1s0f0np0    network        BCM57416 NetXtreme-E Dual-Media
pci@0000:01:00.1  enp1s0f1np1    network        BCM57416 NetXtreme-E Dual-Media
pci@0000:41:00.0  enp65s0f0      network        I350 Gigabit Network Connection
pci@0000:41:00.1  enp65s0f1      network        I350 Gigabit Network Connection
pci@0000:a1:00.0  enp161s0f0np0  network        BCM57504 NetXtreme-E 10Gb/25Gb/4
pci@0000:a1:00.1  enp161s0f1np1  network        BCM57504 NetXtreme-E 10Gb/25Gb/4
pci@0000:a1:00.2  enp161s0f2np2  network        BCM57504 NetXtreme-E 10Gb/25Gb/4
pci@0000:a1:00.3  enp161s0f3np3  network        BCM57504 NetXtreme-E 10Gb/25Gb/4
root@PMX8:~#
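
On a mixed box like this one, the Broadcom ports can be picked out of the same listing. A minimal sketch: bcm_ifaces is a hypothetical name, and it assumes every row has a device name in the second column, as in the output above (rows without a logical name would shift the columns).

```shell
#!/bin/bash
# Sketch: print the device names of Broadcom (BCM...) network ports
# from "lshw -c network -businfo" output on stdin.
# bcm_ifaces is a hypothetical helper name.
bcm_ifaces() {
    awk '$3 == "network" && /BCM/ { print $2 }'
}

# Example:
#   lshw -c network -businfo | bcm_ifaces
```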
 
Thank you @jsterr!

I asked for bnxt_re enabled because with my onboard NICs on Firmware 226, the network connection usually breaks after a few hours of uptime, when the bnxt_re kernel module is disabled.
But if Firmware 229 works fine with bnxt_re disabled, that's fine for me, too.

I will ask the Thomas Krenn support if maybe Supermicro could provide Firmware 229 for the onboard NICs.
 
