Kernel 6.5.11 - bug with NIC HP 631FLR-SFP28

dominiaz

I have a Broadcom BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01) at PCI address 5d:00.0 in an HPE ProLiant DL380 Gen10.

On a fresh installation of Proxmox 8.1 the following error appears in the kernel log. I upgraded the kernel to 6.5.11-7-pve via apt, but the error still persists:

Code:
[    5.123860] ================================================================================
[    5.123862] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[    5.123865] shift exponent 64 is too large for 64-bit type 'long unsigned int'
[    5.123867] CPU: 60 PID: 935 Comm: (udev-worker) Tainted: P           O       6.5.11-7-pve #1
[    5.123871] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 07/20/2023
[    5.123872] Call Trace:
[    5.123875]  <TASK>
[    5.123878]  dump_stack_lvl+0x48/0x70
[    5.123892]  dump_stack+0x10/0x20
[    5.123895]  __ubsan_handle_shift_out_of_bounds+0x1ac/0x360
[    5.123904]  bnxt_qplib_alloc_init_hwq.cold+0xa9/0x104 [bnxt_re]
[    5.123925]  bnxt_qplib_create_qp+0x1b4/0x7a0 [bnxt_re]
[    5.123939]  ? bnxt_qplib_rcfw_send_message+0x3e/0x70 [bnxt_re]
[    5.123957]  bnxt_re_create_qp+0x995/0xd80 [bnxt_re]
[    5.123974]  create_qp+0x17a/0x290 [ib_core]
[    5.124015]  ? create_qp+0x17a/0x290 [ib_core]
[    5.124054]  ib_create_qp_kernel+0x3b/0xe0 [ib_core]
[    5.124093]  create_mad_qp+0x8e/0x100 [ib_core]
[    5.124137]  ? __pfx_qp_event_handler+0x10/0x10 [ib_core]
[    5.124181]  ib_mad_init_device+0x294/0x840 [ib_core]
[    5.124228]  add_client_context+0x127/0x1c0 [ib_core]
[    5.124270]  enable_device_and_get+0xe6/0x1e0 [ib_core]
[    5.124311]  ib_register_device+0x506/0x610 [ib_core]
[    5.124355]  ? __kmalloc+0x4d/0xd0
[    5.124361]  ? ib_device_set_netdev+0x160/0x1b0 [ib_core]
[    5.124403]  bnxt_re_probe+0xd7a/0x1070 [bnxt_re]
[    5.124418]  ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
[    5.124430]  auxiliary_bus_probe+0x3e/0xa0
[    5.124438]  really_probe+0x1c9/0x430
[    5.124444]  __driver_probe_device+0x8c/0x190
[    5.124448]  driver_probe_device+0x24/0xd0
[    5.124452]  __driver_attach+0x10b/0x210
[    5.124455]  ? __pfx___driver_attach+0x10/0x10
[    5.124458]  bus_for_each_dev+0x8a/0xf0
[    5.124462]  driver_attach+0x1e/0x30
[    5.124465]  bus_add_driver+0x127/0x240
[    5.124469]  driver_register+0x5e/0x130
[    5.124472]  __auxiliary_driver_register+0x73/0xf0
[    5.124476]  ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
[    5.124488]  bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
[    5.124499]  ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
[    5.124509]  do_one_initcall+0x5b/0x340
[    5.124518]  do_init_module+0x68/0x260
[    5.124528]  load_module+0x213a/0x22a0
[    5.124534]  ? security_kernel_post_read_file+0x75/0x90
[    5.124542]  init_module_from_file+0x96/0x100
[    5.124546]  ? init_module_from_file+0x96/0x100
[    5.124549]  ? mmap_region+0x698/0x9e0
[    5.124558]  idempotent_init_module+0x11c/0x2b0
[    5.124564]  __x64_sys_finit_module+0x64/0xd0
[    5.124569]  do_syscall_64+0x58/0x90
[    5.124576]  ? ksys_mmap_pgoff+0x120/0x240
[    5.124578]  ? __secure_computing+0x89/0xf0
[    5.124586]  ? exit_to_user_mode_prepare+0x39/0x190
[    5.124590]  ? syscall_exit_to_user_mode+0x37/0x60
[    5.124597]  ? do_syscall_64+0x67/0x90
[    5.124599]  ? exit_to_user_mode_prepare+0x39/0x190
[    5.124602]  ? syscall_exit_to_user_mode+0x37/0x60
[    5.124605]  ? do_syscall_64+0x67/0x90
[    5.124609]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[    5.124615] RIP: 0033:0x7ff4cca87559
[    5.124637] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 77 08 0d 00 f7 d8 64 89 01 48
[    5.124640] RSP: 002b:00007fff3b7a5bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[    5.124644] RAX: ffffffffffffffda RBX: 00005613bd360a70 RCX: 00007ff4cca87559
[    5.124646] RDX: 0000000000000000 RSI: 00007ff4ccc1aefd RDI: 000000000000000f
[    5.124647] RBP: 00007ff4ccc1aefd R08: 0000000000000000 R09: 00005613bd35f7e0
[    5.124649] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
[    5.124651] R13: 0000000000000000 R14: 00005613bd39f780 R15: 00005613bcd1aec1
[    5.124654]  </TASK>
[    5.124656] ================================================================================
[    5.125035] bnxt_en 0000:5d:00.0: QPLIB: cmdq[0xe]=0x3 status 0x3
[    5.125042] bnxt_en 0000:5d:00.0 bnxt_re0: Failed to modify HW QP
[    5.125045] infiniband bnxt_re0: Couldn't change QP1 state to INIT: -14
[    5.125048] infiniband bnxt_re0: Couldn't start port
[    5.125371] infiniband bnxt_re0: Couldn't open port 1
[    5.125654] infiniband bnxt_re0: Device registered with IB successfully
[    5.131241] ioatdma 0000:00:04.5: enabling device (0004 -> 0006)
[    5.139986] cryptd: max_cpu_qlen set to 1000
[    5.143670] ioatdma 0000:00:04.6: enabling device (0004 -> 0006)
[    5.146297] AVX2 version of gcm_enc/dec engaged.
[    5.146429] AES CTR mode by8 optimization enabled
[    5.161123] Console: switching to colour frame buffer device 128x48
[    5.176153] bnxt_en 0000:5d:00.1: QPLIB: cmdq[0xe]=0x3 status 0x3
[    5.176159] bnxt_en 0000:5d:00.1 bnxt_re1: Failed to modify HW QP
[    5.176161] infiniband bnxt_re1: Couldn't change QP1 state to INIT: -14
[    5.176164] infiniband bnxt_re1: Couldn't start port
[    5.176444] infiniband bnxt_re1: Couldn't open port 1
[    5.176632] infiniband bnxt_re1: Device registered with IB successfully

The NIC works fine, but this strange error shows up in dmesg.

I downgraded the kernel to 6.2.16-20-pve and the error disappeared, so the bug appears to be in kernel 6.5.11.
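
To compare kernels quickly, the kernel log can be scanned for a UBSAN report whose call trace passes through bnxt_re. This is only a sketch: `has_bnxt_re_ubsan` is a made-up helper name, and it assumes the log layout shown in the trace above (a `UBSAN:` header line, frames tagged `[bnxt_re]`, and a closing `</TASK>`).

```shell
#!/bin/sh
# has_bnxt_re_ubsan (hypothetical helper): reads a kernel log on stdin and
# exits 0 if a UBSAN shift-out-of-bounds report contains a [bnxt_re] frame,
# 1 otherwise.
has_bnxt_re_ubsan() {
    awk '
        /UBSAN: shift-out-of-bounds/ { in_report = 1 }   # report starts
        in_report && /\[bnxt_re\]/   { found = 1 }       # frame from bnxt_re
        in_report && /<\/TASK>/      { in_report = 0 }   # call trace ends
        END { exit (found ? 0 : 1) }
    '
}

# Typical use on a live host (needs permission to read the kernel log):
#   dmesg | has_bnxt_re_ubsan && echo "bnxt_re UBSAN report present"
```

Running it once under 6.5.11 and once under 6.2.16 should confirm whether the report is tied to the kernel version.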

Any help?
 
Solved

There is no bonding involved.

It is just the combination of kernel version, the bnxt_re module, the Broadcom firmware, and RDMA (RoCE) initialization.

I had the same problem.
My rack holds 25 Broadcom BCM57504 4x25G SFP28 PCIe network cards in total.

Most of them do not have this problem, but some do.
I figured out that the problem is caused by older firmware, which makes the kernel think that RDMA is enabled on the card.
Because of that, and only on the servers with the older firmware, the kernel module bnxt_re is loaded in addition to the standard bnxt_en.
On cards with RDMA enabled or with older firmware, this module hits the error. On servers with the new firmware it loads fine but does not enable any RDMA functions on those cards, so the bug does not trigger.
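
To find out which cards still carry the older firmware, the firmware field can be pulled out of `ethtool -i` for each interface. A rough sketch: `fw_version` is a hypothetical helper, and the only versions known from this thread are 216.4.16.8 (affected) and 218.0.219.13 (not affected), so no exact cut-off is implied.

```shell
#!/bin/sh
# fw_version (hypothetical helper): reads `ethtool -i` output on stdin and
# prints the part of the firmware-version value before the first "/".
fw_version() {
    awk -F': *' '$1 == "firmware-version" {
        split($2, parts, "/")
        print parts[1]
        exit
    }'
}

# Possible use on a host, looping over all network interfaces:
#   for dev in /sys/class/net/*; do
#       ifname=${dev##*/}
#       printf '%s %s\n' "$ifname" "$(ethtool -i "$ifname" 2>/dev/null | fw_version)"
#   done
```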

The solution is either to blacklist bnxt_re in /etc/modprobe.d/pve-blacklist.conf, or to update the firmware and factory-reset it (this was not possible under the new kernel; I had to use RHEL with a 5.14 kernel).
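
The blacklist variant could look roughly like this. `blacklist_module` is a hypothetical helper that adds the entry only once. The initramfs rebuild in the comments assumes the stock Debian tooling that Proxmox ships.

```shell
#!/bin/sh
# blacklist_module FILE MODULE (hypothetical helper):
# append "blacklist MODULE" to FILE unless that exact line already exists.
blacklist_module() {
    conf_file=$1
    module=$2
    if ! grep -qxF "blacklist $module" "$conf_file" 2>/dev/null; then
        printf 'blacklist %s\n' "$module" >> "$conf_file"
    fi
}

# On the affected host, as root:
#   blacklist_module /etc/modprobe.d/pve-blacklist.conf bnxt_re
#   update-initramfs -u -k all   # so the blacklist also applies in early boot
#   reboot
```

This only stops bnxt_re from loading automatically. The bnxt_en Ethernet driver is not affected.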

Example of old firmware causing the problem:

Code:
#lspci -v
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
        Subsystem: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (NetXtreme-E Quad-port 25G SFP28 Ethernet PCIe4.0 x16 Adapter (BCM957504-P425G))
        Flags: bus master, fast devsel, latency 0, IRQ 150
        Memory at f8f30000 (64-bit, non-prefetchable) [size=64K]
        Memory at 380b0000000 (64-bit, prefetchable) [size=1M]
        Memory at 380b0106000 (64-bit, prefetchable) [size=8K]
        Expansion ROM at f8e80000 [disabled] [size=512K]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
        Capabilities: [a0] MSI-X: Enable+ Count=74 Masked-
        Capabilities: [ac] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [13c] Device Serial Number 5c-6f-69-ff-fe-88-ec-b0
        Capabilities: [150] Power Budgeting <?>
        Capabilities: [160] Virtual Channel
        Capabilities: [180] Vendor Specific Information: ID=0000 Rev=0 Len=020 <?>
        Capabilities: [1b0] Latency Tolerance Reporting
        Capabilities: [1b8] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [230] Transaction Processing Hints
        Capabilities: [300] Secondary PCI Express
        Capabilities: [200] Precision Time Measurement
        Capabilities: [358] Physical Layer 16.0 GT/s <?>
        Capabilities: [388] Lane Margining at the Receiver <?>
        Kernel driver in use: bnxt_en
        Kernel modules: bnxt_en

#ethtool -i enp1s0f0np0
driver: bnxt_en
version: 6.5.11-8-pve
firmware-version: 216.4.16.8/pkg 216.0.333.11
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

#rmmod bnxt_re && modprobe bnxt_re
[ 4794.513390] bnxt_re: Broadcom NetXtreme-C/E RoCE Driver
[ 4794.514285] infiniband (null): Low latency framework is enabled
[ 4794.518262] ================================================================================
[ 4794.518280] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[ 4794.518290] shift exponent 64 is too large for 64-bit type 'long unsigned int'
...
errors
...
[ 4978.368844] infiniband bnxt_re0: Device registered with IB successfully
[ 4978.368869] bnxt_en 0000:01:00.0: QPLIB: cmdq[0xf]=0x8c status 0x5
[ 4978.368879] bnxt_en 0000:01:00.0 bnxt_re0: Failed to setup CC enable = 1
[ 4978.369142] infiniband (null): Low latency framework is enabled
[ 4978.373399] infiniband bnxt_re1: Device registered with IB successfully
[ 4978.373423] bnxt_en 0000:01:00.1: QPLIB: cmdq[0xf]=0x8c status 0x5
[ 4978.373432] bnxt_en 0000:01:00.1 bnxt_re1: Failed to setup CC enable = 1
[ 4978.373702] infiniband (null): Low latency framework is enabled
[ 4978.377404] infiniband bnxt_re2: Device registered with IB successfully
[ 4978.377427] bnxt_en 0000:01:00.2: QPLIB: cmdq[0xf]=0x8c status 0x5
[ 4978.377437] bnxt_en 0000:01:00.2 bnxt_re2: Failed to setup CC enable = 1
[ 4978.377713] infiniband (null): Low latency framework is enabled
[ 4978.381393] infiniband bnxt_re3: Device registered with IB successfully
[ 4978.381418] bnxt_en 0000:01:00.3: QPLIB: cmdq[0xf]=0x8c status 0x5
[ 4978.381428] bnxt_en 0000:01:00.3 bnxt_re3: Failed to setup CC enable = 1

#ibv_devices
    device                 node GUID
    ------              ----------------
    bnxt_re0            5e6f69fffe88ecb0
    bnxt_re1            5e6f69fffe88ecb1
    bnxt_re2            5e6f69fffe88ecb2
    bnxt_re3            5e6f69fffe88ecb3

Example of new firmware without any problem:

Code:
#lspci -v 
41:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
        Subsystem: Broadcom Inc. and subsidiaries NetXtreme-E P425D BCM57504 4x25G SFP28 PCIE
        Flags: bus master, fast devsel, latency 0, IRQ 54
        Memory at 28080130000 (64-bit, prefetchable) [size=64K]
        Memory at 28080000000 (64-bit, prefetchable) [size=1M]
        Memory at 28080158000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at f26c0000 [disabled] [size=256K]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
        Capabilities: [a0] MSI-X: Enable+ Count=74 Masked-
        Capabilities: [ac] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [13c] Device Serial Number 5c-6f-69-ff-fe-cf-93-00
        Capabilities: [150] Power Budgeting <?>
        Capabilities: [160] Virtual Channel
        Capabilities: [180] Vendor Specific Information: ID=0000 Rev=0 Len=020 <?>
        Capabilities: [1b0] Latency Tolerance Reporting
        Capabilities: [1b8] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [230] Transaction Processing Hints
        Capabilities: [300] Secondary PCI Express
        Capabilities: [200] Precision Time Measurement
        Capabilities: [358] Physical Layer 16.0 GT/s <?>
        Capabilities: [388] Lane Margining at the Receiver <?>
        Kernel driver in use: bnxt_en
        Kernel modules: bnxt_en

#ethtool -i enp65s0f0np0
driver: bnxt_en
version: 6.5.11-8-pve
firmware-version: 218.0.219.13/pkg 21.85.21.91
expansion-rom-version:
bus-info: 0000:41:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

#rmmod bnxt_re && modprobe bnxt_re
[66678.364522] bnxt_re: Broadcom NetXtreme-C/E RoCE Driver

#ibv_devices
    device                 node GUID
    ------              ----------------

And don't even think about using RDMA InfiniBand with Proxmox and Broadcom chipsets...
 