Proxmox 8.3.2 and Broadcom iSER connections

starnetwork

Renowned Member
Dec 8, 2009
Hi,
we are using Proxmox version 8.3.2 (we recently upgraded from 7.4.x) in an environment with Broadcom BCM57504 NetXtreme-E 10Gb/25Gb NICs on Dell servers.
We use remote iSCSI-over-iSER storage and PBS for backups. The problem is that every time a backup covers disks larger than 100 GB, it stops, hangs the server, and shows the following error: bnxt_en Failed to create HW QP
(attached as a screenshot)

We have seen the related threads, but in our case we don't want to disable iSER/RDMA support since it is in use. We have already updated the NIC firmware to v231.1.162.1, which didn't help. Any advice?

Code:
# pveversion
pve-manager/8.3.2/3e76eec21c4a14a7 (running kernel: 6.8.12-5-pve)

Code:
root@testnode1:~# ethtool -i ens1f0np0
driver: bnxt_en
version: 6.8.12-5-pve
firmware-version: 231.0.154.0/pkg 231.1.162.1
expansion-rom-version:
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
root@testnode1:~#

Kind Regards,
 

Attachments

  • image_2025_01_14T07_50_08_405Z.png
Hi!

Is there anything printed to the log before these errors occur? As the other error message states, the driver seems unable to allocate resources for queueing the data transfer, but I'm not entirely sure what error code 110 stands for here. It is normally used for ETIMEDOUT, but I couldn't verify whether that is the actual error returned.
 
Dear @dakralex,
thanks for your response!
Attached are additional error messages:
Code:
2025-01-14T08:47:05.359511+01:00 node404 kernel: [ 7657.161991] DMAR: DRHD: handling fault status reg 2
2025-01-14T08:47:05.359523+01:00 node404 kernel: [ 7657.162495] DMAR: [DMA Read NO_PASID] Request device [5e:00.3] fault addr 0xe3d7d000 [fault reason 0x06] PTE Read access is not set
2025-01-14T08:47:07.602146+01:00 node404 corosync[4529]:   [KNET  ] link: host: 3 link: 1 is down
2025-01-14T08:47:07.602509+01:00 node404 corosync[4529]:   [KNET  ] link: host: 6 link: 1 is down
2025-01-14T08:47:07.602556+01:00 node404 corosync[4529]:   [KNET  ] link: host: 4 link: 1 is down
2025-01-14T08:47:07.602602+01:00 node404 corosync[4529]:   [KNET  ] link: host: 2 link: 1 is down
2025-01-14T08:47:07.602645+01:00 node404 corosync[4529]:   [KNET  ] link: host: 1 link: 1 is down
2025-01-14T08:47:07.602710+01:00 node404 corosync[4529]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602754+01:00 node404 corosync[4529]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602797+01:00 node404 corosync[4529]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602841+01:00 node404 corosync[4529]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602886+01:00 node404 corosync[4529]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:15.347469+01:00 node404 kernel: [ 7667.149858]  connection1:0: detected conn error (1011)
2025-01-14T08:47:15.518430+01:00 node404 kernel: [ 7667.320891]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4302322256, last ping 4302327296, now 4302332416
2025-01-14T08:47:15.518438+01:00 node404 kernel: [ 7667.321871]  connection2:0: detected conn error (1022)
2025-01-14T08:47:15.518840+01:00 node404 kernel: [ 7667.322000] iser: iser_qp_event_callback: qp event QP fatal error (1)
2025-01-14T08:47:16.032699+01:00 node404 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
2025-01-14T08:47:16.032836+01:00 node404 iscsid: Kernel reported iSCSI connection 2:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
2025-01-14T08:47:17.566471+01:00 node404 kernel: [ 7669.368820] bnxt_en 0000:5e:00.3 ens2f3np3: NETDEV WATCHDOG: CPU: 2: transmit queue 17 timed out 10752 ms

Please let me know if you need anything else to investigate this issue.
 
Okay, so it probably was ETIMEDOUT after all. Could you try increasing the timeout interval for No-Op requests? This should be node.conn[0].timeo.noop_out_timeout in /etc/iscsi/iscsid.conf, depending on your configuration, e.g. set it to 15 seconds. Does the problem persist?
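For reference, a minimal sketch of what that change could look like, assuming the stock Debian/Proxmox layout of /etc/iscsi/iscsid.conf (the 15-second value is only the example figure from above; tune it to your environment):

```ini
# /etc/iscsi/iscsid.conf
# How long iscsid waits for a NOP-Out reply before declaring the
# connection dead (default is 5 seconds; raised here to 15):
node.conn[0].timeo.noop_out_timeout = 15
# How often NOP-Out pings are sent (shown at its usual default):
node.conn[0].timeo.noop_out_interval = 5
```

Note that the setting only applies to sessions established after the change, so the iSER sessions would need a re-login (or the node a reboot) for it to take effect.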
 
Okay, thanks for clarifying this!
we using Proxmox updated version: 8.3.2 (we did now upgrade from 7.4.x)
I hadn't noticed this before. Did the new version introduce this error, or did it happen before as well? If it is new, this might be a regression in the driver. What was the last working kernel version and/or firmware version?
 
Hi @dakralex,
yes, the new version introduced this error. Before the upgrade we were running 7.4.x and backups were working well.

versions:
Code:
root@node404:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-5-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-12
proxmox-kernel-6.8: 6.8.12-5
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
pve-kernel-5.15.149-1-pve: 5.15.149-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 16.2.15+ds-0+deb12u1
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

Firmware:
Code:
root@node404:~# apt list --installed | grep pve-firmware
pve-firmware/stable,now 3.14-2 all [installed]

NIC Firmware:
231.1.162.1

Kind Regards,
 
I'm guessing that this is a regression in the driver. Could you send me the last kernel version that worked correctly for you before the error appeared, i.e. the kernel package installed under 7.4.x? If in doubt, last reboot lists the kernel versions that were booted.

If you have the time, it would also be helpful to boot older kernel versions and narrow down the last version where it still works as expected and the first version where it doesn't. If it doesn't work on any kernel at all, then this is a bug on our side, and we can take a closer look.
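As a sketch of the two checks mentioned above (command names as shipped with PVE 8; the output will of course differ per system):

```
# Kernel versions that were booted, most recent first:
root@node404:~# last reboot

# Kernel images currently registered with the bootloader:
root@node404:~# proxmox-boot-tool kernel list
```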
 
Hi @dakralex,
thanks for your feedback!
I have the following boot menu options (see Screenshot 2025-01-20 140404.png):
5.15.149-1
6.8.4-3
6.8.12-5
On 5.15.149-1, backups work well.

On all newer kernels, the backup shows an error and fails; the errors are shown in the attached screenshots.

Please advise.

Kind Regards,
 

Attachments

  • Screenshot 2025-01-20 140404.png
  • Screenshot 2025-01-20 141148.png
  • Screenshot 2025-01-20 141200.png
  • Screenshot 2025-01-20 141218.png
Hi!

So if I understand correctly, kernel version 5.15.149-1 works fine, but both 6.8.4-3 and 6.8.12-5 trigger the bug? Could you also install the opt-in kernel version 6.11 and try it out there? Usually fixes are backported to both kernel series, but one could have slipped through.

Unfortunately, I don't have a Broadcom NIC to test this, and there are numerous reports here about RDMA being quite buggy on Broadcom chips... But if you could further pin down which version introduced the bug (by installing more kernel versions in between the last working and the first broken version and trying them out), I can send a more detailed bug report to the developers of the bnxt_en driver and see if they can figure out a fix or have already fixed it upstream. There are simply too many patches between 5.15.149-1 and 6.8.12-5 to tell whether one of them addresses this issue directly.
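For reference, installing the opt-in kernel would look roughly like this (assuming the PVE 8 package name proxmox-kernel-6.11; please verify it against your configured repositories first):

```
root@node404:~# apt update
root@node404:~# apt install proxmox-kernel-6.11
# reboot and pick the 6.11 entry in the boot menu to test
```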
 
I had different trouble with my Broadcom Ethernet card: a newly provisioned PVE server did not boot with a working network connection because networking.service failed; its dependency ifupdown2-pre.service had failed. But maybe the Broadcom driver is the same core issue here. @starnetwork, you did not try kernel 6.5, is that correct?

Pinning kernel 6.5.13-6-pve solved this issue with my Broadcom 57454 10GBASE-T Ethernet Adapter. Newer kernels (6.12, 6.11, 6.8) all let ifupdown2-pre.service fail with bnxt_en errors in the kernel logs.

In my case it was a weird and annoying error, because it forced me to go through the console and manually restart ifupdown2-pre.service and then networking.service, which worked once the system reached multi-user.target.
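For anyone wanting to reproduce the pinning described above, a sketch using proxmox-boot-tool (the version string is taken from this post; adjust it to whatever kernel you want to keep):

```
# make 6.5.13-6-pve the default boot entry
proxmox-boot-tool kernel pin 6.5.13-6-pve
# revert to the default ordering later
proxmox-boot-tool kernel unpin
```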
 
Hi @dakralex,
thanks for your response!
Unfortunately, the problem also exists with kernel version 6.11.

Any other suggestions?

Kind Regards,
 
Thanks for your info!

With 6.5.13-6-pve it is working well, so it is kernel related.
Any suggestions on how to fix it on newer kernels?

Kind Regards,