Proxmox 8.3.2 and Broadcom iSER connections

starnetwork

Renowned Member
Dec 8, 2009
Hi,
we are using Proxmox version 8.3.2 (we recently upgraded from 7.4.x) in an environment with Broadcom BCM57504 NetXtreme-E 10Gb/25Gb NICs on Dell servers.
We use remote iSCSI-over-iSER storage and PBS for backups. The problem is that every time a backup covers disks larger than 100 GB, it stops, hangs the server, and shows the following error: bnxt_en Failed to create HW QP
(attached as a screenshot)

We have seen the related threads, but in our case we don't want to disable iSER/RDMA support since it is in use. We have already updated the NIC firmware to v231.1.162.1, which didn't help. Any advice?

Code:
# pveversion
pve-manager/8.3.2/3e76eec21c4a14a7 (running kernel: 6.8.12-5-pve)

Code:
root@testnode1:~# ethtool -i ens1f0np0
driver: bnxt_en
version: 6.8.12-5-pve
firmware-version: 231.0.154.0/pkg 231.1.162.1
expansion-rom-version:
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
root@testnode1:~#

Kind Regards,
 

Attachments

  • image_2025_01_14T07_50_08_405Z.png
Hi!

Is there anything printed to the log before these errors occur? As the other error message states, the driver seems unable to allocate resources for queueing the data transfer, but I'm not entirely sure what error code 110 stands for here. It is normally used for ETIMEDOUT, but I couldn't verify whether that is the actual error returned.
 
Dear @dakralex,
thanks for your response!
Attached are additional error messages:
Code:
2025-01-14T08:47:05.359511+01:00 node404 kernel: [ 7657.161991] DMAR: DRHD: handling fault status reg 2
2025-01-14T08:47:05.359523+01:00 node404 kernel: [ 7657.162495] DMAR: [DMA Read NO_PASID] Request device [5e:00.3] fault addr 0xe3d7d000 [fault reason 0x06] PTE Read access is not set
2025-01-14T08:47:07.602146+01:00 node404 corosync[4529]:   [KNET  ] link: host: 3 link: 1 is down
2025-01-14T08:47:07.602509+01:00 node404 corosync[4529]:   [KNET  ] link: host: 6 link: 1 is down
2025-01-14T08:47:07.602556+01:00 node404 corosync[4529]:   [KNET  ] link: host: 4 link: 1 is down
2025-01-14T08:47:07.602602+01:00 node404 corosync[4529]:   [KNET  ] link: host: 2 link: 1 is down
2025-01-14T08:47:07.602645+01:00 node404 corosync[4529]:   [KNET  ] link: host: 1 link: 1 is down
2025-01-14T08:47:07.602710+01:00 node404 corosync[4529]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602754+01:00 node404 corosync[4529]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602797+01:00 node404 corosync[4529]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602841+01:00 node404 corosync[4529]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602886+01:00 node404 corosync[4529]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:15.347469+01:00 node404 kernel: [ 7667.149858]  connection1:0: detected conn error (1011)
2025-01-14T08:47:15.518430+01:00 node404 kernel: [ 7667.320891]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4302322256, last ping 4302327296, now 4302332416
2025-01-14T08:47:15.518438+01:00 node404 kernel: [ 7667.321871]  connection2:0: detected conn error (1022)
2025-01-14T08:47:15.518840+01:00 node404 kernel: [ 7667.322000] iser: iser_qp_event_callback: qp event QP fatal error (1)
2025-01-14T08:47:16.032699+01:00 node404 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
2025-01-14T08:47:16.032836+01:00 node404 iscsid: Kernel reported iSCSI connection 2:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
2025-01-14T08:47:17.566471+01:00 node404 kernel: [ 7669.368820] bnxt_en 0000:5e:00.3 ens2f3np3: NETDEV WATCHDOG: CPU: 2: transmit queue 17 timed out 10752 ms

Please let me know if you need anything else to investigate this issue.
 
Okay, so it probably was ETIMEDOUT after all. Could you try increasing the timeout interval for No-Op requests? This should be node.conn[0].timeo.noop_out_timeout in /etc/iscsi/iscsid.conf, depending on your configuration, e.g. set it to 15 seconds. Does the problem persist?
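For reference, a minimal sketch of what that change could look like, assuming the stock Debian/Proxmox layout of /etc/iscsi/iscsid.conf (the 15-second value is only the example figure from above; tune it to your environment):

```ini
# /etc/iscsi/iscsid.conf
# How long iscsid waits for a NOP-Out reply before declaring the
# connection dead (default is 5 seconds; raised here to 15):
node.conn[0].timeo.noop_out_timeout = 15
# How often NOP-Out pings are sent (shown at its usual default):
node.conn[0].timeo.noop_out_interval = 5
```

Note that the setting only applies to sessions established after the change, so the iSER sessions would need a re-login (or the node a reboot) for it to take effect.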
 
Okay, thanks for clarifying this!
we using Proxmox updated version: 8.3.2 (we did now upgrade from 7.4.x)
I hadn't noticed this before. Did the new version introduce this error, or did it happen before as well? If it is new, this might be a regression in the driver. What was the last working kernel version and/or firmware version?
 
Hi @dakralex,
yes, the new version introduced this error. Before the upgrade we were running 7.4.x and backups were working well.

versions:
Code:
root@node404:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-5-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-12
proxmox-kernel-6.8: 6.8.12-5
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
pve-kernel-5.15.149-1-pve: 5.15.149-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 16.2.15+ds-0+deb12u1
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

Firmware:
Code:
root@node404:~# apt list --installed | grep pve-firmware
pve-firmware/stable,now 3.14-2 all [installed]

NIC Firmware:
231.1.162.1

Kind Regards,
 
I'm guessing that this is a regression in the driver. Could you send me the last kernel version that worked correctly for you before the error appeared, i.e. the kernel package installed under 7.4.x? If in doubt, last reboot lists the kernel versions that were booted.

If you have the time, it would also be helpful to boot older kernel versions and narrow down the last version where it still works as expected and the first version where it doesn't. If it doesn't work on any kernel at all, then this is a bug on our side, and we can take a closer look.
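As a sketch of the two checks mentioned above (command names as shipped with PVE 8; the output will of course differ per system):

```
# Kernel versions that were booted, most recent first:
root@node404:~# last reboot

# Kernel images currently registered with the bootloader:
root@node404:~# proxmox-boot-tool kernel list
```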
 
Hi @dakralex,
thanks for your feedback!
I have the following boot menu options (see Screenshot 2025-01-20 140404.png):
5.15.149-1
6.8.4-3
6.8.12-5
On 5.15.149-1, backups work well.

On all newer kernels, the backup shows an error and fails; the errors are shown in the attached screenshots.

Please advise.

Kind Regards,
 

Attachments

  • Screenshot 2025-01-20 140404.png
  • Screenshot 2025-01-20 141148.png
  • Screenshot 2025-01-20 141200.png
  • Screenshot 2025-01-20 141218.png
Hi!

So if I understand correctly, kernel version 5.15.149-1 works fine, but both 6.8.4-3 and 6.8.12-5 trigger the bug? Could you also install the opt-in kernel version 6.11 and try it out there? Usually fixes are backported to both kernel series, but one could have slipped through.

Unfortunately, I don't have a Broadcom NIC to test this, and there are numerous reports here about RDMA being quite buggy on Broadcom chips... But if you could further pin down which version introduced the bug (by installing more kernel versions in between the last working and the first broken version and trying them out), I can send a more detailed bug report to the developers of the bnxt_en driver and see if they can figure out a fix or have already fixed it upstream. There are simply too many patches between 5.15.149-1 and 6.8.12-5 to tell whether one of them addresses this issue directly.
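For reference, installing the opt-in kernel would look roughly like this (assuming the PVE 8 package name proxmox-kernel-6.11; please verify it against your configured repositories first):

```
root@node404:~# apt update
root@node404:~# apt install proxmox-kernel-6.11
# reboot and pick the 6.11 entry in the boot menu to test
```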
 
I had different trouble with my Broadcom Ethernet card: a newly provisioned PVE server did not boot with a working network connection because networking.service failed; its dependency ifupdown2-pre.service had failed. But maybe the Broadcom driver is the same core issue here. @starnetwork, you did not try kernel 6.5, is that correct?

Pinning kernel 6.5.13-6-pve solved this issue with my Broadcom 57454 10GBASE-T Ethernet Adapter. Newer kernels (6.12, 6.11, 6.8) all let ifupdown2-pre.service fail with bnxt_en errors in the kernel logs.

In my case it was a weird and annoying error, because it forced me to go through the console and manually restart ifupdown2-pre.service and then networking.service, which worked once the system reached multi-user.target.
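For anyone wanting to reproduce the pinning described above, a sketch using proxmox-boot-tool (the version string is taken from this post; adjust it to whatever kernel you want to keep):

```
# make 6.5.13-6-pve the default boot entry
proxmox-boot-tool kernel pin 6.5.13-6-pve
# revert to the default ordering later
proxmox-boot-tool kernel unpin
```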
 
Hi @dakralex,
thanks for your response!
Unfortunately, the problem also exists with kernel version 6.11.

Any other suggestions?

Kind Regards,
 
Thanks for your info!

With 6.5.13-6-pve it is working well, so it is kernel related.
Any suggestions on how to fix it on newer kernels?

Kind Regards,