Proxmox 8.3.2 and Broadcom iSER connections

starnetwork

Renowned Member
Dec 8, 2009
Hi,
we are using an updated Proxmox version, 8.3.2 (we just upgraded from 7.4.x), in an environment with Broadcom BCM57504 NetXtreme-E 10Gb/25Gb NICs on Dell servers.
We use remote iSCSI-over-iSER storage and PBS for backups. The problem is that every time we back up disks > 100 GB, the backup stops, the server hangs, and the following error is shown: bnxt_en Failed to create HW QP
(attached as screenshot)

We have seen the related threads, but in our case we don't want to disable iSER/RDMA support, since it's in use. We updated the NIC firmware (now v231.1.162.1), but that didn't help. Any advice?

Code:
# pveversion
pve-manager/8.3.2/3e76eec21c4a14a7 (running kernel: 6.8.12-5-pve)

Code:
root@testnode1:~# ethtool -i ens1f0np0
driver: bnxt_en
version: 6.8.12-5-pve
firmware-version: 231.0.154.0/pkg 231.1.162.1
expansion-rom-version:
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
root@testnode1:~#

Kind Regards,
 

Attachments

  • image_2025_01_14T07_50_08_405Z.png (628.5 KB)
Hi!

Is there anything printed to the log before these errors occur? As the other error message states, the driver seems to be unable to allocate resources for queueing the data transfer, but I'm not entirely sure what the error code 110 is for. It is normally used for ETIMEDOUT, but I couldn't verify whether this is the actual error returned here.
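For reference, errno 110 on Linux does map to ETIMEDOUT; this is easy to confirm with Python's standard errno and os modules (whether the driver is actually propagating that errno here is a separate question):

```python
import errno
import os

# Look up the symbolic name and message for errno 110 on Linux
print(errno.errorcode[110])  # ETIMEDOUT
print(os.strerror(110))      # Connection timed out
```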
 
Dear @dakralex,
thanks for your response!
Here are the additional error messages:
Code:
2025-01-14T08:47:05.359511+01:00 node404 kernel: [ 7657.161991] DMAR: DRHD: handling fault status reg 2
2025-01-14T08:47:05.359523+01:00 node404 kernel: [ 7657.162495] DMAR: [DMA Read NO_PASID] Request device [5e:00.3] fault addr 0xe3d7d000 [fault reason 0x06] PTE Read access is not set
2025-01-14T08:47:07.602146+01:00 node404 corosync[4529]:   [KNET  ] link: host: 3 link: 1 is down
2025-01-14T08:47:07.602509+01:00 node404 corosync[4529]:   [KNET  ] link: host: 6 link: 1 is down
2025-01-14T08:47:07.602556+01:00 node404 corosync[4529]:   [KNET  ] link: host: 4 link: 1 is down
2025-01-14T08:47:07.602602+01:00 node404 corosync[4529]:   [KNET  ] link: host: 2 link: 1 is down
2025-01-14T08:47:07.602645+01:00 node404 corosync[4529]:   [KNET  ] link: host: 1 link: 1 is down
2025-01-14T08:47:07.602710+01:00 node404 corosync[4529]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602754+01:00 node404 corosync[4529]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602797+01:00 node404 corosync[4529]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602841+01:00 node404 corosync[4529]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:07.602886+01:00 node404 corosync[4529]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-01-14T08:47:15.347469+01:00 node404 kernel: [ 7667.149858]  connection1:0: detected conn error (1011)
2025-01-14T08:47:15.518430+01:00 node404 kernel: [ 7667.320891]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4302322256, last ping 4302327296, now 4302332416
2025-01-14T08:47:15.518438+01:00 node404 kernel: [ 7667.321871]  connection2:0: detected conn error (1022)
2025-01-14T08:47:15.518840+01:00 node404 kernel: [ 7667.322000] iser: iser_qp_event_callback: qp event QP fatal error (1)
2025-01-14T08:47:16.032699+01:00 node404 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
2025-01-14T08:47:16.032836+01:00 node404 iscsid: Kernel reported iSCSI connection 2:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
2025-01-14T08:47:17.566471+01:00 node404 kernel: [ 7669.368820] bnxt_en 0000:5e:00.3 ens2f3np3: NETDEV WATCHDOG: CPU: 2: transmit queue 17 timed out 10752 ms

Please let me know if you need anything else to investigate this issue.
 
Okay, so it probably was ETIMEDOUT after all. Could you try increasing the timeout interval for No-Op requests? Depending on your configuration, this should be node.conn[0].timeo.noop_out_timeout in /etc/iscsi/iscsid.conf, e.g. set to 15 seconds. Does the problem persist?
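A rough sketch of how that change could be applied; note that the target IQN below is a placeholder, and that already-discovered node records keep their own copy of the setting, so editing iscsid.conf alone does not update existing sessions:

```shell
# 1) Default for future discoveries, in /etc/iscsi/iscsid.conf:
#    node.conn[0].timeo.noop_out_timeout = 15

# 2) Update the already-recorded node (placeholder IQN -- substitute your own):
iscsiadm -m node -T iqn.2001-04.com.example:storage.target1 \
         -o update -n node.conn[0].timeo.noop_out_timeout -v 15

# 3) Re-login so the running session picks up the new value:
iscsiadm -m node -T iqn.2001-04.com.example:storage.target1 -u
iscsiadm -m node -T iqn.2001-04.com.example:storage.target1 -l
```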
 
