Bash Deadlock after SCSI rescan with 9.1

Oct 30, 2025
Hi,

we are currently migrating to Proxmox, slowly adding more nodes (HPE DL380 Gen10) to our cluster. The existing nodes run 9.0.10, and the nodes we are adding now run 9.1.1 (we are aware that all nodes in the cluster eventually have to be updated to the same version).
Now, on those new 9.1.1 nodes, we run into trouble when adding a new LUN over Fibre Channel. Our process requires triggering a SCSI bus rescan, since the newly presented disks don't show up without one. We trigger the rescan by running:
Bash:
for host in /sys/class/scsi_host/host*/scan; do echo "- - -" > "$host"; done
This works on the 9.0.10 nodes without any problem, but on 9.1.1 the command never finishes and locks up the shell. Only rebooting the node resolves this.
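For context, the "- - -" written to the scan attribute is three fields, "channel target lun", with "-" acting as a wildcard for each. A narrower probe can therefore target a single LUN instead of rescanning every adapter. A small sketch (the host number and LUN below are hypothetical, and the helper function is only for illustration):

```shell
#!/bin/sh
# The scan attribute takes "channel target lun"; "-" is a wildcard.
# Tiny helper to make the write explicit (sketch; names are hypothetical).
scan_scsi_host() {
    scan_file=$1   # e.g. /sys/class/scsi_host/host3/scan
    triple=$2      # e.g. "0 0 5" to probe only channel 0, target 0, LUN 5
    echo "$triple" > "$scan_file"
}
# On a real node you might run:
# scan_scsi_host /sys/class/scsi_host/host3/scan "0 0 5"
```

A targeted probe like this would at least avoid touching the adapter that hangs, at the cost of having to know where the new LUN is presented.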
Now if we run "dmesg | grep lun" we get the following information:
Bash:
[ 5677.064642] sd 2:1:0:0: lun4194304 has a LUN larger than allowed by the host adapter
Running lsscsi, we find that this ID belongs to the HPE P408i-a adapter, which holds our local RAID1; it does not belong to the HPE SN1100Q adapter we use to access the Fibre Channel LUN.
Bash:
[2:0:0:0]    enclosu HPE      Smart Adapter    6.22  -       
[2:1:0:0]    disk    HPE      LOGICAL VOLUME   6.22  /dev/sda
[2:2:0:0]    storage HPE      P408i-a SR Gen10 6.22  -
This means we can work around this issue by running a rescan which excludes this adapter (host2).
Bash:
for host in /sys/class/scsi_host/host*/scan; do
    case "$host" in */host2/*) continue ;; esac
    echo "- - -" > "$host"
done
Running the rescan without this adapter finishes without any problems.
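A less brittle variant of that workaround (a sketch, assuming the per-host proc_name sysfs attribute, which reports the driver name) would skip hosts by driver rather than by a hardcoded host number, since host numbering can change across reboots. The base directory is a parameter only so the logic can be exercised offline:

```shell
#!/bin/sh
# Rescan all SCSI hosts except those driven by smartpqi (the P408i-a),
# identified via each host's proc_name attribute instead of a fixed "host2".
rescan_non_smartpqi() {
    base=${1:-/sys/class/scsi_host}
    for host in "$base"/host*; do
        [ -r "$host/proc_name" ] || continue
        [ "$(cat "$host/proc_name")" = "smartpqi" ] && continue
        echo "- - -" > "$host/scan" || echo "rescan of $host failed" >&2
    done
}
# On a real node: rescan_non_smartpqi
```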

We have found a similar problem here: https://forum.proxmox.com/threads/i...roblem-with-qlogic-fiber-channel-cards.78797/
But the P408i-a is driven by the smartpqi module, which cannot use the ql2xmaxlun option (that is a qla2xxx module parameter); according to its manpage, smartpqi has no comparable option.
Now we are wondering how we can solve this so that we don't accidentally run into the shell lockup again.
thanks!

PS: All nodes have the same hardware and as far as we can tell the same configuration.
PPS: On the nodes running 9.0.10 this LUN also has an unusually high ID, without causing any issues:
Bash:
root@node:~# cat /sys/class/scsi_disk/2\:1\:0\:0/device/lunid
0x0000004000000000
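For what it's worth, those two odd numbers appear self-consistent: if I read the kernel's LUN conversion correctly (scsilun_to_int() treats the 8-byte LUN as four 16-bit big-endian words, with word i shifted left by 16*i bits), then the 0x0040 in the second word of 0x0000004000000000 yields exactly the lun4194304 from the dmesg line:

```shell
#!/bin/sh
# Sketch of the conversion as I understand it: word 1 of the lunid is 0x0040.
lun=$(( (0x0000 << 0) | (0x0040 << 16) | (0x0000 << 32) | (0x0000 << 48) ))
echo "$lun"   # prints 4194304, matching the dmesg message
```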
 
which kernel are you running on the 9.0 nodes, and which on the 9.1 nodes?
 
could you try (installing and) booting the 6.14 kernel on the 9.1 machine?
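In case it helps anyone following along, a sketch of how that test could look on a 9.1 node. The package and version names below are assumptions on my part; check what apt and the kernel list actually offer on your system, and note that `<version>` is a placeholder:

```shell
apt install proxmox-kernel-6.14
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin <version>   # pick the 6.14 entry from the list, then reboot
```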
 
I’ve asked one of the developers to weigh in, and our assessment aligns with @fabian’s recommendation to try a different kernel.

The symptoms point to a potential regression in the newer kernel, most likely within the vendor-provided driver for this specific FC card.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 