Just for the record, in my case, I'm using a PERC H310 (Dell T420 server) where kernel 6.8.4-2/6.8.4-3 won't boot. I had to pin the 6.5.13-5 kernel to get it to work.
So meanwhile, Martin Petersen has come up with a fix, which works fine on my machine: it boots again, all devices and RAID arrays are detected again. I have tested his patch against mainline 6.8.4, 6.8.10 and 6.9.1.
See https://lore.kernel.org/linux-scsi/20240521023040.2703884-1-martin.petersen@oracle.com/
The reason for the hang was that his original patch revealed a buggy SCSI Inquiry implementation (requesting VPD page 0) in my SAS/SATA enclosure, which had the byte order wrong in its reply. This made Martin's query function crash and the kernel hang.
In case anybody else is affected: The enclosure device in question with that buggy
behaviour is that in a Supermicro 745BTQ-R920B server casing, with SAS/SATA Backplane
"743 SAS BACKPLANE W/AMI MG9072", MG9072 being the controller chip by American Megatrends,
Inc. according to the device documentation which can be found here:
https://www.supermicro.com/de/products/chassis/4u/745/sc745btq-r920b
It would be interesting and useful if anybody who has reported here being affected by this hang would try Martin's patch and report back whether it helps, possibly with some more device information.
I am using Supermicro machines with SAS/SATA enclosures. Not identical to the one mentioned above (and a bit older) but I would be happy to try the patch. I am slightly ashamed to admit that I have been a long time Nix user but never really deployed a patch in this capacity, nor compiled a kernel from scratch. If you are able to provide me some documentation on how to apply this patch, I would be happy to give it a shot and provide a large sample population.
# apt install sg3-utils lsscsi
root@linus:~# lsscsi
[0:0:0:0] disk Adaptec linus_raid1_1tb V1.0 /dev/sda
[0:0:1:0] disk Adaptec linus_raid5_5tb V1.0 /dev/sdb
[0:1:0:0] disk Samsung SSD 860 RVM0 -
[0:1:1:0] disk Samsung SSD 860 RVM0 -
[0:1:2:0] disk Hitachi HUA722010CLA330 JP4O -
[0:1:3:0] disk Hitachi HUA722010CLA330 JP4O -
[0:1:4:0] disk Hitachi HUA722010CLA330 JP4O -
[0:1:5:0] disk Hitachi HUA722010CLA330 JP4O -
[0:1:6:0] disk Hitachi HUA722010CLA330 JP4O -
[0:1:7:0] disk Hitachi HUA722010CLA330 JP4O -
[0:3:0:0] enclosu ADAPTEC Virtual SGPIO 0 0001 -
[0:3:1:0] enclosu ADAPTEC Virtual SGPIO 1 0001 -
[1:0:0:0] disk ATA M4-CT256M4SSD2 0309 /dev/sdc
[2:0:0:0] disk ATA M4-CT256M4SSD2 0309 /dev/sdf
[3:0:0:0] cd/dvd ATAPI iHAS124 W HL0F /dev/sr1
[8:0:0:0] cd/dvd AMI Virtual CDROM0 1.00 /dev/sr0
[9:0:0:0] disk AMI Virtual Floppy0 1.00 /dev/sdd
[10:0:0:0] disk AMI Virtual HDISK0 1.00 /dev/sde
root@linus:~# dmesg | grep "Attached"
[ 64.715990] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 64.716179] sd 0:0:1:0: Attached scsi generic sg1 type 0
[ 64.716379] scsi 0:1:0:0: Attached scsi generic sg2 type 0
[ 64.716592] scsi 0:1:1:0: Attached scsi generic sg3 type 0
[ 64.716748] scsi 0:1:2:0: Attached scsi generic sg4 type 0
[ 64.716909] scsi 0:1:3:0: Attached scsi generic sg5 type 0
[ 64.717062] scsi 0:1:4:0: Attached scsi generic sg6 type 0
[ 64.717217] scsi 0:1:5:0: Attached scsi generic sg7 type 0
[ 64.717341] scsi 0:1:6:0: Attached scsi generic sg8 type 0
[ 64.717447] scsi 0:1:7:0: Attached scsi generic sg9 type 0
[ 64.717567] scsi 0:3:0:0: Attached scsi generic sg10 type 13
[ 64.717679] scsi 0:3:1:0: Attached scsi generic sg11 type 13
[ 64.718159] sd 1:0:0:0: Attached scsi generic sg12 type 0
[ 64.719229] sd 0:0:0:0: [sda] Attached SCSI removable disk
[ 64.719420] sd 1:0:0:0: [sdc] Attached SCSI disk
[ 64.722551] sr 8:0:0:0: Attached scsi CD-ROM sr0
[ 64.722637] sr 8:0:0:0: Attached scsi generic sg13 type 5
[ 64.722808] sd 9:0:0:0: Attached scsi generic sg14 type 0
[ 64.722950] sd 10:0:0:0: Attached scsi generic sg15 type 0
[ 64.723342] sd 2:0:0:0: Attached scsi generic sg16 type 0
[ 64.724189] sd 9:0:0:0: [sdd] Attached SCSI removable disk
[ 64.724394] sd 10:0:0:0: [sde] Attached SCSI removable disk
[ 64.725483] sd 2:0:0:0: [sdf] Attached SCSI disk
[ 64.733786] sd 0:0:1:0: [sdb] Attached SCSI removable disk
[ 64.841321] sr 3:0:0:0: Attached scsi CD-ROM sr1
[ 64.841417] sr 3:0:0:0: Attached scsi generic sg17 type 5
[ 64.849620] ses 0:3:0:0: Attached Enclosure device
[ 64.849638] ses 0:3:1:0: Attached Enclosure device
root@linus:~#
root@linus:~# sg_vpd --all /dev/sg10
Supported VPD pages VPD page:
fetching VPD page failed: Numerical argument out of domain
sg_vpd failed: Numerical argument out of domain
root@linus:~# sg_vpd --all -HHHH /dev/sg10
0d 00 02 00 00 83
fetching VPD page failed: Numerical argument out of domain
sg_vpd failed: Numerical argument out of domain
Peter,
> root@linus:~# sg_vpd --all /dev/sg10
> Supported VPD pages VPD page:
>
> fetching VPD page failed: Numerical argument out of domain
> sg_vpd failed: Numerical argument out of domain
See? The Linux kernel is not the only thing having problems. sg_vpd also
fails because the device is reporting utter nonsense.
> root@linus:~# sg_vpd --all -HHHH /dev/sg10
> 0d 00 02 00 00 83
- The first byte, 0d, means the device is an "SES enclosure".
- Second byte, 00, is the VPD page number. In this case page 00 which
lists all the VPD pages supported by the device.
- The last two bytes "00 83" are the individual VPD pages supported by
the device. In this case the Supported VPD page, 00, and the Device
Identification VPD page, 83, which contains the unique identifier for
the device.
The problem is the two middle bytes. "02 00". That's supposed to be "00
02" since the number of supported pages is 2 -- page 00 and page 83.
However, device has byte ordering swapped and reports "02 00", i.e. 512
pages instead of 2. This causes us to think the page is 516 bytes long
(512 bytes as reported by the device + the 4 byte VPD header).
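To make the byte-order point concrete, here is a small shell sketch (not part of Martin's mail) that decodes the four header bytes of the response shown above. The function name decode_vpd_header is made up for illustration. It reads the page length as a 16-bit big-endian value, as described above, and also prints what the length would be with the two bytes swapped.

```shell
# Decode the 4-byte header of a "Supported VPD pages" (page 00) response.
# Argument: space-separated hex bytes, as printed by `sg_vpd -HHHH`.
# decode_vpd_header is a hypothetical helper name, not an sg3-utils tool.
decode_vpd_header() (
    # Split the byte string into positional parameters $1..$4 (header bytes).
    set -- $1
    dev_type=$(( 0x$1 & 0x1f ))          # low 5 bits: peripheral device type
    page=$(( 0x$2 ))                     # VPD page number
    reported=$(( (0x$3 << 8) | 0x$4 ))   # length read as big-endian
    swapped=$(( (0x$4 << 8) | 0x$3 ))    # length if the two bytes are swapped
    total=$(( reported + 4 ))            # reported length plus 4-byte header
    printf 'type=%d page=%d reported_len=%d swapped_len=%d total=%d\n' \
        "$dev_type" "$page" "$reported" "$swapped" "$total"
)

decode_vpd_header "0d 00 02 00 00 83"
# prints: type=13 page=0 reported_len=512 swapped_len=2 total=516
```

The swapped value of 2 matches the two page codes (00 and 83) that actually follow the header, which is what points to a byte-order mistake in the enclosure firmware.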
Then let's see what your enclosure devices are:
root@PROXMOX-PVE-02:~# lsscsi
[4:0:0:0] disk Adaptec OS_RAID_1 V1.0 /dev/sda
[4:1:0:0] disk INTEL SSDSA2CW12 4PC1 -
[4:1:1:0] disk INTEL SSDSA2CW12 4PC1 -
[4:1:2:0] disk MB3000GCWDB HPGH /dev/sdb
[4:1:3:0] disk MB3000GCWDB HPGH /dev/sdc
[4:3:0:0] enclosu ADAPTEC Virtual SGPIO 0 0001 -
As you see, the enclosure ports are not exposed as block devices: they have no device name such as /dev/sd*, because they are only accessed as block devices through the driver of the controller card to which the enclosure is attached. But they are exposed as SCSI generic character devices, and here's how you would identify them with dmesg:
Gotcha. I believe my two drives (INTEL) are being accessed as block devices via my Adaptec RAID controller.
root@PROXMOX-PVE-02:~# dmesg | grep "Attached"
[ 33.884708] sd 4:0:0:0: Attached scsi generic sg0 type 0
[ 33.885039] scsi 4:1:0:0: Attached scsi generic sg1 type 0
[ 33.885276] scsi 4:1:1:0: Attached scsi generic sg2 type 0
[ 33.885508] sd 4:1:2:0: Attached scsi generic sg3 type 0
[ 33.885731] sd 4:1:3:0: Attached scsi generic sg4 type 0
[ 33.885891] scsi 4:3:0:0: Attached scsi generic sg5 type 13
[ 33.916035] sd 4:0:0:0: [sda] Attached SCSI removable disk
[ 34.010099] sd 4:1:3:0: [sdc] Attached SCSI removable disk
[ 34.010245] sd 4:1:2:0: [sdb] Attached SCSI removable disk
[ 34.021849] ses 4:3:0:0: Attached Enclosure device
root@PROXMOX-PVE-02:~# sg_vpd --all /dev/sg5
Supported VPD pages VPD page:
fetching VPD page failed: Numerical argument out of domain
sg_vpd failed: Numerical argument out of domain
root@PROXMOX-PVE-02:~# sg_vpd --all -HHHH /dev/sg5
0d 00 02 00 00 83
fetching VPD page failed: Numerical argument out of domain
sg_vpd failed: Numerical argument out of domain
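For anyone who wants to find the enclosure's sg node without reading the whole dmesg output by eye: in the logs above, the enclosure devices are the ones attached as SCSI generic "type 13" (0x0d, the same device type that shows up as the first byte of the VPD response). A rough sketch follows. The helper name list_enclosure_sg is made up, and it only pattern-matches log lines in the format shown in this thread.

```shell
# Print /dev/sgN for every "Attached scsi generic sgN type 13" line
# (type 13 = enclosure services device) in the log text read from stdin.
# list_enclosure_sg is a hypothetical helper name.
list_enclosure_sg() {
    sed -n 's|.*Attached scsi generic \(sg[0-9][0-9]*\) type 13 *$|/dev/\1|p'
}

# Example with two lines taken from the output above.
list_enclosure_sg <<'EOF'
[ 33.885731] sd 4:1:3:0: Attached scsi generic sg4 type 0
[ 33.885891] scsi 4:3:0:0: Attached scsi generic sg5 type 13
EOF
# prints: /dev/sg5
```

On a live system the same function can be fed from the kernel log, e.g. `dmesg | list_enclosure_sg`, and the resulting device passed to sg_vpd as shown above.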
# apt install git build-essential bc kmod cpio flex libncurses5-dev libelf-dev libssl-dev dwarves bison
# mkdir -p /usr/src; cd /usr/src
# git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
# cd linux
I decided to build/test this patch against 6.8.4 first, because that was the PVE kernel version that was failing for my machine.
# git checkout v6.8.4
Then I saved Martin's patch into a file named scsi_init_AMI_MG9072_device_quirk_mkpetersen_2024-05-21-003.patch
# cat scsi_init_AMI_MG9072_device_quirk_mkpetersen_2024-05-21-003.patch
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 3e0c0381277a..f0464db3f9de 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -350,6 +350,13 @@ static int scsi_get_vpd_size(struct scsi_device *sdev, u8 page)
 		if (result < SCSI_VPD_HEADER_SIZE)
 			return 0;
 
+		if (result > sizeof(vpd)) {
+			dev_warn_once(&sdev->sdev_gendev,
+				      "%s: long VPD page 0 length: %d bytes\n",
+				      __func__, result);
+			result = sizeof(vpd);
+		}
+
 		result -= SCSI_VPD_HEADER_SIZE;
 		if (!memchr(&vpd[SCSI_VPD_HEADER_SIZE], page, result))
 			return 0;
Now we apply this patch:
root@linus:/usr/src/linux# git apply scsi_init_AMI_MG9072_device_quirk_mkpetersen_2024-05-21-003.patch -v
Checking patch drivers/scsi/scsi.c...
Applied patch drivers/scsi/scsi.c cleanly.
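If you are wondering what the patch actually changes: it clamps the length reported by the device to the size of the kernel's local buffer before the page list is searched. Below is a rough shell illustration of that arithmetic only, not the kernel code. BUF_SIZE stands in for sizeof(vpd), and the value 36 is my assumption for illustration. The function name usable_list_len is made up.

```shell
# Illustration of the bounds check added to scsi_get_vpd_size().
BUF_SIZE=36      # assumed stand-in for sizeof(vpd), for illustration only
HEADER_SIZE=4    # SCSI_VPD_HEADER_SIZE, the 4-byte VPD header

# usable_list_len RESULT
# Prints how many page-list bytes would be searched for a given
# reported page-0 length (header included).
usable_list_len() {
    if [ "$1" -lt "$HEADER_SIZE" ]; then
        # Too short to contain a header: nothing to search.
        echo 0
    elif [ "$1" -gt "$BUF_SIZE" ]; then
        # The new check: warn and clamp to the buffer size.
        echo "long VPD page 0 length: $1 bytes" >&2
        echo $(( BUF_SIZE - HEADER_SIZE ))
    else
        echo $(( $1 - HEADER_SIZE ))
    fi
}

usable_list_len 6      # a well-behaved device reporting two pages
usable_list_len 516    # the value reported by the buggy enclosure
```

With these assumed sizes the first call prints 2 and the second prints 32 plus the warning on stderr, so the search can no longer run past the end of the buffer.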
Then we need a kernel config file. I used that of PVE 6.5.13-5-pve which I last booted into successfully:
root@linus:/usr/src/linux# cp /boot/config-6.5.13-5-pve .config
root@linus:/usr/src/linux# make olddefconfig
HOSTCC scripts/basic/fixdep
HOSTCC scripts/kconfig/conf.o
HOSTCC scripts/kconfig/confdata.o
HOSTCC scripts/kconfig/expr.o
LEX scripts/kconfig/lexer.lex.c
YACC scripts/kconfig/parser.tab.[ch]
HOSTCC scripts/kconfig/lexer.lex.o
HOSTCC scripts/kconfig/menu.o
HOSTCC scripts/kconfig/parser.tab.o
HOSTCC scripts/kconfig/preprocess.o
HOSTCC scripts/kconfig/symbol.o
HOSTCC scripts/kconfig/util.o
HOSTLD scripts/kconfig/conf
.config:10571:warning: symbol value 'm' invalid for ANDROID_BINDER_IPC
.config:10572:warning: symbol value 'm' invalid for ANDROID_BINDERFS
.config:10797:warning: symbol value 'm' invalid for FSCACHE
#
# configuration written to .config
#
root@linus:/usr/src/linux#
You can ignore those warnings.
Now we have to do 3 things:
# make -j 48
# make -j 48 modules_install
# make install
48 is my number of CPUs; use your own CPU count to fully utilize your machine. On my box, this takes around 25 minutes.
The last step will install the kernel, generate an initrd.img and add the new kernel to the GRUB config.
It will probably be named something like "6.8.4-dirty" by the kernel build system, because we did not git commit our patch.
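Rather than hard-coding the job count, you can let the shell work it out. A small sketch, assuming either nproc (GNU coreutils) or getconf is available; the variable name njobs is arbitrary.

```shell
# Determine the number of online CPUs to use as the make job count.
# Falls back to getconf if nproc is not installed.
njobs=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Show the build commands that would be run with this job count.
echo "make -j $njobs"
echo "make -j $njobs modules_install"
echo "make install"
```

This only prints the commands. Drop the echo to run them from the kernel source directory.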
Now you can boot test this kernel. If the patch helps, you'll find something like this in the dmesg output:
scsi 0:3:0:0: scsi_get_vpd_size: long VPD page 0 length: 516 bytes
This is the dev_warn_once that Martin added as a diagnostic.
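To check for that message without scrolling through the whole log, something like the following should do. The function name has_vpd_warning is made up, and it simply looks for the message text quoted above.

```shell
# Succeeds (exit status 0) if the log text on stdin contains the
# diagnostic printed by the patched scsi_get_vpd_size().
# has_vpd_warning is a hypothetical helper name.
has_vpd_warning() {
    grep -q 'scsi_get_vpd_size: long VPD page 0 length'
}

# Example using the line quoted above.
sample='scsi 0:3:0:0: scsi_get_vpd_size: long VPD page 0 length: 516 bytes'
if printf '%s\n' "$sample" | has_vpd_warning; then
    echo "quirk detected, patched code path was taken"
fi
```

On the freshly booted test kernel you would run `dmesg | has_vpd_warning` instead of using the sample line.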
Let us know whether all this works for you, too.
If you want to build again, against another kernel version, I would first commit this, and also assign a tag, so you can return to this version later if needed:
# git add .
# git commit -m "Applied SCSI init patch against 6.8.4"
# git tag "v6.8.4_with_scsi_init_AMI_MG9072_device_quirk_mkpetersen_2024-05-21-003.patch"
Then
# make clean
# make distclean
and you can checkout a different kernel version tag to build against. Remember to reapply the patch!
I do use ZFS for VM storage, but not for booting. Again, this is just my homelab, so I could migrate all of the VMs to my other nodes while I mess around with trying this patch out. But if the juice is not worth the squeeze then I can just stop. If Martin wants more people to test the patch, I am happy to do it, assuming I would still be able to get the ZFS module back in the future.
You could just boot test your compiled kernel to see if the problem is gone (maybe just disable all VM autostarting for that, and re-enable afterwards).
That was my plan! I migrated all of the VMs off to other nodes in the cluster and disabled auto-start.
Next, my Debian Proxmox host did require me to install some more of the utilities for kernel compiling, but that was fairly straightforward. I was able to follow that guide to create the directory in /usr/src/ and then clone the kernel down into that directory. From there I checked out kernel version 6.5.13-3-pve, since that was the last kernel version that worked for me and is working on my other nodes. I for sure messed this up a few times and had to restart the process. My nodes are super old and not good for kernel compilation. But I was able to apply the patch (I think) without issue.
I initially tried running this from the web UI console, but I think my connection kept timing out. It appears to have gotten "stuck" on /net/tls/tls.o, and I am not sure why. So I tried initiating it from an SSH connection directly. Then I ran out of local disk space on my drive since I had a bunch of old attempts left over. I only have 30GB of space left on my boot drives. So I cleared out the directory and restarted. I'll post another update once it's completed.
Applying the patch to 6.5.13 makes no sense. The code that triggers this enclosure implementation bug and thus the hang has been introduced in the 6.8.x development cycle. So the earliest version to which you might want to apply this patch, to test whether it resolves the issue, is 6.8.4.
I mean 6.5.13 boots fine, anyway, right? So what do you want to test?
I think that was just a misunderstanding on my part. I was attempting to follow your instructions when you had mentioned using the most recent kernel that had worked. So I just copied the kernel version from one of my nodes that is working. I will try it again using the 6.8.4 kernel.
cp /boot/config-6.5.13-5-pve .config
root@linus:/usr/src/linux# make olddefconfig
Can you explain the process to get this fix? I'm new to these things and I'm stuck at the initramfs command prompt.
The fix has meanwhile been incorporated into STABLE 6.1.94-rc1, 6.6.34-rc1 and 6.9.5-rc1 by stable kernel maintainer Greg Kroah-Hartman. I have built and tested them all, and they work fine on my machine.
Also, it has been in Linus' tree for a while now (since 6.10-rc2), so it will be in the upcoming MAINLINE 6.10, too.
Moreover, the Proxmox kernel team has picked this patch up for their 6.8.x-pve kernel, so it will most likely be in their next kernel bugfix release. Unfortunately, due to timing issues, it will not be in Greg's official STABLE 6.8 series, because 6.8 has been EOL since 6.8.12, which does not contain the fix. So it will probably not be picked up by Ubuntu right now either (maybe later, when they select patches to backport to their 6.8.x LTS kernel), but as Proxmox already has it, it's not important what Ubuntu does.
For me it was a nice and pleasant learning experience to write a bug report (actually I wrote two, the other for an unrelated issue that was fixed, too) and to work with the kernel folks to help them debug and fix this issue. I have also started testing -rc kernel releases and reporting back about success/failure/regressions more regularly, and I think it's a nice way of giving back to the Linux kernel project, even when you are not a developer.
Fantastic work @pschneider1968