PVE 8.2 / Kernel 6.8.4-2 does not boot - cannot find root device

Just for the record, in my case I'm using a PERC H310 (Dell T420 server) where kernel 6.8.4-2/6.8.4-3 won't boot. I had to pin the 6.5.13-5 kernel to get it working again.
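For anyone else who needs to do the same, pinning can be done with proxmox-boot-tool; this is a rough sketch assuming a reasonably recent PVE where the kernel pin subcommand is available:

Code:
# list the installed kernels, then pin the known-good one as the boot default
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.5.13-5-pve
# undo later with: proxmox-boot-tool kernel unpin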
 
So meanwhile, Martin Petersen has come up with a fix, which works fine on my machine: it boots again, all devices and RAID arrays are detected again. I have tested his patch against mainline 6.8.4, 6.8.10 and 6.9.1.

See https://lore.kernel.org/linux-scsi/20240521023040.2703884-1-martin.petersen@oracle.com/

The reason for the hang was that his original patch revealed a buggy SCSI INQUIRY implementation (requesting VPD page 0) in my SAS/SATA enclosure, which had the byte order wrong in its reply. This made Martin's query function crash and the kernel hang.

In case anybody else is affected: the enclosure device in question with that buggy behaviour is the one in a Supermicro 745BTQ-R920B server chassis, with the SAS/SATA backplane "743 SAS BACKPLANE W/AMI MG9072", MG9072 being the controller chip by American Megatrends, Inc. according to the device documentation, which can be found here:

https://www.supermicro.com/de/products/chassis/4u/745/sc745btq-r920b

It would be interesting and useful if anybody who reported here that they are also affected by this hang would try Martin's patch and report back whether it helps or not, possibly with some more device information.
 
I am using Supermicro machines with SAS/SATA enclosures. Not identical to the one mentioned above (and a bit older) but I would be happy to try the patch. I am slightly ashamed to admit that I have been a long time Nix user but never really deployed a patch in this capacity, nor compiled a kernel from scratch. If you are able to provide me some documentation on how to apply this patch, I would be happy to give it a shot and provide a large sample population.
 

Perhaps it would make sense to first check whether you are really affected by the exact same enclosure SCSI protocol implementation bug. To do that:

Code:
# apt install sg3-utils lsscsi

Then let's see what your enclosure devices are:

Code:
root@linus:~# lsscsi
[0:0:0:0]    disk    Adaptec  linus_raid1_1tb  V1.0  /dev/sda
[0:0:1:0]    disk    Adaptec  linus_raid5_5tb  V1.0  /dev/sdb
[0:1:0:0]    disk             Samsung SSD 860  RVM0  -
[0:1:1:0]    disk             Samsung SSD 860  RVM0  -
[0:1:2:0]    disk    Hitachi  HUA722010CLA330  JP4O  -
[0:1:3:0]    disk    Hitachi  HUA722010CLA330  JP4O  -
[0:1:4:0]    disk    Hitachi  HUA722010CLA330  JP4O  -
[0:1:5:0]    disk    Hitachi  HUA722010CLA330  JP4O  -
[0:1:6:0]    disk    Hitachi  HUA722010CLA330  JP4O  -
[0:1:7:0]    disk    Hitachi  HUA722010CLA330  JP4O  -
[0:3:0:0]    enclosu ADAPTEC  Virtual SGPIO  0 0001  -
[0:3:1:0]    enclosu ADAPTEC  Virtual SGPIO  1 0001  -
[1:0:0:0]    disk    ATA      M4-CT256M4SSD2   0309  /dev/sdc
[2:0:0:0]    disk    ATA      M4-CT256M4SSD2   0309  /dev/sdf
[3:0:0:0]    cd/dvd  ATAPI    iHAS124   W      HL0F  /dev/sr1
[8:0:0:0]    cd/dvd  AMI      Virtual CDROM0   1.00  /dev/sr0
[9:0:0:0]    disk    AMI      Virtual Floppy0  1.00  /dev/sdd
[10:0:0:0]   disk    AMI      Virtual HDISK0   1.00  /dev/sde

As you can see, the enclosure devices are not exposed as block devices; they have no /dev/sd* device name, because they are only accessed through the driver of the controller card to which the enclosure is attached. They are, however, exposed as SCSI generic character devices, and here is how you would identify them:

Code:
root@linus:~# dmesg | grep "Attached"
[   64.715990] sd 0:0:0:0: Attached scsi generic sg0 type 0
[   64.716179] sd 0:0:1:0: Attached scsi generic sg1 type 0
[   64.716379] scsi 0:1:0:0: Attached scsi generic sg2 type 0
[   64.716592] scsi 0:1:1:0: Attached scsi generic sg3 type 0
[   64.716748] scsi 0:1:2:0: Attached scsi generic sg4 type 0
[   64.716909] scsi 0:1:3:0: Attached scsi generic sg5 type 0
[   64.717062] scsi 0:1:4:0: Attached scsi generic sg6 type 0
[   64.717217] scsi 0:1:5:0: Attached scsi generic sg7 type 0
[   64.717341] scsi 0:1:6:0: Attached scsi generic sg8 type 0
[   64.717447] scsi 0:1:7:0: Attached scsi generic sg9 type 0
[   64.717567] scsi 0:3:0:0: Attached scsi generic sg10 type 13
[   64.717679] scsi 0:3:1:0: Attached scsi generic sg11 type 13
[   64.718159] sd 1:0:0:0: Attached scsi generic sg12 type 0
[   64.719229] sd 0:0:0:0: [sda] Attached SCSI removable disk
[   64.719420] sd 1:0:0:0: [sdc] Attached SCSI disk
[   64.722551] sr 8:0:0:0: Attached scsi CD-ROM sr0
[   64.722637] sr 8:0:0:0: Attached scsi generic sg13 type 5
[   64.722808] sd 9:0:0:0: Attached scsi generic sg14 type 0
[   64.722950] sd 10:0:0:0: Attached scsi generic sg15 type 0
[   64.723342] sd 2:0:0:0: Attached scsi generic sg16 type 0
[   64.724189] sd 9:0:0:0: [sdd] Attached SCSI removable disk
[   64.724394] sd 10:0:0:0: [sde] Attached SCSI removable disk
[   64.725483] sd 2:0:0:0: [sdf] Attached SCSI disk
[   64.733786] sd 0:0:1:0: [sdb] Attached SCSI removable disk
[   64.841321] sr 3:0:0:0: Attached scsi CD-ROM sr1
[   64.841417] sr 3:0:0:0: Attached scsi generic sg17 type 5
[   64.849620] ses 0:3:0:0: Attached Enclosure device
[   64.849638] ses 0:3:1:0: Attached Enclosure device
root@linus:~#

So type 13 (0x0d) is the enclosure device type, and on my machine they are at SCSI H:C:T:L (host/channel/target/LUN) addresses 0:3:0:0 and 0:3:1:0. Further up you can see that they have the device nodes /dev/sg10 and /dev/sg11. That's how you can query them with sg_vpd (from the sg3-utils package).
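As an aside, lsscsi can also print the matching SCSI generic node directly (the -g option, if your build supports it), which saves the trip through dmesg:

Code:
# adds a column with the corresponding /dev/sg* node for each device
lsscsi -g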

If you get the same error messages for the two queries below, you are affected, and Martin's patch will help.

Code:
root@linus:~# sg_vpd --all /dev/sg10
Supported VPD pages VPD page:

fetching VPD page failed: Numerical argument out of domain
sg_vpd failed: Numerical argument out of domain

root@linus:~# sg_vpd --all -HHHH /dev/sg10
0d 00 02 00 00 83

fetching VPD page failed: Numerical argument out of domain
sg_vpd failed: Numerical argument out of domain

Martin's explanation was this:
Code:
Peter,

> root@linus:~# sg_vpd --all /dev/sg10
> Supported VPD pages VPD page:
>
> fetching VPD page failed: Numerical argument out of domain
> sg_vpd failed: Numerical argument out of domain

See? The Linux kernel is not the only thing having problems. sg_vpd also
fails because the device is reporting utter nonsense.

> root@linus:~# sg_vpd --all -HHHH /dev/sg10
> 0d 00 02 00 00 83

 - The first byte, 0d, means the device is an "SES enclosure".

 - Second byte, 00, is the VPD page number. In this case page 00 which
   lists all the VPD pages supported by the device.

 - The last two bytes "00 83" are the individual VPD pages supported by
   the device. In this case the Supported VPD page, 00, and the Device
   Identification VPD page, 83, which contains the unique identifier for
   the device.

The problem is the two middle bytes. "02 00". That's supposed to be "00
02" since the number of supported pages is 2 -- page 00 and page 83.

However, the device has the byte ordering swapped and reports "02 00", i.e. 512
pages instead of 2. This causes us to think the page is 516 bytes long
(512 bytes as reported by the device + the 4 byte VPD header).
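To make the arithmetic concrete, here is the same interpretation done in a shell (reading the two length bytes big-endian):

Code:
echo $((0x0002))      # correct "00 02" -> 2 bytes of page codes (pages 00 and 83)
echo $((0x0200))      # buggy   "02 00" -> 512, i.e. a 512-byte page list
echo $((0x0200 + 4))  # plus the 4-byte VPD header = the 516 bytes mentioned above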
 
Sure can do!

Then let's see what your enclosure devices are:

Here you go -
Markdown (GitHub flavored):
root@PROXMOX-PVE-02:~# lsscsi
[4:0:0:0]    disk    Adaptec  OS_RAID_1        V1.0  /dev/sda
[4:1:0:0]    disk    INTEL    SSDSA2CW12       4PC1  -       
[4:1:1:0]    disk    INTEL    SSDSA2CW12       4PC1  -       
[4:1:2:0]    disk             MB3000GCWDB      HPGH  /dev/sdb
[4:1:3:0]    disk             MB3000GCWDB      HPGH  /dev/sdc
[4:3:0:0]    enclosu ADAPTEC  Virtual SGPIO  0 0001  -

Gotcha. I believe my two drives (INTEL) are being accessed as block devices via my Adaptec RAID controller.

Here is the output of dmesg -
Markdown (GitHub flavored):
root@PROXMOX-PVE-02:~# dmesg | grep "Attached"
[   33.884708] sd 4:0:0:0: Attached scsi generic sg0 type 0
[   33.885039] scsi 4:1:0:0: Attached scsi generic sg1 type 0
[   33.885276] scsi 4:1:1:0: Attached scsi generic sg2 type 0
[   33.885508] sd 4:1:2:0: Attached scsi generic sg3 type 0
[   33.885731] sd 4:1:3:0: Attached scsi generic sg4 type 0
[   33.885891] scsi 4:3:0:0: Attached scsi generic sg5 type 13
[   33.916035] sd 4:0:0:0: [sda] Attached SCSI removable disk
[   34.010099] sd 4:1:3:0: [sdc] Attached SCSI removable disk
[   34.010245] sd 4:1:2:0: [sdb] Attached SCSI removable disk
[   34.021849] ses 4:3:0:0: Attached Enclosure device

So far, it looks like I also have a `type 13` device - just like you - so I think I might have a similar enough setup to test.

Here is the output of the other two tests -

Markdown (GitHub flavored):
root@PROXMOX-PVE-02:~# sg_vpd --all /dev/sg5
Supported VPD pages VPD page:

fetching VPD page failed: Numerical argument out of domain
sg_vpd failed: Numerical argument out of domain

Same error as you.

And again -
Markdown (GitHub flavored):
root@PROXMOX-PVE-02:~# sg_vpd --all -HHHH /dev/sg5
0d 00 02 00 00 83

fetching VPD page failed: Numerical argument out of domain
sg_vpd failed: Numerical argument out of domain
Same error again.

I read Martin's error report and the info in the link you posted, but again, I'm not a kernel dev, so a lot of it did not make a ton of sense other than that the enclosure itself is reporting something that cannot be interpreted by either the kernel or `sg_vpd`.

So I am happy to try the patch; if you are able to point me in the direction of some documentation or a manual, I would be happy to give it a whirl.
 
The process to build and install a custom mainline kernel is relatively straightforward. However: just to test whether this patch helps, I built a MAINLINE kernel. That means: no ZFS module! The ZFS module is an addition to the kernel made by Ubuntu/Proxmox. So if your PVE host setup requires ZFS for booting or for access to your VM storage, you are a bit out of luck here, and you would need to wait until the patch officially arrives in a PVE release kernel.

If you don't use ZFS, read on.
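If you are not sure whether your host actually depends on ZFS, a rough check looks like this (assuming the ZFS userland tools are installed, as they are on a standard PVE install):

Code:
findmnt -n -o FSTYPE /                    # "zfs" here means the root filesystem needs the module
zpool list                                # any pools listed would be unavailable under a mainline kernel
grep -A3 '^zfspool' /etc/pve/storage.cfg  # PVE storages backed by ZFS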

First,

Code:
# apt install git build-essential bc kmod cpio flex libncurses5-dev libelf-dev libssl-dev dwarves bison

Then check this list, and install everything else that is missing (except for the packages marked optional, i.e. you especially don't need Rust):

https://www.kernel.org/doc/html/v6.8/process/changes.html?highlight=minimal

Then I did
Code:
# mkdir -p /usr/src; cd /usr/src
# git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
# cd linux

I decided to build/test this patch against 6.8.4 first, because that was the PVE kernel version that was failing for my machine.

# git checkout v6.8.4

Then I saved Martin's patch into a file named scsi_init_AMI_MG9072_device_quirk_mkpetersen_2024-05-21-003.patch

# cat scsi_init_AMI_MG9072_device_quirk_mkpetersen_2024-05-21-003.patch

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 3e0c0381277a..f0464db3f9de 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -350,6 +350,13 @@ static int scsi_get_vpd_size(struct scsi_device *sdev, u8 page)
                if (result < SCSI_VPD_HEADER_SIZE)
                        return 0;

+               if (result > sizeof(vpd)) {
+                       dev_warn_once(&sdev->sdev_gendev,
+                                     "%s: long VPD page 0 length: %d bytes\n",
+                                     __func__, result);
+                       result = sizeof(vpd);
+               }
+
                result -= SCSI_VPD_HEADER_SIZE;
                if (!memchr(&vpd[SCSI_VPD_HEADER_SIZE], page, result))
                        return 0;

Now we apply this patch:

root@linus:/usr/src/linux# git apply scsi_init_AMI_MG9072_device_quirk_mkpetersen_2024-05-21-003.patch -v
Checking patch drivers/scsi/scsi.c...
Applied patch drivers/scsi/scsi.c cleanly.

Then we need a kernel config file. I used the one from the 6.5.13-5-pve PVE kernel, which I last booted into successfully:

root@linus:/usr/src/linux# cp /boot/config-6.5.13-5-pve .config
root@linus:/usr/src/linux# make olddefconfig
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/confdata.o
  HOSTCC  scripts/kconfig/expr.o
  LEX     scripts/kconfig/lexer.lex.c
  YACC    scripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/menu.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTCC  scripts/kconfig/util.o
  HOSTLD  scripts/kconfig/conf
.config:10571:warning: symbol value 'm' invalid for ANDROID_BINDER_IPC
.config:10572:warning: symbol value 'm' invalid for ANDROID_BINDERFS
.config:10797:warning: symbol value 'm' invalid for FSCACHE
#
# configuration written to .config
#
root@linus:/usr/src/linux#

You can ignore those warnings.

Now we have to do 3 things:

# make -j 48
# make -j 48 modules_install
# make install

48 is the number of CPUs on my machine; use your own CPU count to fully utilize yours. On my box, this takes around 25 minutes.
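If you don't want to hard-code the count, nproc (from coreutils) will report it for you:

Code:
# the same three steps, with the CPU count picked up automatically
make -j "$(nproc)"
make -j "$(nproc)" modules_install
make install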

The last step will install the kernel, generate an initrd.img and add the new kernel to the GRUB config.

It will probably be named something like "6.8.4-dirty" by the kernel build system, because we did not git commit our patch.
Now you can boot test this kernel. If the patch helps, you'll find something like this in the dmesg output:

scsi 0:3:0:0: scsi_get_vpd_size: long VPD page 0 length: 516 bytes

This is the dev_warn_once that Martin added as a diagnostic.
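After rebooting into the new kernel, two quick checks confirm you are really running the patched build and that the quirk path was hit:

Code:
uname -r                        # should report the patched build, e.g. 6.8.4-dirty
dmesg | grep "long VPD page"    # Martin's dev_warn_once, printed when the buggy enclosure is probed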

Let us know whether all this works for you, too.


If you want to build again, against another kernel version, I would first commit this, and also assign a tag, so you can return to this version later if needed:

# git add .
# git commit -m "Applied SCSI init patch against 6.8.4"
# git tag "v6.8.4_with_scsi_init_AMI_MG9072_device_quirk_mkpetersen_2024-05-21-003.patch"

Then
# make clean
# make distclean

and you can check out a different kernel version tag to build against. Remember to reapply the patch!
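For example, a rebuild against 6.8.10 (just an example version) would repeat the same cycle:

Code:
git checkout v6.8.10
git apply scsi_init_AMI_MG9072_device_quirk_mkpetersen_2024-05-21-003.patch -v
cp /boot/config-6.5.13-5-pve .config
make olddefconfig
make -j "$(nproc)"
make -j "$(nproc)" modules_install
make install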
 
I do use ZFS for VM storage, but not for booting. Again, this is just my homelab, so I could migrate all of the VMs to my other nodes while I mess around with trying this patch out. But if the juice is not worth the squeeze, then I can just stop. If Martin wants more people to test the patch and it would be good to have more testing, I'm happy to do it, assuming I would still be able to get the ZFS module back in the future.
 
You could just boot test your compiled kernel to see if the problem is gone (maybe just disable all VM autostarting for that, and re-enable afterwards).

Then of course continue using the pinned working kernel until an officially fixed PVE kernel is released. I guess it won't be too long until then. In my bug report in the PVE Bugzilla I can see that the PVE kernel team already has this patch in their queue for the next release. On the LKML, there is just a review of Martin's patch pending from some of his SCSI co-maintainers to get it integrated into Linus' tree for 6.10 and Greg's stable trees for 6.8.y and 6.9.y.
 
That was my plan! I migrated all of the VMs off to other nodes in the cluster and disabled auto-start.

Next, my Debian Proxmox host did require me to install some more of the utilities for kernel compiling, but that was fairly straightforward. I was able to follow the guide to create the directory in /usr/src/ and then clone the kernel down into that directory. From there I checked out kernel version 6.5.13-3-pve since that was the last kernel version that worked for me and is working on my other nodes. I for sure messed this up a few times and had to restart the process. My nodes are super old and not good for kernel compilation. But I was able to apply the patch (I think) without issue.

I initially tried running this from the web UI console, but I think my connection kept timing out. It appears to have gotten "stuck" on /net/tls/tls.o, though I am not sure why. So I tried initiating it from an SSH connection directly. Then I ran out of local disk space on my drive since I had a bunch of old attempts left over; I only have 30GB of space left on my boot drives. So I cleared out the directory and restarted. I'll post another update once it's completed.
 

Applying the patch to 6.5.13 makes no sense. The code that triggers this enclosure implementation bug and thus the hang has been introduced in the 6.8.x development cycle. So the earliest version to which you might want to apply this patch, to test whether it resolves the issue, is 6.8.4.

I mean 6.5.13 boots fine, anyway, right? So what do you want to test?
 
I think that was just a misunderstanding on my part. I was attempting to follow your instructions when you had mentioned using the most recent kernel that had worked. So I just copied the kernel version from one of my nodes that is working. I will try it again using the 6.8.4 kernel.
 

I meant to say that I copied the kernel config file from 6.5.13-pve as a starting point. The kernel config file specifies what functionality and features to compile, which to build in and which as modules, etc., like so:

Code:
cp /boot/config-6.5.13-5-pve .config

Then, based on that config file, we do:

Code:
root@linus:/usr/src/linux# make olddefconfig

which means everything from the 6.5.13 kernel that is already configured there is configured the same for our new (say 6.8.4) build, and every kernel feature new to 6.8.4 gets a default configuration value.
 
Well, I don't think I can continue. Every time I run make -j I run out of space on my boot drive.
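Most of that space usually goes into debug symbols when the stock PVE config is reused. A possible way to shrink the build considerably, sketched here assuming the 6.8.x debug-info Kconfig options, is to switch them off before compiling:

Code:
df -h /usr/src                                      # check free space first
cd /usr/src/linux
scripts/config --enable  CONFIG_DEBUG_INFO_NONE     # build without debug info
scripts/config --disable CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
make olddefconfig                                   # re-resolve the config after the change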
 
I am also affected by this bug and will be watching this thread so I will hopefully be notified when it is resolved with a normal update. I have an H310 running in JBOD mode on a Dell 7910r. For now I'm just running the 6.5.13-5-pve kernel.

Is there anything I need to do to make sure this working kernel is not deleted?
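Holding the kernel package should keep apt from removing it during upgrades; the package name below is only an example, check dpkg for the exact name on your system:

Code:
dpkg -l | grep 6.5.13-5-pve                   # find the exact kernel package name
apt-mark hold proxmox-kernel-6.5.13-5-pve     # example name; a hold prevents automatic removal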
 
Can you explain the process to get this fix? I’m a newbie about these things and I’m stuck at the initramfs command prompt
 
The fix has meanwhile been incorporated into STABLE 6.1.94-rc1, 6.6.34-rc1 and 6.9.5-rc1 by stable kernel maintainer Greg Kroah-Hartman. I have built and tested them all, and they work fine on my machine.

Also, it has been in Linus' tree for a while now (since 6.10-rc2) so it will be in the upcoming MAINLINE 6.10, too.

Moreover, the Proxmox kernel team has picked this patch up for their 6.8.x-pve kernel, so it will most likely be in their next kernel bugfix release. Unfortunately, due to timing, it will not be in Greg's official STABLE 6.8 series, which has been EOL since 6.8.12 and does not contain the fix. So it will probably also not be picked up right now by Ubuntu (maybe later, when they select patches to backport to their 6.8.x LTS kernel), but as Proxmox already has it, it doesn't matter much what Ubuntu does.

For me it was a nice and pleasant learning experience to write a bug report (actually I wrote two, another for an unrelated issue that was fixed, too) and to work with the kernel folks to help them debug and fix this issue. I also got into testing -rc kernel releases and reporting back about success/failure/regressions more regularly now, and I think it's a nice way of giving back to the Linux kernel project, even when you are not a developer.
 
Can you explain the process to get this fix? I’m a newbie about these things and I’m stuck at the initramfs command prompt

The easiest thing would be to wait until Proxmox releases a new 6.8.x bugfix kernel with this patch included.

Otherwise, you would need to build a kernel of your own. If you use and need ZFS, you would have to check out the Proxmox kernel from their git, apply the patch and deb-build a new kernel package. Or, if you don't use ZFS, you can just git clone a suitable STABLE kernel, say 6.9.4 from kernel.org, apply the patch, build and test-boot it. This is way easier than building a patched deb package from the PVE kernel sources.

Or just wait two or three days until stable 6.9.5 is out. The patch will be included there, so just git clone and build that.
 
Fantastic work @pschneider1968!
I'm facing the same behaviour with an HP P822 RAID card, just waiting for the new kernel now.
 
