PVE 8.2 / Kernel 6.8.4-2 does not boot - cannot find root device

With the new kernel in 8.2, the initial wait for the root device times out and the system drops into a BusyBox shell.

The error messages I see are:

Timed out for waiting the udev queue being empty (2 times)
Gave up waiting for suspend/resume device
Gave up waiting for root file system device
...
ALERT! UUID=xxxxxx... does not exist. Dropping to a shell!

I had similar problems with earlier kernels too, so from the very beginning of running PVE on this machine I had to set the GRUB kernel parameter rootdelay=60.
With that, everything was fine: the buses settled, the root device was found, and the system booted.

With the new kernel in PVE 8.2, after seeing this error again for the first time, I even increased rootdelay to 120, but this still does not help: the kernel does not see the root device. If I boot into the last 6.5 kernel, everything is fine again!
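For readers unfamiliar with the parameter: on a GRUB-booted Debian-based system such as PVE, rootdelay is normally added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub. The sketch below is only one way to check that the value actually reached the running kernel; the helper name cmdline_value and the sample command line in the comment are made up for illustration.

```shell
# Print the value of a kernel parameter found in a command-line string.
# Returns non-zero if the parameter is not present.
# (cmdline_value is a hypothetical helper, not part of any tool.)
cmdline_value() {
    cmdline=$1
    param=$2
    # Intentionally unquoted: split the command line on whitespace.
    for word in $cmdline; do
        case $word in
            "$param"=*)
                printf '%s\n' "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Illustrative call with an invented command line:
#   cmdline_value "root=UUID=abc ro quiet rootdelay=60" rootdelay
```

On a live system it would be called as cmdline_value "$(cat /proc/cmdline)" rootdelay. If nothing is printed, the parameter did not make it onto the kernel command line.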

This is an older server machine: 2-socket Ivy Bridge Xeon E5-2697 v2 (24C/48T) in an Asus Z9PE-D16/2L motherboard (Intel C-602A chipset); BIOS patched to the latest available from Asus.

Storage:

I have two Samsung 256 GiB SATA SSDs attached to internal SATA ports; one of them is my root device and PVE installation drive, the other I use for storing ISO images. My main VM storage is attached to a battery-backed Adaptec 5805 SATA/SAS RAID controller: one RAID1 array consisting of two Samsung 1 TiB SATA SSDs for VM root disk images, and one RAID5 array consisting of six Hitachi 1 TiB HDDs for VM data disk images. On both arrays, I use an LVM thin pool.

Once everything boots up, the system runs smoothly (and has for years!). Although this is "only" a homelab server, I love it dearly and use it for many private project VMs, among them a Windows Server VM running MS SQL Server and Linux server VMs running Oracle Database Server (I'm a database guy).

I'd be grateful for a hint as to what might cause this behaviour, and what else I could try to remedy the problem with the new kernel. Failing that, how would I permanently go back to the last working 6.5 kernel?

Thanks,
Peter
 
Hi,
We are currently experiencing the same issue.
It appears to be related to the Adaptec controllers, since we are also using one in this machine.
Details:
  • RAID bus controller: Adaptec AAC-RAID (rev 09)
  • The machine boots and functions perfectly with version 6.2.16-20-pve.
We would greatly appreciate it if we could find a way to use version 6.8.
Thanks,
Sascha
 

For the time being, I used the command
pve-efiboot-tool kernel pin <...>

to pin and boot the previous and working (for me) 6.5.13-5-pve kernel, and I will check occasionally whether this problem has gone away in the 6.8.x kernel as updates keep coming.
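A minimal sketch of the pinning workaround, assuming the proxmox-boot-tool name documented on the PVE Host Bootloader wiki page (as far as I know, pve-efiboot-tool is the older name for the same tool). The wrapper pin_kernel and its RUN switch are hypothetical; by default it only prints the commands, so nothing is changed by accident.

```shell
# Print (or, with RUN=1, execute) the commands to pin a PVE kernel.
# pin_kernel and RUN are made-up names for this sketch.
pin_kernel() {
    version=$1
    for cmd in \
        "proxmox-boot-tool kernel list" \
        "proxmox-boot-tool kernel pin $version"
    do
        if [ "${RUN:-0}" = 1 ]; then
            # Intentionally unquoted so the string splits into command + args.
            $cmd
        else
            printf '%s\n' "$cmd"
        fi
    done
}

# Dry run: shows what would be executed for the kernel named in this thread.
pin_kernel 6.5.13-5-pve
```

To undo the pin later, proxmox-boot-tool kernel unpin should do it once a fixed 6.8.x kernel is available.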
 
It has been some 15 years since I last built a kernel of my own, but I'm currently checking out and building stable 6.8.7 from kernel.org to see whether this problem is present there, too. If not, I will try to bisect (never done this before...)



What I've done so far is:

- Checked out mainline and built 6.9-rc5+ as of today. This shows the same buggy behaviour: root device not found, machine does not boot.
- Checked out and built mainline 6.5.13. This works. I'm aware this is not the same as 6.5.13-5-pve, because Proxmox pulls the latter from Ubuntu and it has some more drivers etc. in it, but I wanted to verify that mainline 6.5.13 boots, too.

I will now get some sleep, and will start to bisect this evening, and tomorrow afternoon, and hope that I can find a smoking gun.
 
While bisecting, I have built 13 kernels so far: 6 of them work fine, while 7 exhibit the buggy behaviour. I have narrowed it down to 126 remaining commits between 6.8.0-rc5+ (c6a597fcc7ad, works) and 6.8.0-rc6+ (2652b99e4340, doesn't work), so about 7 more builds to go and test. I will continue tomorrow afternoon.
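The "7 more builds" figure follows from bisection halving the candidate range at every step. A rough sketch of that arithmetic is below; bisect_steps is a hypothetical helper that computes the smallest number of halvings covering a given commit count. Git's own estimate is computed differently and may be off by one from this.

```shell
# Smallest number of halvings needed to narrow `commits` candidates to one.
# (bisect_steps is a made-up helper; it is not how git computes its estimate.)
bisect_steps() {
    commits=$1
    steps=0
    span=1
    while [ "$span" -lt "$commits" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    printf '%s\n' "$steps"
}

bisect_steps 126   # prints 7, since 2^7 = 128 >= 126
```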

I also found that mainline 6.8.7 and 6.8.4 (the latter matching the PVE version) don't work either. Moreover, I just read on LWN that GregKH has released 6.8.8, so I will test that tomorrow, too.
 
So after compiling maaannnyyyy kernels, I could pinpoint the buggy behaviour to a single commit in the SCSI subsystem. I am not subscribed to any kernel mailing list, nor do I have knowledge of how to work with the kernel folks. So I am just attaching my findings here, and hope someone from Proxmox will pick it up and relay it to the kernel developers or kernel group in charge.

Short: the offending commit is this one:

b5fc07a5fb56216a49e6c1d0b172d5464d99a89b is the first bad commit
commit b5fc07a5fb56216a49e6c1d0b172d5464d99a89b
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date: Wed Feb 14 17:14:11 2024 -0500

scsi: core: Consult supported VPD page list prior to fetching page

For my hardware setup, please see my first post. With proper instruction, I can compile/test patches.

Linus merged this into his tree with this commit on Feb 24:
https://git.kernel.org/pub/scm/linu.../?id=6d20acbf3e3a32d331947dbc3802cf2d1a399e7d

Long:

Kernel bug/regression:
AACRAID controller cannot be initialized; message: "Timed out for waiting the udev queue being empty."
=> System boot fails (Kernel hangs)


Proxmox Virtual Environment (PVE) Kernels
=========================================
6.5.13-5-pve   WORKS   last working PVE kernel; 5.15-pve and 6.2-pve work too
6.8.4-2-pve    NOPE    PVE release 8.2


Mainline Kernels
================
6.9.0-rc5+     NOPE    Most recent (2024-04-27)
6.8.8          NOPE    Most recent released (2024-04-29)
6.8.7          NOPE    Most recent released (2024-04-27)
6.8.4          NOPE    Same version as most recent released PVE 8.2 kernel
6.5.13         WORKS


Bisecting...
============

root@linus:/usr/src/linux# git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
root@linus:/usr/src/linux# git log
commit 9d1ddab261f3e2af7c384dc02238784ce0cf9f98 (HEAD -> master, origin/master, origin/HEAD)
Merge: 71b1543c83d6 77d8aa79ecfb
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue Apr 23 09:37:32 2024 -0700

Merge tag '6.9-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

root@linus:/usr/src/linux# cp /boot/config-6.5.13-5-pve .config
root@linus:/usr/src/linux# git bisect start
status: waiting for both good and bad commits
root@linus:/usr/src/linux# git bisect bad
status: waiting for good commit(s), bad commit known
root@linus:/usr/src/linux# git bisect good v6.5.13
Bisecting: a merge base must be tested
[2dde18cd1d8fac735875f2e4987f11817cc0bc2c] Linux 6.5
root@linus:/usr/src/linux# make olddefconfig
.config:10571:warning: symbol value 'm' invalid for ANDROID_BINDER_IPC
.config:10572:warning: symbol value 'm' invalid for ANDROID_BINDERFS
#
# configuration written to .config
#
root@linus:/usr/src/linux# make -j 48

=> 6.5.0 (Merge Base) WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 32111 revisions left to test after this (roughly 15 steps)
[0f5cc96c367f2e780eb492cc9cab84e3b2ca88da] Merge tag 's390-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

root@linus:/usr/src/linux# make -j 48

=> 6.7.0-rc2+ WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 16056 revisions left to test after this (roughly 14 steps)
[ee138217c32ccbfa75d5ea6b766158148e98f6fa] Merge tag 'btree-remove-btnum-6.9_2024-02-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.9-mergeC

=> 6.8.0-rc4+ WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 8214 revisions left to test after this (roughly 13 steps)
[e5e038b7ae9da96b93974bf072ca1876899a01a3] Merge tag 'fs_for_v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

=> 6.8.0+ NOPE => does not find root device, does not boot;
message: "BUG: arch topology borken the CPU domain not a subset of the NUMA domain"
message: "Timed out for waiting the udev queue being empty."

root@linus:/usr/src/linux# git bisect bad
Bisecting: 3954 revisions left to test after this (roughly 12 steps)
[f153fbe1ea11939e2514ba4b3b62bbd946e2892c] Merge tag 'erofs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

=> 6.8.0+ (HEAD detached at f153fbe1ea11) NOPE => same as above

root@linus:/usr/src/linux# git bisect bad
Bisecting: 1945 revisions left to test after this (roughly 11 steps)
[1ddeeb2a058d7b2a58ed9e820396b4ceb715d529] Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux

=> 6.8.0+ (HEAD detached at 1ddeeb2a058d) NOPE => same as above

root@linus:/usr/src/linux# git bisect bad
Bisecting: 970 revisions left to test after this (roughly 10 steps)
[2652b99e43403dc464f3648483ffb38e48872fe4] ice: virtchnl: stop pretending to support RSS over AQ or registers

=> 6.8.0-rc6+ (2652b99e4340) NOPE => same

root@linus:/usr/src/linux# git bisect bad
Bisecting: 506 revisions left to test after this (roughly 9 steps)
[efa80dcbb7a3ecc4a1b2f54624c49b5a612f92b3] Merge tag 'trace-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

=> 6.8.0-rc5+ (efa80dcbb7a3) WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 251 revisions left to test after this (roughly 8 steps)
[c6a597fcc7ad7335a3ecf8f5287a0459f793a257] Merge tag 'loongarch-fixes-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

=> 6.8.0-rc5+ (c6a597fcc7ad) WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 126 revisions left to test after this (roughly 7 steps)
[cf1182944c7cc9f1c21a8a44e0d29abe12527412] Merge tag 'lsm-pr-20240227' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

=> 6.8.0-rc6+ (cf1182944c7c) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 62 revisions left to test after this (roughly 6 steps)
[4ca0d9894fd517a2f2c0c10d26ebe99ab4396fe3] Merge tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

=> 6.8.0-rc5+ (4ca0d9894fd5) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 36 revisions left to test after this (roughly 5 steps)
[ac389bc0ca56e1a2f92b2a17e58298390a3879a8] Merge tag 'cxl-fixes-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

=> 6.8.0-rc5+ (ac389bc0ca56) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[40de53fd002c6ba087a623722915e8006ed68a02] Merge branch 'for-6.8/cxl-cper' into for-6.8/cxl

=> 6.8.0-rc5+ (40de53fd002c) WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[9ddf190a7df77b77817f955fdb9c2ae9d1c9c9a3] scsi: jazz_esp: Only build if SCSI core is builtin

=> 6.8.0-rc1+ (9ddf190a7df7) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[de959094eb2197636f7c803af0943cb9d3b35804] scsi: target: pscsi: Fix bio_put() for error case

=> 6.8.0-rc1+ (de959094eb21) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 0 revisions left to test after this (roughly 1 step)
[b5fc07a5fb56216a49e6c1d0b172d5464d99a89b] scsi: core: Consult supported VPD page list prior to fetching page

=> 6.8.0-rc1+ (b5fc07a5fb56) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[321da3dc1f3c92a12e3c5da934090d2992a8814c] scsi: sd: usb_storage: uas: Access media prior to querying device properties

=> 6.8.0-rc1+ (321da3dc1f3c) WORKS

root@linus:/usr/src/linux# git bisect good
b5fc07a5fb56216a49e6c1d0b172d5464d99a89b is the first bad commit
commit b5fc07a5fb56216a49e6c1d0b172d5464d99a89b
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date: Wed Feb 14 17:14:11 2024 -0500

scsi: core: Consult supported VPD page list prior to fetching page

Commit c92a6b5d6335 ("scsi: core: Query VPD size before getting full
page") removed the logic which checks whether a VPD page is present on
the supported pages list before asking for the page itself. That was
done because SPC helpfully states "The Supported VPD Pages VPD page
list may or may not include all the VPD pages that are able to be
returned by the device server". Testing had revealed a few devices
that supported some of the 0xBn pages but didn't actually list them in
page 0.

Julian Sikorski bisected a problem with his drive resetting during
discovery to the commit above. As it turns out, this particular drive
firmware will crash if we attempt to fetch page 0xB9.

Various approaches were attempted to work around this. In the end,
reinstating the logic that consults VPD page 0 before fetching any
other page was the path of least resistance. A firmware update for the
devices which originally compelled us to remove the check has since
been released.

Link: https://lore.kernel.org/r/20240214221411.2888112-1-martin.petersen@oracle.com
Fixes: c92a6b5d6335 ("scsi: core: Query VPD size before getting full page")
Cc: stable@vger.kernel.org
Cc: Bart Van Assche <bvanassche@acm.org>
Reported-by: Julian Sikorski <belegdol@gmail.com>
Tested-by: Julian Sikorski <belegdol@gmail.com>
Reviewed-by: Lee Duncan <lee.duncan@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

drivers/scsi/scsi.c | 22 ++++++++++++++++++++--
include/scsi/scsi_device.h | 4 ----
2 files changed, 20 insertions(+), 6 deletions(-)
root@linus:/usr/src/linux#
 
Yes, I did. It doesn't work either, most likely and quite simply because the offending commit

Code:
b5fc07a5fb56216a49e6c1d0b172d5464d99a89b is the first bad commit
commit b5fc07a5fb56216a49e6c1d0b172d5464d99a89b
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date:   Wed Feb 14 17:14:11 2024 -0500

   scsi: core: Consult supported VPD page list prior to fetching page

which I found by bisecting (see my last post above) is still in.
 
I am not sure whether I am experiencing the same issue, but I upgraded my nodes a few days back, rebooted one, and noticed it was not coming back online. So I fired up my IPMI and the Java KVM (old nodes, I know). I rebooted the node again to see the ACPI output on the screen, and I am getting this error:
(screenshot: 2024-05-07_10-17.png)

Google brought me here, and I think it might be related. Could anyone confirm? I believe it is related because @tafkaz mentioned that his nodes use Adaptec RAID controllers, and so do mine.

My current workaround is to use an older kernel and, if I need to reboot, select it from GRUB. But I am seriously considering pinning it like @pschneider1968 mentioned.

Is there anything else I could try to troubleshoot? I booted into another old kernel (5.15.149-1) and ran dmesg and journalctl -bl0, but I am not sure that information will be relevant, since I am only having issues on 6.7.4-2.
 

Which Adaptec controller do you have in this machine? Mine is an Adaptec 5805 with BBU, running firmware 18948 (the previous 18937 behaved the same way, though).

It seems the buggy kernel patch I found prevents the initialization of the controller, because in a "bad" boot, dmesg output just stops right before it.

Unfortunately, none of the kernel maintainers have reacted yet to my bug report on the linux-scsi, linux-stable and kernel mailing lists...
 
I just checked, and the node I am having issues with does use an Adaptec: lspci reports Adaptec AAC-RAID [9005:0285]. I would reboot the node to get the full version, but I'm feeling lazy at the moment. The RAID controller BIOS is able to load; it's only once I get into GRUB and then into the PVE kernel that my issue appears. But I suspect you are correct: just because my RAID BIOS loads fine does not mean the controller is being properly initialized.

I'm not stressed about a kernel patch; my workaround is to use an old kernel for now and just try not to reboot the node LOL.
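For others comparing hardware in this thread, a small sketch for pulling the vendor:device ID out of lspci output, so controllers can be compared exactly rather than by marketing name. It assumes lspci -nn (which prints numeric IDs in square brackets next to the names); the helper name pci_ids is made up.

```shell
# Read lspci -nn style lines on stdin and print only the
# vendor:device IDs, one per line. (pci_ids is a hypothetical helper.)
pci_ids() {
    grep -oE '\[[0-9a-fA-F]{4}:[0-9a-fA-F]{4}\]' | sed 's/^\[//; s/\]$//'
}

# Example using the ID string quoted in this thread:
echo 'Adaptec AAC-RAID [9005:0285]' | pci_ids
```

On a real node the input would come from something like lspci -nn | grep -i adaptec | pci_ids.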
 
I've stumbled over this thread too.
I am running Proxmox on a Dell Micro 5070 and I'm pretty sure it uses an Adaptec RAID controller.

I upgraded from Proxmox 8.1.4 to 8.2.2 and it wouldn't boot unless in safe mode.
Fortunately, I had backed up my Proxmox host with a Clonezilla image, so I restored that back to 8.1.4 and everything immediately started working again.

EDIT: After re-reading the thread, I have also pinned the kernel to 6.5.13-1-pve using the instructions here:

https://pve.proxmox.com/wiki/Host_Bootloader#sysboot_kernel_pin

I'm now running on 8.2.2
 
I have the exact same behaviour with a DELL PowerEdge R720 and it's definitely not running an Adaptec adapter (it's a Perc H310 Mini which is LSI based if I'm correct). I selected version 6.5.13-5-pve of the kernel and could successfully boot (instead of 6.8.4-3-pve). I pinned the previous version, I hope it survives a reboot.
 
What's the last successful dmesg output with regard to device initialization of a kernel failing to boot (i.e. 6.8.4) compared to a successfully booting kernel?
 
Same with HP EliteDesk G2 800. Nothing special in it. Kernel 6.8.4-2-pve and -3-pve won't boot. I pinned back to 6.5.13-5-pve.
 
