[solved] Kernel Panic when upgrading from pve 9.1.6 to 9.1.9 (kernel 7.0.x) - amdgpu drivers

gargakumar

Member
Oct 31, 2023
As discussed on the 7.0 kernel thread, I'm opening a new thread to investigate the issue.
Yesterday I updated pve from 9.1.6 to 9.1.9, which installed a 7.0.x kernel. However, on reboot I got a kernel panic:

Code:
unable to mount root fs on unknown-block(0,0)

I rebooted back into 6.17.13-7-pve and everything seems OK.

Hardware setup :
GMKTech K12 Mini PC Ryzen 7 H 255, 64 GB of RAM.
RootFS is on a 2TB NVME drive (Acer Predator GM7000), using the standard LVM setup with a 45 GB root partition

I'm not sure whether the proper fix is to pin kernel 6.17 or to try to resolve the issue on kernel 7.0.x. I would prefer the second option, but I have no clue how to move forward.

@fabian : I'm happy to help investigate the issue, please let me know if I can provide more info.

Here is the working kernel boot log ("journalctl -b -k"): log
 
thanks! could you also post "lspci -vv -s 0000:01"?
 
Sure :

Code:
lspci -vv -s 0000:01

00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 0

00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge (prog-if 00 [Normal decode])
        Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 35
        IOMMU group: 1
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: [disabled] [32-bit]
        Memory behind bridge: dcb00000-dcbfffff [size=1M] [32-bit]
        Prefetchable memory behind bridge: [disabled] [64-bit]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Root Port (Slot+), IntMsgNum 0
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag+ RBE+ TEE-IO-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
                LnkSta: Speed 16GT/s, Width x4
                        TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #0, PowerLimit 75W; Interlock- NoCompl+
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet- LinkState+
                RootCap: CRSVisible+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd+
                         AtomicOpsCap: Routing+ 32bit+ 64bit+ 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- ARIFwd+
                         AtomicOpsCtl: ReqEn- EgressBlck-
                         IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
                         10BitTagReq+ OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453
        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [2a0 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                L1SubCtl2:
        Capabilities: [400 v1] Data Link Feature <?>
        Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [440 v1] Lane Margining at the Receiver
                PortCap: Uses Driver-
                PortSta: MargReady+ MargSoftReady-
        Kernel driver in use: pcieport
 
sorry, that should have been "01:00"!
 
No worries.

Code:
lspci -vv -s 01:00

01:00.0 Non-Volatile memory controller: Biwin Storage Technology Co., Ltd. Device 5236 (rev 01) (prog-if 02 [NVM Express])
        Subsystem: Biwin Storage Technology Co., Ltd. Device 5236
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 63
        IOMMU group: 13
        Region 0: Memory at dcb00000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, IntMsgNum 0
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75W TEE-IO-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s, Width x4
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: Upstream Port
        Capabilities: [b0] MSI-X: Enable+ Count=66 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [d0] Vital Product Data
                Not readable
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr+ HeaderOF+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [158 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [19c v1] Lane Margining at the Receiver
                PortCap: Uses Driver-
                PortSta: MargReady+ MargSoftReady-
        Capabilities: [1f4 v1] Latency Tolerance Reporting
                Max snoop latency: 1048576ns
                Max no snoop latency: 1048576ns
        Capabilities: [1fc v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=32768ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [20c v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
        Capabilities: [244 v1] Data Link Feature <?>
        Kernel driver in use: nvme
        Kernel modules: nvme
 
Hey there,

Any chance that your /boot is running out of space? I had the same problem a few moments ago. After a bit of research in the install logs, I found out that I didn't have enough space on my /boot partition. I got the exact same kernel panic message as you...

I purged some older kernels with apt autoremove and retried the kernel installation. My system works fine with kernel 7 for now ;)
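
A quick way to check the free-space theory is to ask df which filesystem actually holds /boot and how full it is. The following is only a minimal sketch: the helper name use_pct is made up for illustration, and it assumes the portable `df -P` output format, where the usage percentage is the fifth column.

```shell
#!/bin/sh
# use_pct (hypothetical helper): read `df -P` output on stdin and print
# the usage percentage (digits only) of the first filesystem listed.
use_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# /boot may be a separate partition or just a directory on the root fs;
# df reports whichever filesystem contains it.
df -P /boot 2>/dev/null | use_pct
```

If /boot is not a separate mount, the number printed is simply the usage of the root filesystem.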
 
For me this was it.
/boot was full and the initrd image wasn't created for the latest kernel.
I booted a previous kernel, deleted all unneeded kernels with `apt remove --purge [insert kernel versions here]`, then ran `update-initramfs -u -k all` and `proxmox-boot-tool refresh`.

On the other hand, one of my machines was freezing after successfully booting with kernel 7.0.2-2, and I don't know why.
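
A missing initrd is easy to spot before rebooting. The sketch below assumes the usual Debian layout, where each kernel image /boot/vmlinuz-VERSION has a matching /boot/initrd.img-VERSION; the function name missing_initrds is hypothetical.

```shell
#!/bin/sh
# missing_initrds (hypothetical helper): print every kernel version in the
# given directory that has a vmlinuz-* image but no initrd.img-* next to it.
missing_initrds() {
    dir=$1
    for k in "$dir"/vmlinuz-*; do
        [ -e "$k" ] || continue              # glob matched nothing
        ver=${k##*/vmlinuz-}                 # strip the path and "vmlinuz-" prefix
        [ -e "$dir/initrd.img-$ver" ] || echo "$ver"
    done
}

missing_initrds /boot
```

Any version it prints would most likely fail to find its root filesystem at boot, so it is worth regenerating the initramfs for it before selecting that kernel.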
 
That might be a possible source. I haven't found anything upstream related to that particular NVMe model, so if it is not an issue with the ESP/initrd/etc., the next step would be to obtain boot messages from the broken kernel on your system.
 
Hi. I think my boot partition is OK.

Code:
df -h

Filesystem            Size  Used Avail Use% Mounted on
udev                   30G     0   30G   0% /dev
tmpfs                 5.9G  3.7M  5.9G   1% /run
/dev/mapper/pve-root   44G   29G   13G  70% /
tmpfs                  30G   25M   30G   1% /dev/shm
efivarfs              128K   52K   72K  42% /sys/firmware/efi/efivars
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                 1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
tmpfs                  30G     0   30G   0% /tmp
/dev/nvme0n1p2       1022M  8.8M 1014M   1% /boot/efi
/dev/fuse             128M   36K  128M   1% /etc/pve
tmpfs                 1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
tmpfs                 5.9G  4.0K  5.9G   1% /run/user/0

I just tried an apt update && apt upgrade and I noticed this error message after the initramfs build logs :

Code:
Setting up proxmox-kernel-6.17 (6.17.13-8) ...
dpkg: dependency problems prevent configuration of proxmox-kernel-7.0:
 proxmox-kernel-7.0 depends on proxmox-kernel-7.0.2-3-pve-signed | proxmox-kernel-7.0.2-3-pve; however:
  Package proxmox-kernel-7.0.2-3-pve-signed is not configured yet.
  Package proxmox-kernel-7.0.2-3-pve is not installed.
  Package proxmox-kernel-7.0.2-3-pve-signed which provides proxmox-kernel-7.0.2-3-pve is not configured yet.

dpkg: error processing package proxmox-kernel-7.0 (--configure):
 dependency problems - leaving unconfigured

Maybe this could be a clue for the issue ?
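
The dpkg messages above mean the 7.0 kernel packages were unpacked but never finished configuring. One way to list packages stuck in that state is to filter the status column of `dpkg -l`. This is a minimal sketch; the helper name unconfigured_pkgs is made up for illustration.

```shell
#!/bin/sh
# unconfigured_pkgs (hypothetical helper): read `dpkg -l` output on stdin and
# print packages that are selected for install (i) but only unpacked (U),
# half-configured (F) or half-installed (H).
unconfigured_pkgs() {
    awk '$1 ~ /^i[UFH]/ { print $2 }'
}

if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | unconfigured_pkgs
fi
```

`dpkg --audit` reports similar information in a more verbose form.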
 
Reading further, I found at least one issue in this part of the initramfs build logs. I'll go and try to find AMD GPU drivers for 7.0.x to see if that resolves the issue.

Code:
Setting up smartmontools (7.5-pve2) ...
/var/lib/smartmontools/drivedb/drivedb.h 7.3/5528 replaced with 7.5/5706 (NOT VERIFIED)
Setting up systemd-boot-efi:amd64 (257.13-1~deb13u1) ...
Setting up libnvpair3linux:amd64 (2.4.2-pve1) ...
Setting up proxmox-widget-toolkit (5.1.10) ...
Setting up pve-xtermjs (6.0.0-1) ...
Setting up lxcfs (7.0.0-pve1) ...
Setting up pve-qemu-kvm (11.0.0-2) ...
Setting up proxmox-kernel-7.0.2-3-pve-signed (7.0.2-3) ...
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/dkms 7.0.2-3-pve /boot/vmlinuz-7.0.2-3-pve
Sign command: /lib/modules/7.0.2-3-pve/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub

Autoinstall of module amdgpu/6.16.6-2255209.24.04 for kernel 7.0.2-3-pve (x86_64)
Building module(s)........(bad exit status: 2)
Failed command:
'make' KERNELVER=7.0.2-3-pve

Error! Bad return status for module build on kernel: 7.0.2-3-pve (x86_64)
Consult /var/lib/dkms/amdgpu/6.16.6-2255209.24.04/build/make.log for more information.

Autoinstall on 7.0.2-3-pve failed for module(s) amdgpu(10).

Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
run-parts: /etc/kernel/postinst.d/dkms exited with return code 1
Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/proxmox-kernel-7.0.2-3-pve-signed.postinst line 20.
dpkg: error processing package proxmox-kernel-7.0.2-3-pve-signed (--configure):
 installed proxmox-kernel-7.0.2-3-pve-signed package post-installation script subprocess returned error exit status 2
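
In this log the kernel package's post-installation step aborts because the DKMS hook fails, which presumably also means the later hooks (such as the initramfs build) never ran for 7.0.2-3-pve. To pick the failing module and kernel out of a long apt log, something like the following sketch could be used. The function name dkms_failures is hypothetical, and it only captures the first module named on each failure line.

```shell
#!/bin/sh
# dkms_failures (hypothetical helper): read apt/dpkg output on stdin and
# print "<module> <kernel>" for each DKMS autoinstall failure line.
dkms_failures() {
    sed -n 's/^Autoinstall on \([^ ]*\) failed for module(s) \([^(]*\)(.*$/\2 \1/p'
}

# Example with the failure line from the log in this thread:
echo 'Autoinstall on 7.0.2-3-pve failed for module(s) amdgpu(10).' | dkms_failures
```

The actual compiler error is in the make.log file that the DKMS message points to.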
 
I was previously running amdgpu/30.20.1. I tried updating to the current latest stable version: amdgpu/30.30.3.
This new version built fine on 6.17.x kernels but still failed on 7.0.2 :(

So I guess I'm stuck with 6.17.x for now, right?

What would be the best way to pin the kernel to 6.17, even across the next few Proxmox upgrades, and remove the 7.x entries from grub?

Thank you.
 
Thank you! This did work to set up grub:
Code:
proxmox-boot-tool kernel pin 6.17.13-8-pve
proxmox-boot-tool refresh
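
After the next reboot, it is worth confirming that the pinned kernel is really the one running. A minimal sketch follows; the function name running_kernel_is is made up for illustration, and the version string is the one pinned above.

```shell
#!/bin/sh
# running_kernel_is (hypothetical helper): succeed if the running kernel
# release equals the expected version string.
running_kernel_is() {
    [ "$(uname -r)" = "$1" ]
}

want=6.17.13-8-pve
if running_kernel_is "$want"; then
    echo "running pinned kernel $want"
else
    echo "running $(uname -r), expected $want"
fi
```

`proxmox-boot-tool kernel list` should also show the pinned version, and `proxmox-boot-tool kernel unpin` removes the pin once a working driver for 7.0.x is available.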

I also tried removing the 7.0 kernels to get rid of the dkms errors on future updates.
However, running `apt remove proxmox-kernel-7.0.2-3-pve` keeps reinstalling the 7.0.2-3-pve-signed kernel, and `apt remove proxmox-kernel-7.0` wants to remove "proxmox-default-kernel proxmox-kernel-7.0 proxmox-ve", which looks like too much.
 
Indeed, you cannot remove the (new) default kernel version, since the Proxmox packages depend on it. You'll have to ask AMD for a newer driver that is compatible with Linux kernel version 7.0. Or maybe you can use the open-source amdgpu driver that ships with the Linux kernel (and uninstall the one you have now)?
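
To see which amdgpu build a given kernel would load, the module's file path is a reasonable hint. The sketch below assumes the usual Debian convention that DKMS-built modules are installed under updates/dkms/ while in-kernel drivers live under kernel/; the function name module_origin is hypothetical.

```shell
#!/bin/sh
# module_origin (hypothetical helper): classify a kernel module file path
# as DKMS-built or in-tree, based only on where it is installed.
module_origin() {
    case $1 in
        */updates/dkms/*) echo dkms ;;
        */kernel/*)       echo in-tree ;;
        *)                echo unknown ;;
    esac
}

# modinfo -F filename prints the path of the module for the running kernel.
if path=$(modinfo -F filename amdgpu 2>/dev/null); then
    module_origin "$path"
fi
```

If it prints dkms, the out-of-tree AMD package is taking precedence over the driver that ships with the kernel.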
 
OK, I get it. I'll mark the thread as solved for now. Thank you all for the help in investigating the issue.

As for the amdgpu drivers, I would need to run some tests with ROCm and ComfyUI using the open-source driver in kernel 7.0.x. That would be a cool option, but I'll keep it for another day :)