Q: weird problem after upgrade Prox7>8 / Boot newer kernel - "alert! /dev/mapper/pve-root does not exist" - not bootable

fortechitsolutions

Hi, a small fun question; I wonder if anyone has any thoughts on it.

I've got a ~decent box at a client office. It is not the latest hardware; it was deployed as a 'refurb' Lenovo tower server: dual-socket Xeon, 96GB RAM, MegaRAID hardware RAID.

It was originally deployed as Proxmox v6 when that was current, then was updated to v7, and most recently I finally moved it up to the latest v8 a few days ago during some scheduled maintenance that was a bit overdue. I am running the free community (no-subscription) repo version of PVE on here, no subscription/paid repository. (Yes, I realize the disclaimer about using that in prod, etc. etc.)

So, the weird drama. Things seemed fine after the upgrade to latest: rebooted onto the new kernel, everything happy, looked like a clean upgrade.

Today (weekend) a user tells me 'hey, is the server down?' and I check - indeed it is down. I can get in via the remote IPMI console and see that Proxmox has failed during early boot, with sad messages on the console, among them "ALERT! /dev/mapper/pve-root does not exist", and it has dropped to the initramfs prompt.

After chasing a few red herrings, I found that if I warm-reboot the server (Ctrl-Alt-Del) > GRUB boot menu > choose the last good kernel before the latest,
then things boot just fine and the server is up and running in no time.

The newer kernel that makes me sad is this one:
/boot/vmlinuz-6.8.12-8-pve

and the older one that made me happy was this one:
/boot/vmlinuz-5.15.158-2-pve

So: I came across another thread that felt kind of similar,

and for the moment I have given up on debugging so I can go enjoy a bit of the weekend, leaving the Proxmox host booted up and operational on the older 5.15 kernel.

The workaround was to pin the kernel:

Code:
proxmox-boot-tool kernel pin 5.15.158-2-pve
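
For my own reference (assuming the standard proxmox-boot-tool workflow; I have not un-pinned anything yet), checking the pin and later removing it should look roughly like this:

Code:
# show installed kernels and which one is currently pinned
proxmox-boot-tool kernel list

# later, once the 6.8 issue is sorted out, drop the pin and resync the boot entries
proxmox-boot-tool kernel unpin
proxmox-boot-tool refresh
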
I am wondering if this sounds familiar to anyone, or if there is an easy reason why something like this is broken?

I am wondering if, for example, the hardware RAID driver module is borked on this new kernel?

I am using a pretty standard hardware RAID card here, as per:

Code:
root@pve:/opt/bin# megaclisas-status
-- Controller information --
-- ID | H/W Model                | RAM    | Temp | BBU    | Firmware
c0    | LSI MegaRAID SAS 9240-8i | 0MB    | N/A  | Absent | FW: 20.13.1-0203
...truncated....
and

Code:
root@pve:/opt/bin# lsmod | grep -i mega
megaraid_sas          184320  3
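
One check I can do from the working 5.15 boot (just a sketch on my part, assuming the initrd follows the usual Debian naming next to the vmlinuz above) is whether the megaraid_sas module actually landed in the 6.8 initramfs, and rebuild it if not:

Code:
# does the new kernel's initramfs actually contain the megaraid_sas driver?
lsinitramfs /boot/initrd.img-6.8.12-8-pve | grep -i megaraid

# if it is missing, rebuilding the initramfs for all kernels may help
update-initramfs -u -k all
update-grub    # or proxmox-boot-tool refresh, depending on how the box boots
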

I believe I had some drama a couple of years ago with a (different) vintage-ish Dell with a PERC (also an LSI MegaRAID in disguise) that just gave me pain with newer Proxmox, and I am kind of wondering if this might be a vaguely related, new-and-improved edition of that old happy drama. Maybe.

Or possibly something else is going on.

If anyone has any ideas or suggestions, or has seen this before and can say "yes, ah, there is an easy fix", that would be lovely.

Clearly it is "OK" for now with a working kernel pinned, but I am not entirely thrilled to leave this in place as my long-term 'forever' fix.

Any comments, suggestions, etc. are greatly appreciated.

thank you,

Tim
 
Footnote on the thread: I did a bit more digging in the forum and found this discussion, which sounds very much like it may be the right thing for me to chase.


So I'll give that a go sometime - it looks like I can tweak the kernel boot option stanza slightly,
approximately as per the above thread:
Code:
# Edit the file /etc/default/grub
nano /etc/default/grub
# change the variable GRUB_CMDLINE_LINUX_DEFAULT from GRUB_CMDLINE_LINUX_DEFAULT="quiet" to
# For AMD CPU
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt"
# For INTEL CPU
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt"

# Then Update Grub and reboot
update-grub
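
One assumption I should flag on that recipe: it is the GRUB flavour. If it turned out this host boots via proxmox-boot-tool / systemd-boot instead, my understanding is the kernel command line lives in /etc/kernel/cmdline, roughly:

Code:
# only if the host boots via proxmox-boot-tool / systemd-boot rather than GRUB
nano /etc/kernel/cmdline     # append e.g. intel_iommu=on iommu=pt to the single line
proxmox-boot-tool refresh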

I have not tried this yet, but once I do, I will update the forum thread here to confirm whether or not it made any difference.

-Tim
 
Story Time

I did have a somewhat similar problem recently on a home lab machine. I was doing something off the beaten path - I think side-grading a Debian 12 system to Proxmox 8 - and it rebooted fine the first time (or so I thought), but wouldn't boot up again. I had also changed some BIOS settings... and then reset them to factory defaults, which wiped the UEFI menu - so who knows!?

Anyway, I knew it was a fool's errand, but I dug into it just for fun and learning. I was able to add entries to the boot menu from the BIOS, pointing at \efi, but the Proxmox entry wouldn't boot. There was also a previous Ubuntu config which I was able to add to the boot menu, and that did boot (with some manual GRUB changes to point it at the right drive).

Ultimately, I did the sane thing and reinstalled. I'm pretty sure it got the same kernel version that wasn't booting, but with a fresh install it booted just fine.

So, that all said...

Two Suggestions

1. Have you tried running Proxmox Rescue from the Install ISO?

I've had really good success just letting it do its thing - I know it will automatically fix some classes of boot problems by regenerating the GRUB config and UEFI entries.

Granted, it would be better to know the root cause, but I'd be curious to see if the problem is or isn't in the category of what the installer fixes.

(In that specific case above, the Rescue Boot wouldn't work because of the base Debian system, but in other cases it's worked its magic and I've learned to trust it.)
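
For what it's worth, the manual equivalent of what the Rescue boot does can roughly be reproduced from any live/installer shell. This is only a sketch assuming the standard LVM layout with the pve-root volume - adjust names to the actual setup:

Code:
# from a live/installer shell: activate LVM, mount the installed root, and chroot in
vgchange -ay
mount /dev/mapper/pve-root /mnt
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt

# then rebuild the initramfs and bootloader config
update-initramfs -u -k all
update-grub    # or proxmox-boot-tool refresh, depending on how the box boots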

2. Could you do a Saturday-night reinstall? Back up /etc/, /var/lib/vz, and the VMs, then restore? Perhaps using PBS for the VMs, on a spare desktop-class computer with sufficient storage?

If you've kept the PVE host clean, or you have a backup of /etc/, it's often simpler to reinstall (10-15 minutes, plus however long the restore takes) than to dig deep into strange issues.
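
If it helps, the kind of pre-reinstall backup I mean is nothing fancy - roughly something like this (paths are just placeholders, adjust to whatever you actually care about, and the VMs themselves would go to PBS):

Code:
# rough sketch: stash the host config somewhere off-box before a reinstall
tar czf /mnt/backup/pve-host-$(hostname)-$(date +%F).tar.gz \
    /etc /var/lib/pve-cluster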

Obviously, it should Just Work™ - subscription or not - but as with any OS, going through multiple upgrade cycles often leaves unfortunate residue.

If you haven't tried those two things, I say try them. I'd be interested to hear what comes of it.
 
Hi, thanks for the suggestion. I had not tried the Installer ISO > Rescue option. In theory I can try that as an educational exercise; it might be good to see what it tells me.

I am not super keen to do a reinstall 'just to see how it goes', but it is indeed an option to try, I suppose. I do have this Proxmox host linked to a PBS host, so in theory I can: stop all VMs / make sure backups are current / nuke the Proxmox host / clean-install it from the installer ISO with the latest version / see if it boots OK and reliably / then, if yes, restore all the VMs and see how it goes. But really I am not keen to do this exercise.

I do know I banged my head on this grub kernel boot option topic about a year ago, on different but similar hardware (a somewhat old 'robust' Dell server with a Dell version of a similar-ish LSI RAID card - a PERC). In that situation it was precisely the same drama: the older Linux booted fine but newer, modern Linux refused to boot unless I added the IOMMU-related flags to the GRUB boot config, and that same issue was impacting newer Proxmox in the same way, more or less. Just in that other case, I seem to recall, the boot process went further / I had more dmesg kernel output visible / and then the RAID controller was clearly pissed off - or at least the LSI RAID controller driver was not happy - and things were sad.
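
Next time it drops me to that prompt, I figure I can also poke around from the initramfs shell itself before warm-rebooting. Something along these lines should work there (just a sketch - the busybox shell is limited, but these bits are normally present when root is on LVM):

Code:
# at the (initramfs) prompt: did the RAID controller and its disks show up at all?
dmesg | grep -i megaraid
ls /dev/sd* /dev/mapper

# is LVM seeing the pve volume group and root LV?
lvm vgscan
lvm lvs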

So. Anyhoo. I will give this thing a poke, but not this weekend, as I don't have a maintenance window open for at least 9-12 days, so I won't update the thread for a while. But once I get to it I will follow up, just to tidy up the loose thread in the forum.

thank you!
Tim
 
Sorry, I probably phrased that too softly: I don't mean to try it just as an exercise like "throw it at the wall and see what sticks". I mean that, in my experience, a reinstall is typically a faster way to solve odd "one-off" problems related to boot/kernel and corosync than troubleshooting them, because the problems are nuanced and the default config is usually correct.

If the Rescue boot works, then it will probably fix it right then and there and reboot normally from then on out.

If the Rescue boot works, but the next boot without it doesn't, then I highly suspect that there's some local/manual config that wasn't updated or modified as part of the upgrade process, but that needs to be cleaned out and replaced.

If the Rescue boot doesn't work at all, I would be more cautious about a Reinstall. I certainly wouldn't want you to risk it over a newer kernel when you have a working system!

One other bit of information: You shouldn't have to add any iommu kernel flags. Those were necessary in the 5.x series of kernels for some time, but it became automatic before 6.x.

The Proxmox documentation specifically references 6.8 https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_general_requirements, but I'll hazard a guess that that's being conservative relative to some other update in the documentation, as I've seen the 5.x number in kernel announcements (I just don't have that source on hand). Whenever it became the default in Linux proper, Proxmox probably lagged behind a few cycles; regardless, we're past the 6.8 minimum now.
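
If you want to double-check that on the 6.8 kernel without adding any flags, a generic check like this should show whether the IOMMU came up on its own (nothing Proxmox-specific, just the usual dmesg grep):

Code:
# shows DMAR / IOMMU initialisation messages if the IOMMU is active
dmesg | grep -i -e dmar -e iommu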

Have a good one!
 
OK, thanks for the added detail. This is 'weird' because this setup is very (very) vanilla - the stock Proxmox ISO installer was used originally for deployment. It has been through a few in-place upgrades over the last ~few years, but nothing normally problematic with that. The hardware is ~plain (Lenovo, stock parts, stock RAID card, etc.). Not a custom black-box install, not a customized hardware/software config. So it is definitely an adventure waiting to happen. I just need to get a maintenance window and kick at it a bit. At least I've got a solid working config for now as a baseline (i.e., the pinned old kernel).

thank you!

Tim