Upgrade PVE 6.x to 7.x : Grub issues

Two new files.
 

Attachments

  • Output from the non-working node.txt
  • Output from a working node.txt
okay, so for some strange reason grub is unable to find the pve/root LV (it does see and skip the thin LVs on pve/data and the thin pool itself). I don't think we can get further on that front unless you are comfortable with running grub-probe under gdb and debugging grub's LVM implementation.

what does the full generated grub.cfg look like on the node with the errors? did you already reboot since the upgrade (don't do it yet please, just tell me if you have ;))?
 
The auto-generated grub.cfg is really different on the non-working node compared to the working nodes.

The non-working node still reboots fine, but I'm not comfortable using it with a potential grub failure.
 
could you please still post the generated grub.cfg files?
 
Here is the grub.cfg file from the non-working node.

It contains much less information than the grub.cfg from a working node.
 

Attachments

  • grub.cfg.txt
okay, so that is indeed missing the explicit module loading for gpt and lvm2, but it seems those are loaded by default anyway, otherwise booting would not work.

I am afraid you are hitting some sort of bug in Grub's LVM parser (possibly because your / NVMe is on the big side). further debugging would require stepping through the grub-probe call in a debugger and printing more debugging output at various points. the last changes in grub's LVM code fixed various security issues with overflows and reads beyond buffer limits; possibly your disk happens to trigger one of those new checks and they are not 100% correct.

could you maybe generate a VG backup using vgcfgbackup pve on both nodes? it should generate a file in /etc/lvm/backup/pve containing information about the VG structure, maybe we can spot some peculiarity that trips up grub there.
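
To make that comparison easier, here is a minimal sketch of one way to compare the two backups. It assumes the usual text layout of an LVM metadata backup; the helper name normalize_vgcfg, the /tmp output paths and the nodeA/nodeB names are hypothetical and not from this thread. The helper drops the fields that are expected to differ between two healthy nodes (ids, sequence number, creation host and time), so the diff shows only structural differences.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): remove per-node fields from an
# LVM metadata backup so that two nodes can be compared with diff.
normalize_vgcfg() {
    sed -E \
        -e '/^# Generated by LVM2/d' \
        -e '/^[[:space:]]*(description|seqno|id) = /d' \
        -e '/^[[:space:]]*creation_(host|time) = /d' \
        "$1"
}

# Assumed workflow, run as root on each node:
#   vgcfgbackup pve                      # writes /etc/lvm/backup/pve
#   normalize_vgcfg /etc/lvm/backup/pve > "/tmp/pve-$(hostname).vgcfg"
# Then copy both files to one machine and compare them:
#   diff -u /tmp/pve-nodeA.vgcfg /tmp/pve-nodeB.vgcfg
```

Anything that still shows up in the diff (extent counts, segment layout, device names) would be a candidate for closer inspection.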
 
Hello @fabian

I performed additional tests and downgraded grub.

After the 6.x to 7.x upgrade, the grub version is grub2/testing 2.04-19 amd64.

I downgraded grub to the version from PVE 6.x (based on buster), which worked well: grub2/now 2.02+dfsg1-20+deb10u4 amd64.

I added these sources:
deb http://ftp.fr.debian.org/debian buster main contrib
deb http://ftp.fr.debian.org/debian buster-updates main contrib

I ran this downgrade:
apt install grub2=2.02+dfsg1-20+deb10u4 grub-pc=2.02+dfsg1-20+deb10u4 grub-common=2.02+dfsg1-20+deb10u4 grub2-common=2.02+dfsg1-20+deb10u4 grub-pc-bin=2.02+dfsg1-20+deb10u4 os-prober

Unfortunately, I am getting the same errors during the grub install process.

From my understanding, this issue might not be related to the grub version but to something else in the PVE upgrade from 6.x to 7.x.

I've set up a new server for testing, so if you like, I can run some tests in a safe environment.

Regards,
 
Hello Fabian,

I've generated a VG backup using vgcfgbackup pve on both nodes, and they are identical except, of course, for the node name and the lvmid (PV & VG).

I have no idea why the upgrade went wrong on only one node, so I decided to do a fresh install of PVE 7.x on this node, and it was a success.

This is not related to a hardware failure or hardware incompatibility.

All these nodes were previously upgraded from 5.x to 6.x and 6.x to 7.x.

It seems that the LVM detection process on my NVMe SSD went wrong due to a bug or to an SSD logical partition error.

In my opinion it's not related to Proxmox, but it would be interesting to check all the LVM partitions and the grub bootloader in the pve6to7 script before the upgrade.

I'll keep you posted if other nodes are also impacted.

I would like to thank you very much for your time and consideration.

Kind regards,
 
thanks for the patience with all the back and forth! unfortunately, Grub has a lot of custom code for various filesystems, volume management and disk handling that can sometimes fail in strange and unexpected ways in edge-cases. this is one of the reasons we switched to no longer booting directly from ZFS with Grub. it's possible you hit one of those edge-cases that the "real" LVM supports without any issues, but the grub "LVM" takes a wrong turn even though your setup is completely valid.
 
I had exactly the same issue and spent the whole day fixing it.
The solution was as follows:
- Execute all the steps from here: https://pve.proxmox.com/wiki/Recover_From_Grub_Failure
- The update-grub step will fail. At this point you need to lvextend your root logical volume, for example by 2G (298G was the new size in my case):
lvextend -L 298G /dev/pve/root
resize2fs /dev/pve/root 298G
- Now run update-grub again. Magically, it no longer gives any error.
- Now you can finally run grub-install. In my case it was grub-install /dev/nvme1n1 (my second SSD, where Proxmox is installed; my first SSD has Windows installed).

Thanks to Miquel and his post here https://forum.proxmox.com/threads/system-unbootable-grub-error-disk-lvmid-not-found.98761/ for the solution. I'm attaching a screenshot that shows update-grub before and after the lvextend. After this I ran grub-install /dev/nvme1n1 and rebooted, and Proxmox was back.
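
The steps above could be sketched as a small script, meant to be run inside the chroot set up by the wiki recovery steps. This is only a sketch under assumptions: the run and fix_grub helpers and the GROW, BOOT_DISK and DRY_RUN variables are hypothetical, /dev/sdX is a placeholder for your own boot disk, and the relative size +2G stands in for the absolute 298G used above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the workaround sequence (hypothetical helpers, not from the post).
# GROW: how much to add to pve/root. BOOT_DISK: placeholder, set it to the
# disk Proxmox boots from. DRY_RUN=1 (default) prints the commands only.
GROW="${GROW:-+2G}"
BOOT_DISK="${BOOT_DISK:-/dev/sdX}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Show the command, and execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

fix_grub() {
    # 1. Show free space in the VG; lvextend can only use free extents.
    run vgs -o vg_name,vg_free pve &&
    # 2. Grow the root LV, then grow the filesystem to fill it.
    run lvextend -L "$GROW" /dev/pve/root &&
    run resize2fs /dev/pve/root &&
    # 3. Regenerate grub.cfg and reinstall the bootloader.
    run update-grub &&
    run grub-install "$BOOT_DISK"
}
```

Calling fix_grub prints the five commands in order; running it with DRY_RUN=0 and a real BOOT_DISK executes them and stops at the first failure. The vgs call is there because lvextend allocates from free extents in the volume group and refuses to run if there are not enough of them.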

IMG_20211111_223329-4.jpg
 
Awesome!!!
Thanks for sharing.
This saved me a reinstall of my Lab Host as well.

I was puzzled why I could not boot my Proxmox machine anymore. The only thing I know is that I used a script to change the vmid for a VM.
This script uses vgrename, changes config files, etc.

Could this have triggered grub to become corrupt?

Regardless this behaviour seems rather flaky.
 
I'm curious why you chose to lvextend as opposed to lvreduce?

I'd have thought extending the LVM when there's no physical space to really do so would cause problems, especially as you then resize the filesystem?

I realise that LVM is "logical". And I also know that the data LVM is LVM-thin. But to my old fashioned way of thinking (LVM is still "magic" to me) I'm still imagining some data potentially being overwritten if you are really, really unlucky.

I also realise that reducing LVMs and the filesystem contained in them has its own dangers.

But that's the extent of my knowledge (excuse the unintended pun). I'm genuinely curious, and hoping to learn something new.
 
Thank you very much for the solution as well! I did it as em3034 described and solved the problem within half an hour, but when I first got the error I was frightened :)
Hope this does not happen again soon... good to have such a forum!
 
Great!
 
Exactly the same here, today!

Again, thanks for sharing!
 
Thanks so much for posting this as it helped me out today!!
 
