[SOLVED] PVE 7.4-x to 7.4-latest: grub failed to write to /dev/sda (then grub-install gives Disk Not Found) Debian bug 987008

Hi there,

Looking back at these posts and wondering what to do:

https://forum.proxmox.com/threads/proxmox-update-from-7-2-4-grub-update-failure.114951/page-2
https://forum.proxmox.com/threads/problem-with-grub-upgrading-pve-6-4-15.115376/

The first is the most relevant. On just one machine out of all of ours, this issue popped up during the update to the latest 7.4-x ahead of the 7-to-8 upgrade: grub was unable to install onto /dev/sda, with a warning that this may result in an unbootable system. If we chose No, it would just wind back to the same screen/prompt.

However, if we exited instead, the update finished and said to reboot for the new kernel. But we don't want to... what if it doesn't come back up?

Doing an apt autoremove gets rid of one old kernel but doesn't change the behaviour. Likewise, if we try to run grub-install.real /dev/sda manually, we get:

Code:
# grub-install.real /dev/sda
Installing for i386-pc platform.
grub-install.real: error: disk `lvmid/hSR5FY-F3Nd-Dr52-Dqlk-ooBd-W5rj-r9c9kP/smq4EU-juQG-ft5n-gqMV-7NjL-aUSW-bHJxCW' not found.

df shows disk usage as minimal, so the server isn't overloaded in that sense. This feels like it could be an odd bug, or perhaps a configuration issue?

We have pressed pause for now, as we don't want to carry on if there are risks ahead. Hopefully we can get to the bottom of it!

Thank you for any assistance. :)

(Screenshots attached: Screenshot 2023-10-20 at 10.55.22 pm.png, Screenshot 2023-10-20 at 10.55.47 pm.png)

Then, after the OK prompt with all the text, it listed three options: /dev/sda, /dev/sda3 (LVM), or the pve-root one (dm-8?). Choosing /dev/sda did the same thing.
 
Hi, this might be an indication that grub won't come up after a reboot, due to a bug in its LVM metadata parser [1]. Can you please post the output of the following commands, so we can check if this is indeed the case?

Code:
dpkg -l | grep grub
vgscan -vvv 2>&1 | grep metadata

Indeed I would advise against rebooting right now. Are you booting in legacy mode or UEFI mode?

EDIT: If you are on PVE 7 and the output of vgscan contains a line like Reading metadata ... (+N) where N is anything but 0, there is a wraparound in the LVM metadata buffer and grub will most likely fail to boot after a reboot. In that case, please follow the suggestion in [1] and trigger an LVM metadata update, e.g. lvcreate -L 4M pve -n grubtemp. Afterwards, there should be no wraparound anymore (vgscan shows Reading metadata ... (+0)), re-installing grub should succeed, and the host should reboot fine.
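To illustrate the check, here is a small detection sketch. It is my own illustration rather than part of the wiki fix: the parse_wrap helper is hypothetical, and the line format follows the vgscan output shown in this thread.

```shell
#!/bin/sh
# parse_wrap: extract N from the "(+N)" suffix of a vgscan
# "Reading metadata summary" line (reads stdin, prints N).
parse_wrap() {
  sed -n 's/.*Reading metadata summary from [^ ]* at [0-9][0-9]* size [0-9][0-9]* (+\([0-9][0-9]*\)).*/\1/p' | head -n1
}

# On a live system (as root) you would feed it real output:
#   wrap=$(vgscan -vvv 2>&1 | parse_wrap)
# Here we use the line reported later in this thread:
wrap=$(printf 'Reading metadata summary from /dev/sda3 at 1042944 size 5632 (+2485)\n' | parse_wrap)

if [ -n "$wrap" ] && [ "$wrap" -ne 0 ]; then
  echo "wraparound detected (+$wrap); grub 2.06 may fail to boot"
else
  echo "no wraparound"
fi
```

Anything other than (+0) means the metadata ring buffer has wrapped and the buggy parser in grub 2.06 may trip over it.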

[1] https://pve.proxmox.com/wiki/Recove...disk_not_found.22_error_when_booting_from_LVM
 
Thank you Friedrich, that sounds likely to be related.

I spent time reading https://pve.proxmox.com/wiki/Recover_From_Grub_Failure (bottom section); however, the below may also be related:
https://pve.proxmox.com/wiki/Upgrade_from_7_to_8#Unable_to_boot_due_to_grub_failure

It talks about proxmox-boot-tool needing to be installed in place of the old-school pre-6.4 configuration:

https://pve.proxmox.com/wiki/ZFS:_Switch_Legacy-Boot_to_Proxmox_Boot_Tool

Once fixed, what is the best-practice method for grub-install, update-grub, etc.?

dpkg output:

Code:
# dpkg -l | grep grub
ii  grub-common                          2.06-3~deb11u6                 amd64        GRand Unified Bootloader (common files)
ii  grub-efi-amd64-bin                   2.06-3~deb11u6                 amd64        GRand Unified Bootloader, version 2 (EFI-AMD64 modules)
ii  grub-efi-ia32-bin                    2.06-3~deb11u6                 amd64        GRand Unified Bootloader, version 2 (EFI-IA32 modules)
ii  grub-pc                              2.06-3~deb11u6                 amd64        GRand Unified Bootloader, version 2 (PC/BIOS version)
ii  grub-pc-bin                          2.06-3~deb11u6                 amd64        GRand Unified Bootloader, version 2 (PC/BIOS modules)
ii  grub2-common                         2.06-3~deb11u6                 amd64        GRand Unified Bootloader (common files for version 2)

vgscan output:

Code:
# vgscan -vvv 2>&1 | grep metadata
  metadata/record_lvs_history not found in config: defaulting to 0
  File locking settings: readonly:0 sysinit:0 ignorelockingfailure:0 global/metadata_read_only:0 global/wait_for_locks:1.
  Reading metadata summary from /dev/sda3 at 1042944 size 5632 (+2485)
  Found metadata summary on /dev/sda3 at 1042944 size 8117 for VG pve
  Reading VG pve metadata from /dev/sda3 4096
  VG pve metadata check /dev/sda3 mda 4096 slot0 offset 1038848 size 8117
  Reading metadata from /dev/sda3 at 1042944 size 5632 (+2485)
  Logical volume pve/lvol0_pmspare is pool metadata spare.
  Found metadata on /dev/sda3 at 1042944 size 8117 for VG pve
  metadata/lvs_history_retention_time not found in config: defaulting to 0
  Found volume group "pve" using metadata type lvm2

So these snippets would confirm your hunch?

grub* 2.06-3~deb11u6 on PVE 7.4-x
Reading metadata summary from /dev/sda3 at 1042944 size 5632 (+2485)
Reading metadata from /dev/sda3 at 1042944 size 5632 (+2485)

Legacy booting. XFS, so the ZFS case doesn't apply. proxmox-boot-tool is not yet in use; should we set it up before or after this "fix reboot"?

Code:
# efibootmgr -v
EFI variables are not supported on this system.

# findmnt /
TARGET SOURCE               FSTYPE OPTIONS
/      /dev/mapper/pve-root xfs    rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota

# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
E: /etc/kernel/proxmox-boot-uuids does not exist.

Regarding the fix in the documentation you mentioned (creating a 4MB LV so grub can boot, then removing it): what about grub-install? Since the docs say there is no working config, isn't there also a need to perform that install before the reboot? Or is the LV alone enough?

Also linking to the directly related thread now, which asks for debugging info. Are you still unable to replicate this?
https://forum.proxmox.com/threads/error-disk-lvmid-not-found-grub-rescue.123512/post-587764

To that end, here are the additional pieces of information you requested, even though this is PVE 7. In case it helps, as it is pre-fix.

pvdisplay & vgdisplay:

Code:
# pvdisplay
  --- Physical volume ---
  PV Name               /dev/sda3
  VG Name               pve
  PV Size               3.49 TiB / not usable 2.98 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              915311
  Free PE               4192
  Allocated PE          911119
  PV UUID               pxtsvo-ydEJ-1ocN-dgCp-MPJ7-ovHe-1Zk6u7
 
  # vgdisplay
  --- Volume group ---
  VG Name               pve
  System ID     
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  289
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                20
  Open LV               11
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               3.49 TiB
  PE Size               4.00 MiB
  Total PE              915311
  Alloc PE / Size       911119 / <3.48 TiB
  Free  PE / Size       4192 / <16.38 GiB
  VG UUID               hSR5FY-F3Nd-Dr52-Dqlk-ooBd-W5rj-r9c9kP

pvck output:

Code:
# pvck /dev/sda3 --dump headers
  label_header at 512
  label_header.id LABELONE
  label_header.sector 1
  label_header.crc 0x16f39267
  label_header.offset 32
  label_header.type LVM2 001
  pv_header at 544
  pv_header.pv_uuid pxtsvoydEJ1ocNdgCpMPJ7ovHe1Zk6u7
  pv_header.device_size 3839095717376
  pv_header.disk_locn[0] at 584 # location of data area
  pv_header.disk_locn[0].offset 1048576
  pv_header.disk_locn[0].size 0
  pv_header.disk_locn[1] at 600 # location list end
  pv_header.disk_locn[1].offset 0
  pv_header.disk_locn[1].size 0
  pv_header.disk_locn[2] at 616 # location of metadata area
  pv_header.disk_locn[2].offset 4096
  pv_header.disk_locn[2].size 1044480
  pv_header.disk_locn[3] at 632 # location list end
  pv_header.disk_locn[3].offset 0
  pv_header.disk_locn[3].size 0
  pv_header_extension at 648
  pv_header_extension.version 2
  pv_header_extension.flags 1
  pv_header_extension.disk_locn[0] at 656 # location list end
  pv_header_extension.disk_locn[0].offset 0
  pv_header_extension.disk_locn[0].size 0
  mda_header_1 at 4096 # metadata area
  mda_header_1.checksum 0x8738c460
  mda_header_1.magic 0x204c564d3220785b35412572304e2a3e
  mda_header_1.version 1
  mda_header_1.start 4096
  mda_header_1.size 1044480
  mda_header_1.raw_locn[0] at 4136 # commit wrapped
  mda_header_1.raw_locn[0].offset 1038848
  mda_header_1.raw_locn[0].size 8117
  mda_header_1.raw_locn[0].checksum 0x7bc97246
  mda_header_1.raw_locn[0].flags 0x0
  mda_header_1.raw_locn[1] at 4160 # precommit
  mda_header_1.raw_locn[1].offset 0
  mda_header_1.raw_locn[1].size 0
  mda_header_1.raw_locn[1].checksum 0x0
  mda_header_1.raw_locn[1].flags 0x0
  metadata text at 1042944 crc 0x7bc97246 # vgname pve seqno 289

grub outputs:

Code:
# grub-fstest --version
grub-fstest (GRUB) 2.06-3~deb11u6

# grub-fstest -v /dev/sda3 ls
grub-fstest: info: Scanning for DISKFILTER devices on disk proc.
grub-fstest: info: Scanning for mdraid1x devices on disk proc.
grub-fstest: info: Scanning for mdraid09 devices on disk proc.
grub-fstest: info: Scanning for mdraid09_be devices on disk proc.
grub-fstest: info: Scanning for dmraid_nv devices on disk proc.
grub-fstest: info: Scanning for lvm devices on disk proc.
grub-fstest: info: Scanning for ldm devices on disk proc.
grub-fstest: info: scanning proc for LDM.
grub-fstest: info: no LDM signature found.
grub-fstest: info: Scanning for DISKFILTER devices on disk loop0.
grub-fstest: info: Scanning for mdraid1x devices on disk loop0.
grub-fstest: info: Scanning for mdraid09 devices on disk loop0.
grub-fstest: info: Scanning for mdraid09_be devices on disk loop0.
grub-fstest: info: Scanning for dmraid_nv devices on disk loop0.
grub-fstest: info: Scanning for lvm devices on disk loop0.
grub-fstest: info: unknown LVM type thin-pool.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: unknown LVM type thin.
grub-fstest: info: Scanning for ldm devices on disk loop0.
grub-fstest: info: scanning loop0 for LDM.
grub-fstest: info: no LDM signature found.
grub-fstest: info: Scanning for DISKFILTER devices on disk host.
grub-fstest: info: Scanning for mdraid1x devices on disk host.
grub-fstest: info: Scanning for mdraid09 devices on disk host.
grub-fstest: info: Scanning for mdraid09_be devices on disk host.
grub-fstest: info: Scanning for dmraid_nv devices on disk host.
grub-fstest: info: Scanning for lvm devices on disk host.
grub-fstest: info: Scanning for ldm devices on disk host.
grub-fstest: info: scanning host for LDM.
grub-fstest: info: no LDM signature found.
(proc) (loop0) (host)

We are refraining from running the fix for now, as we'd like to assist should you want further information. (EDIT: I see you want this more for PVE 8.)

Please advise when you have the file and whether you still want it (as this is PVE 7.4 with an older grub version), so we can delete it from this post ASAP. Thank you!
 
OMG Friedrich, I knew you launched dark mode for the software, but come on, I had no idea about the forum!

It's fantastic with the orange. You should consider using orange outlines in the GUIs more? Love your stuff. :)
 
Hi, thanks for the data!
So these snippets would confirm your hunch?
Please advise when you have the file and if you want it (as this is PVE 7.4 with older grub ver), so we can delete it from this post ASAP. Thank you!
Yes, I would say it is clear that the host is affected by the grub bug I mentioned, because it is running PVE 7 (and the version of grub-pc is 2.06-3~deb11u6), and there is a wraparound in the metadata ring buffer:
Reading metadata summary from /dev/sda3 at 1042944 size 5632 (+2485)
So feel free to remove the data. Thanks!

I have just updated the wiki page [1] with more detailed information regarding this bug and the differences between PVE 7 and 8, and posted an update to the other thread [2].

In your case, as you are booting in legacy mode, I would suggest the following:
  1. Create a small logical volume to trigger a metadata update: lvcreate -L 4M pve -n grubtemp, verify that vgscan -vvv does not indicate a wraparound anymore, i.e., prints Reading metadata summary from /dev/sda3 ... (+0)
  2. Reinstall grub-pc to trigger a grub-install to /dev/sda: apt install --reinstall grub-pc. This should now work without errors.
Then, a reboot should be safe. As noted in [1], this is only a temporary workaround for PVE 7 hosts, as grub will fail to boot when there is a wraparound again. The only permanent fix is to upgrade to PVE 8.
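The two steps above can be sketched as a single session. This is a sketch assuming the VG name pve and the grubtemp LV name suggested earlier; run as root and adjust names to your setup:

```shell
# 1. Trigger a fresh LVM metadata write
lvcreate -L 4M pve -n grubtemp

# Verify the wraparound is gone: the "(+N)" suffix should now read "(+0)"
vgscan -vvv 2>&1 | grep 'Reading metadata summary'

# 2. Re-run grub-install to /dev/sda via the package
apt install --reinstall grub-pc
```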

[1] https://pve.proxmox.com/wiki/Recove...disk_not_found.22_error_when_booting_from_LVM
[2] https://forum.proxmox.com/threads/123512/post-598646
 
Then, a reboot should be safe
Thank you, I've done that to the impacted machine and it reports no wraparound now.

However, as we need to reboot into the latest 7.4.x and then into 8.0.x (i.e. two reboots, not one), should we do anything out of the ordinary for the PVE 7 to 8 upgrade?

Just to be sure we don't end up stuck with an un-bootable machine after the 2nd reboot.

I'd think it will be OK unless otherwise reported. Maybe check for the wraparound beforehand?
 
As you are booting in legacy mode, rebooting after the upgrade to PVE 8 should properly boot the new grub code in which the LVM metadata parsing bug is fixed. If the system was booting in UEFI mode, manual steps would be necessary as described (now) in the upgrade guide [1].
I'd think it will be OK unless otherwise reported. Maybe check the wraparound before?
You can do that -- if vgscan reports no wraparound, even the old version of grub should have no problem parsing the LVM metadata.

[1] https://pve.proxmox.com/wiki/Upgrade_from_7_to_8#GRUB_Might_Fail_To_Boot_From_LVM_in_UEFI_Mode
 
rebooting after the upgrade to PVE 8 should properly boot the new grub code in which the LVM metadata parsing bug is fixed
Okay great, thanks for the additional clarity!

Remove the grubtemp LV at which stage - after 1st reboot to PVE 7.4-latest? Then check vgscan before 2nd reboot?

Also, did you see my comment about orange outlines on forum vs GUIs, and that the software would benefit from it?
 
Okay great, thanks for the additional clarity!

Remove the grubtemp LV at which stage - after 1st reboot to PVE 7.4-latest? Then check vgscan before 2nd reboot?
You can also keep the LV until after the successful upgrade to PVE 8. When there is a wraparound in the metadata ring buffer, the point of creating a small LV is to make some change to the LVM metadata, so that the newly written metadata (with that new LV) no longer wraps around and grub 2.06-3 has no problem parsing it. If you want to check the vgscan output for the wraparound, I would do so immediately before the reboot. I would recommend against making any LVM metadata changes between checking vgscan and rebooting, as those might in the worst case push the metadata into wraparound again.
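The cleanup and pre-reboot check described above, sketched as a session (again assuming the pve/grubtemp names from this thread; run as root):

```shell
# After the successful upgrade to PVE 8, drop the temporary LV:
lvremove pve/grubtemp

# Optional sanity check immediately before any reboot,
# making no further LVM metadata changes in between:
vgscan -vvv 2>&1 | grep 'Reading metadata summary'
```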

Also, did you see my comment about orange outlines on forum vs GUIs, and that the software would benefit from it?
Sorry, overlooked that one. If you have feedback regarding the dark mode in Proxmox products or the forum, I'd say this thread [1] would be a better place to post it.

[1] https://forum.proxmox.com/threads/official-darkmode.74609/page-9
 
You can also keep the LV until after the successful upgrade to PVE 8.

If you want to check the vgscan output for the wraparound, I would do so right immediately before the reboot.
Thanks, that makes sense. Did this and removed it afterwards. Couldn't see any wraparound at any checkpoint.

Both reboots went well, nothing to report. Thanks very much for your help with this, glad to know it's already covered.

If you have feedback regarding the dark mode in Proxmox products or the forum, I'd say this thread [1] would be a better place to post it.
Thanks, have done that now: https://forum.proxmox.com/threads/official-darkmode.74609/post-598696
 