Servers stuck at grub rescue (attempt to read or write outside of disk hd0)

mmitech

Half of my servers are stuck at grub rescue after the latest update (pve-manager/7.2-11/b76d3178, kernel 5.15.60-1-pve), and I am afraid to even restart the other servers.
I spent all day yesterday and today trying to fix the grub issue, to no avail. I tried https://pve.proxmox.com/wiki/Recover_From_Grub_Failure and a bunch of similar grub-install/grub update methods, none of which worked.
This is what I get: "error: attempt to read or write outside of disk hd0"

 
I am hoping there is a way to fix this without having to reinstall Proxmox. I have multiple VMs on these servers that would take forever to recover.
 
I have tried multiple grub installs, config changes, etc., with no success.

I don't want to reinstall the OS. I have a lot of data that I'd need to migrate on at least 3 servers, so I would rather fix this issue instead.

 
this is usually the result of
- big disks or hardware raid (the latter might just "lie" about the disk size)
- in combination with legacy bios

grub gets the disk size from the bios/HW raid firmware and honors it, even if the reported size is lower than the actual disk size. if any of the files it needs to read (grub modules, kernel, initrd) lie outside of this boundary, you will get this error.

there are a few solutions/workarounds:
- switch to UEFI which handles modern disk sizes (but might not help if you are using HW raid)
- move /boot to another disk/partition that is small enough to not bother grub
- rewrite the affected files (workaround, requires "luck" so they end up in the front part of the disk)
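Since the problem is tied to legacy BIOS boot, it can help to confirm which mode a system is actually using before picking a workaround. A minimal sketch, assuming a Linux system: the kernel exposes /sys/firmware/efi only when it was booted via UEFI. The function name boot_mode and its optional path argument are illustrative, and the result describes whatever system is currently running (for example a live CD), not the broken install.

```shell
#!/bin/sh
# Report whether the running system was booted via UEFI or legacy BIOS.
# /sys/firmware/efi exists only on UEFI boots.
# The path is a parameter only so the check can be tried against another directory.
boot_mode() {
    efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo "uefi"
    else
        echo "legacy"
    fi
}

boot_mode
```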
 
Indeed, I am using a hardware RAID10 (16 x 900GB SAS disks with a Dell PERC H710 controller). This configuration had been working for a year without any issue until the last update: 3 of the 5 servers I restarted got stuck at grub rescue.

> - move /boot to another disk/partition that is small enough to not bother grub
> - rewrite the affected files (workaround, requires "luck" so they end up in the front part of the disk)

I would be interested in those 2 solutions. However, I tried to move grub to a USB drive: I created 2 partitions (EF02 from sector 34 to 2047, and EF00 from sector 2048 to sector 1050623), then booted a live CD, mounted the old disk, and ran grub-install --boot-directory=/mnt /dev/sdb (sdb being the USB drive).

When I try to boot, I get the same error, but now for hd1 instead of hd0. Am I missing something? Is there a guide I should follow for that?

What about rewriting files? Which files do I need to rewrite?
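For reference, the sector ranges quoted in this post translate into partition sizes as follows. This is only a small arithmetic sketch assuming 512-byte sectors (an assumption, the post does not state the sector size); part_size_kib is an illustrative helper, not an existing tool.

```shell
#!/bin/sh
# Size in KiB of a partition given its first and last sector (inclusive),
# assuming 512-byte sectors.
part_size_kib() {
    first=$1
    last=$2
    echo $(( (last - first + 1) * 512 / 1024 ))
}

part_size_kib 34 2047        # EF02 partition: prints 1007 (about 1 MiB)
part_size_kib 2048 1050623   # EF00 partition: prints 524288 (512 MiB)
```

The numbers only confirm the layout; they say nothing about which partition GRUB reads its files from at boot.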
 
> Indeed I am using a hardware Raid10 (16 x 900GB SAS disks with a Dell PERC H710 controller), this configuration has been working for a year without any issue, until the last update, 3 of the 5 servers I restarted were stuck at grub rescue.
yeah, then likely the raid controller tells Grub something like "the disk is 4TB", while the virtual disk it exposes is bigger. the issue has always been there, you just avoided triggering it so far.
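To put rough numbers on this: a RAID10 array offers half of its raw capacity, so 16 x 900GB gives a 7200 GB virtual disk, well above the "something like 4TB" figure used as an illustration above. The sketch below is plain arithmetic; the 4000 GB reported size and the 5000 GB file offset are hypothetical values, not measurements from these servers.

```shell
#!/bin/sh
# Usable capacity of a RAID10 array in GB: half of the raw capacity,
# because every disk is mirrored.
raid10_usable_gb() {
    disks=$1
    size_gb=$2
    echo $(( disks * size_gb / 2 ))
}

# Succeeds if an offset (in GB) lies beyond the size the firmware reports.
beyond_reported() {
    offset_gb=$1
    reported_gb=$2
    [ "$offset_gb" -gt "$reported_gb" ]
}

echo "usable: $(raid10_usable_gb 16 900) GB"

# Hypothetical: firmware reports 4000 GB, a boot file sits at offset 5000 GB.
if beyond_reported 5000 4000; then
    echo "a file at 5000 GB would be outside the reported disk"
fi
```

Any boot file that an update happens to place past the reported boundary would then be unreadable for grub, which fits the setup working for a year before failing.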
> I would be interested in those 2 solutions, however, I tried to move grub on a USB drive, I created 2 partitions ( EF02 from sector 34 to 2047 and the other EF00 from sector 2048 to sector 1050623) then I booted with live cd mounted the old disk and did grub-install --boot-directory=/mnt /dev/sdb (sdb being the USB)
>
> when I try to boot I get the same error but now hd1 instead of hd0, am I missing something, is there a guide I should follow for that?
grub-install only installs the bootloader to that device. you also need to move /boot there, and then re-run grub-install and update-grub.
> what about rewriting files? which files do I need to rewrite?
helpfully, grub doesn't tell you which file it tries to read ;) like I said, it can be any of the grub files in /boot, the kernel, or the initrd.
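Since grub does not say which file is out of reach, one way to try the "rewrite" workaround is to rewrite everything under /boot. The sketch below copies each file and renames the copy over the original, which on many filesystems allocates fresh blocks. Whether the new blocks land in the readable part of the disk is still a matter of luck, as noted above. The helper names rewrite_file and rewrite_tree are illustrative, and the /mnt mount point in the example comment is an assumption.

```shell
#!/bin/sh
# Rewrite a file by copying it and renaming the copy over the original,
# so the filesystem has to allocate blocks for the contents again.
rewrite_file() {
    f=$1
    cp -p -- "$f" "$f.rewrite.tmp" && mv -f -- "$f.rewrite.tmp" "$f"
}

# Rewrite every regular file below a directory and print each path.
rewrite_tree() {
    find "$1" -type f | while IFS= read -r f; do
        rewrite_file "$f" && echo "rewrote $f"
    done
}

# Example, run as root from a rescue system with the installed root
# filesystem mounted at /mnt (assumed mount point):
#   rewrite_tree /mnt/boot
```

Reinstalling the bootloader and regenerating the initrd also rewrite some of these files, so the copy-and-rename pass is just one of several ways to get the same effect.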
 
> grub-install will only install the bootloader to that device, you also need to move /boot to that and then re-run grub-install and update-grub.
OK, any idea how I would do that? Are there any docs out there I could read or follow? I am trying to save these servers without having to reinstall (avoiding that at all costs :D)
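The steps described in the quoted reply (move /boot, then re-run grub-install and update-grub) might look roughly like the sequence below. This is a hedged sketch, not a tested recovery procedure: the script only prints the commands so they can be reviewed first, and every device name passed in is a placeholder. /dev/pve/root assumes a default LVM-based install, and /dev/sdb with /dev/sdb2 assumes the USB layout described earlier in the thread, with the second partition reformatted as ext4.

```shell
#!/bin/sh
# Print (do not execute) one possible command sequence for moving /boot
# to a small separate partition from a live/rescue system.
print_boot_move_plan() {
    root_dev=$1    # device holding the installed root filesystem (placeholder)
    boot_part=$2   # small partition that will hold /boot (placeholder)
    boot_disk=$3   # disk the BIOS boots from, containing boot_part (placeholder)
    cat <<EOF
mount $root_dev /mnt
mkfs.ext4 $boot_part
mkdir /tmp/newboot
mount $boot_part /tmp/newboot
cp -a /mnt/boot/. /tmp/newboot/
umount /tmp/newboot
mount $boot_part /mnt/boot
echo "UUID=\$(blkid -s UUID -o value $boot_part) /boot ext4 defaults 0 2" >> /mnt/etc/fstab
for d in dev proc sys; do mount --bind /\$d /mnt/\$d; done
chroot /mnt grub-install $boot_disk
chroot /mnt update-grub
EOF
}

print_boot_move_plan /dev/pve/root /dev/sdb2 /dev/sdb
```

The key difference from running grub-install alone is that the kernel, initrd, and grub files end up on the small partition, and /etc/fstab mounts it at /boot so later kernel updates are written there too.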
 
> - switch to UEFI which handles modern disk sizes (but might not help if you are using HW raid)
I can confirm that switching to UEFI works WITH HW Raid.
 
> I can confirm that switching to UEFI works WITH HW Raid.
yeah, it depends on the exact hardware and firmware involved, that's why I qualified it like that ;)
 