Grub versions newer than 2.02-pve4 fail to load

Mike Nolan

New Member
Jul 27, 2017
Problem:

If I upgrade from grub 2.02-pve4 to anything newer (from the Jessie or Stretch repos), the system freezes on the next reboot when it tries to load grub. The freeze happens before any grub menu or rescue shell appears.

Grub reports no errors when the new versions are installed, and nothing is printed to the console when booting from disk; grub simply does not appear to load at all (viewed from the iDRAC console). Grub 2.02-pve4 and earlier work as expected. I have tested this with ext4, xfs, and zfs as the root filesystem.

I have also verified that on Debian 9 I experience the same problem.

My current workaround is to install Proxmox 4.1 (last iso that has a working grub), pin the grub packages, and then upgrade to 5.0.
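For anyone wanting to try the pinning step, here is a minimal sketch using an apt preferences stanza. The package list, the priority of 1001, and the file name in the example comment are my assumptions and are not taken from the post; adjust them to whatever grub packages are actually installed.

```shell
#!/bin/sh
# Print an apt_preferences(5) stanza that pins the GRUB packages to one version.
# Assumptions: the four package names below and the priority of 1001
# (a priority >= 1000 also permits a downgrade to the pinned version).
grub_pin_stanza() {
    version="$1"
    printf 'Package: grub-pc grub-pc-bin grub-common grub2-common\n'
    printf 'Pin: version %s\n' "$version"
    printf 'Pin-Priority: 1001\n'
}

# Example (as root; the file name is hypothetical):
#   grub_pin_stanza 2.02-pve4 > /etc/apt/preferences.d/grub-pin
#   apt-cache policy grub-pc    # check that the pin is picked up
```

The function only prints the stanza, so it can be reviewed before it is written anywhere under /etc/apt.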

System:

Dell PowerEdge R710, with an LSI/Avago 9211-8i (IT mode) HBA - booting from BIOS mode.
All firmware is up to date on both the R710 and the 9211-8i as of yesterday (7/26/2017).
I have run hardware diagnostics and I have not found any errors with the physical components of the server.

Attached are some logs with more details on my hardware, but I'm not sure how useful they will be.

Any help or advice on troubleshooting would be appreciated.
 

Attachments

  • dmesg.txt (71.2 KB)
  • dmidecode.txt (1.7 KB)
  • inxi.txt (2.4 KB)
Does Proxmox always try to install grub-efi-amd64-bin regardless of whether or not the system is in UEFI mode? I'm starting to wonder if that is the problem.
 
The installer does install all three -bin packages, but you can safely remove those you are not using. I doubt this is the reason for your issue, though. Could you post your /boot/grub/grub.cfg and the output of 'grub-install -vv /dev/sdX', where sdX is the disk you are booting from, for both the old working and the new non-working grub packages?
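One way to collect the requested output so the two runs are easy to compare is sketched below. The helper names and the log/config file names are made up for illustration; only `grub-install -vv` and the grub.cfg path come from the request above.

```shell
#!/bin/sh
# Build a file name that records which grub-pc version produced the output.
# $1: kind of file ("install" or "cfg"), $2: package version
grub_capture_name() {
    printf 'grub-%s-%s.txt\n' "$1" "$2"
}

# Save grub.cfg and the verbose grub-install output for the installed version.
# $1 is the boot disk, e.g. /dev/sdX. Must be run as root.
capture_grub_info() {
    disk="$1"
    version="$(dpkg-query -W -f='${Version}' grub-pc)"
    cp /boot/grub/grub.cfg "$(grub_capture_name cfg "$version")"
    # -vv is very chatty and writes to stderr, so capture both streams.
    grub-install -vv "$disk" > "$(grub_capture_name install "$version")" 2>&1
}

# Example: run once with the working package and once with the broken one.
#   capture_grub_info /dev/sdX
```

Running it under each package version leaves two pairs of files that can be diffed or attached to the thread.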
 
I had already gone through this pain when 2.02-pve5 came out. Even posted here back then: https://forum.proxmox.com/threads/grub2-2-02-pve5-breaks-on-gpt-efi-disks-like-zfs-roots.30944

I had no choice but to reinstall 2.02-pve4 and then `apt-mark hold grub-pc=2.02-pve4`. Honestly, if the system booted properly, and I am in my production rootfs doing upgrades, why the hell would I want to upgrade the bootloader? The proxmox team should really review their upgrade policy for such packages during a stable line lifecycle.

For the jessie to stretch upgrade (4.x to 5.x), one would expect such breakage. I even removed my "apt hold" just before the upgrade, thinking that this might be an old issue... It turned out the issue may be with the combination of a Dell PowerEdge and LSI (Avago) HBA cards.

So I ran into the exact same issue after the 5.x upgrade. After several hours of learning (30 reboots at 4 min each!!), I found that:
- The R510 EFI boot mode doesn't see ANY of the HBA disks (even the disk projected as the boot disk).
- Booting in BIOS mode only works with grub-pc=2.02-pve4; with later versions, it hangs for possibly hours before booting.

So to fix it, I booted the latest Linux Mint live disk from a USB key (or the virtual drive), installed the zfs tools, mounted my pool (https://github.com/zfsonlinux/zfs/wiki/Ubuntu-16.04-Root-on-ZFS#rescuing-using-a-live-cd), then did the reinstall-old-grub-pc and apt-mark hold dance thingy.
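Roughly, the rescue sequence could look like the sketch below. It is not the exact set of commands from the post: the pool name `rpool`, the /mnt mount point, and the package list are assumptions, and the linked wiki page remains the reference for the import and chroot details. By default the function only prints the commands (dry run) so they can be checked first.

```shell
#!/bin/sh
# Print each step when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: pool name (assumed "rpool"), $2: boot disk, $3: grub version to restore
rescue_grub() {
    pool="${1:-rpool}"
    disk="$2"
    version="$3"
    # Import the pool with /mnt as alternate root (assumes the root dataset
    # then mounts at /mnt).
    run zpool import -f -R /mnt "$pool"
    # Make the live system's device and kernel filesystems visible in the chroot.
    for fs in dev proc sys; do
        run mount --rbind "/$fs" "/mnt/$fs"
    done
    # Reinstall the old packages, rewrite the boot code, then hold the packages.
    run chroot /mnt apt-get install "grub-pc=$version" "grub-pc-bin=$version" \
        "grub-common=$version" "grub2-common=$version"
    run chroot /mnt grub-install "$disk"
    run chroot /mnt apt-mark hold grub-pc grub-pc-bin grub-common grub2-common
}

# Example (dry run): rescue_grub rpool /dev/sdX 2.02-pve4
# To execute for real, as root in the live environment: DRY_RUN=0 rescue_grub ...
```

Keeping the dry run as the default is deliberate, since a wrong disk or pool name here can make things worse.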

Hope this helps someone!
 
Figured out the Dell problem...

If you enable the new feature flags on your pool (zpool upgrade), the old grub 2.02-pve4 will not see your pool anymore.

error: [1] incorrect dnode type: 196 != 16
(thanks for this great error message GRUB!)
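Before running zpool upgrade, it may help to see which feature flags a pool is already using. A small sketch, assuming the usual four-column output of `zpool get all` (NAME, PROPERTY, VALUE, SOURCE); the helper name and the pool name `rpool` in the example are mine:

```shell
#!/bin/sh
# Read `zpool get all <pool>` output on stdin and print the names of the
# feature flags whose value is "active" (in use on disk, not just enabled).
active_features() {
    awk '$2 ~ /^feature@/ && $3 == "active" { sub(/^feature@/, "", $2); print $2 }'
}

# Example:
#   zpool get all rpool | active_features
```

Which of these features a given grub build can read is a separate question that this sketch does not answer; it only lists what is active.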

Hence this topic creeping up again: after the pool upgrade I needed a newer grub, which brings back the hang. Upgrading iDRAC fixes that (in my case at least).

Check if your iDRAC is at the latest version. The clunky EFI system management module has an update component which checks ftp.dell.com and automatically finds all updates to apply. Slow as hell, but it works.

The hang seems to be related to a virtual floppy disk (fd0), which grub scans repeatedly while discovering the pool. Instead of failing instantly, each scan probably takes a second to time out (the drive is "empty", with no "floppy" inserted), so the discovery process takes forever.
 
