Grub Rescue: checksum verification failed

ca_maer
I have an issue after applying the latest update for Proxmox 5.1. The server is a ProLiant DL380 G6 using ZFS and an HBA. It no longer boots and is stuck in grub rescue with the error: checksum verification failed

Is there a way to fix this? Is booting the old kernel from the grub rescue shell possible? I'm pretty sure my last working kernel was 4.13.8-2-pve on the latest Proxmox 5.1.

Here are my current options:


Booting from the live ISO in repair mode doesn't work either; it fails with: Unable to find boot device. Running "zpool list" from the debug installation shows no pools available.


Thanks
 
what does "zpool import" say when booted using a live cd with ZFS support (or the installer in debug mode)? "zpool list" is only for already imported pools.
 
Hey Fabian,

Thanks for the quick response. Here's what my zpool import reports:
(attached screenshot: Screen Shot 2018-01-09 at 9.16.23 AM.png)

This can't be good. Everything was working correctly before the update, so I'm assuming it can't be hardware related.
 
what about "zpool import -d /dev" ? the installer environment and ZFS don't like the full by-id paths..
 
then the next step would be to actually import it (use -N and -R!) and do a scrub to see if there actually is a checksum which cannot be verified. you can also dump all the pool and dataset properties while you have it imported and post them here ;)
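a sketch of what that could look like, assuming the pool is named rpool and using /mnt as the alternate root:
Code:
zpool import -d /dev -N -R /mnt rpool   # -N: don't mount any datasets, -R: use /mnt as altroot
zpool scrub rpool                       # start the scrub
zpool status rpool                      # check scrub progress and result
zpool get all rpool                     # pool properties
zfs get all rpool/ROOT/pve-1            # dataset properties (root dataset as an example)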
 
Scrub found nothing:
Code:
zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h12m with 0 errors on Tue Jan  9 15:23:49 2018
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdc2    ONLINE       0     0     0
        sdf2    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sde2    ONLINE       0     0     0
        sdd2    ONLINE       0     0     0
    logs
      mirror-2  ONLINE       0     0     0
        sda2    ONLINE       0     0     0
        sdb2    ONLINE       0     0     0
    cache
      sda3      ONLINE       0     0     0
      sdb3      ONLINE       0     0     0

errors: No known data errors

zpool list
Code:
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  1.09T   112G  1000G         -    15%    10%  1.00x  ONLINE  /mnt

zpool properties:
Code:
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           1.09T                          -
rpool  capacity                       10%                            -
rpool  altroot                        /mnt                           local
rpool  health                         ONLINE                         -
rpool  guid                           10146041935939188713           -
rpool  version                        -                              default
rpool  bootfs                         rpool/ROOT/pve-1               local
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      none                           local
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupditto                     0                              default
rpool  dedupratio                     1.00x                          -
rpool  free                           1000G                          -
rpool  allocated                      112G                           -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  15%                            -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              enabled                        local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            enabled                        local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local

dataset properties:
(See attached file; too long to post in a code tag.)

This node was mainly used as a ZFS replication slave
 

Attachments

  • zfs.txt
    184.8 KB · Views: 41
Ok, after exporting the pool and rebooting, everything is fine. I'm not sure what might have caused this. Any idea? I have multiple servers to update and I don't want this to happen again.

Thanks
 
that sounds very strange indeed. maybe some kind of feature upgrade was still in progress and Grub does not like that (e.g., userobj_accounting runs as a kind of background job on the existing datasets when you activate it). did you reboot right after running "zpool upgrade"? or was this a 0.7 install from the beginning?
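to see whether any features are still pending you could check something like this (a sketch; note it only shows whether a feature was activated at all, not whether its background work has finished):
Code:
zpool upgrade                          # with no arguments, lists pools that don't have all features enabled
zpool get all rpool | grep feature@    # per-feature state: disabled / enabled / active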
 
It was a 0.7 pool from the beginning. What I did, in order, was:
1. Import pool with -d
2. Scrub pool (No error found)
3. Export Pool
4. Reboot
 

very strange. if it occurs again, can you try the following in the grub rescue shell
Code:
cat (hd0,gpt2)/ROOT/pve-1/@/boot/grub/grub.cfg

if that prints an error, set up grub to use a serial console, and repeat the command after running
Code:
set debug=zfs

and post the resulting dump.
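switching grub to a serial console can be done from the grub command line itself, roughly like this (a sketch assuming serial unit 0 at 115200 baud; the rescue shell may need the corresponding modules loaded first):
Code:
serial --unit=0 --speed=115200    # talk to the first serial port
terminal_input serial             # use it for input
terminal_output serial            # and for output
set debug=zfs                     # enable ZFS debug output
cat (hd0,gpt2)/ROOT/pve-1/@/boot/grub/grub.cfg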
 
I was able to boot into a grub shell on the same server and run the command above, which gives some errors, but the server boots fine, so I'm not sure if this is related. You'll find the output attached.
 

Attachments

  • grub.txt
    17.8 KB · Views: 91
I'll see if I find some time next week to investigate this - thanks for the dump!
 
In my case, I cannot run a scrub because it is resilvering one of the disks. I tried rebooting, but to no avail. What should I do?
 
Hi, I did the same steps as ca_maer, no errors on the scrub, but when I rebooted I ran into another error (see attached).
Do you have any ideas?
Thank you
 

Attachments

  • error servidor mar 23 .jpg
    17.3 KB · Views: 68
I am also in the same boat: JBOD disk configuration with an HBA in IT mode. After a graceful shutdown, we have a grub error "no such device". No patches were applied beforehand and the shutdown was clean; the ZFS root had been functioning for a long time with no issues. After hitting this, I wish I had kept my old config of RAID1 + ext4 for root and the remainder for ZFS. I am not sure how to fix this either. I may just tediously back up all my data (I can still access the rpool), reformat, and restore, switching to the ext4 config mentioned before. As the pool is rather large, this is going to be painful. If someone has a quick fix, it would be great.
 
I'd suggest leaving the zpool as it is and just doing a clean install to ext4. From there you can use the zpool as storage, which will give you access to the VM files again. After transferring the PVE configuration files from the rootfs inside the zpool, you might be able to use your stuff as it was before. I'm not sure how clean this is, but I did it this way and I still like the thought of having the VM files on a zpool.
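Roughly how that could look after the fresh ext4 install (a sketch; dataset names assumed to match the old rpool layout, double-check paths before copying anything):
Code:
zpool import -N -R /mnt rpool      # import the old pool without mounting anything, altroot /mnt
zfs mount rpool/ROOT/pve-1         # mount just the old root filesystem, it ends up under /mnt
ls /mnt/var/lib/pve-cluster/       # on a standard PVE install the pmxcfs database (config.db) lives here
# copy the configuration you need from the old root before reusing the pool as VM storage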
 
