Grub Rescue: checksum verification failed

ca_maer
I have an issue after applying the latest update for Proxmox 5.1. The server is a ProLiant DL380 G6 using ZFS and an HBA. It no longer boots and is stuck in grub rescue with the error: checksum verification failed

Is there a way to fix this? Is booting the old kernel from the grub rescue shell possible? I'm pretty sure my last working kernel was 4.13.8-2-pve on the latest Proxmox 5.1.

Here are my current options:


Booting from the live ISO in repair mode doesn't work either; it fails with: Unable to find boot device. Running "zpool list" from the debug installation shows no pools available.


Thanks
 
what does "zpool import" say when booted using a live cd with ZFS support (or the installer in debug mode)? "zpool list" is only for already imported pools.
 
Hey Fabian,

Thanks for the quick response. Here's what my zpool import reports:
(attached screenshot: Screen Shot 2018-01-09 at 9.16.23 AM.png)

This can't be good. Everything was working correctly before the update, so I'm assuming it can't be hardware related.
 
what about "zpool import -d /dev" ? the installer environment and ZFS don't like the full by-id paths..
 
then the next step would be to actually import it (use -N and -R!) and do a scrub to see if there actually is a checksum which cannot be verified. you can also dump all the pool and dataset properties while you have it imported and post them here ;)
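a sketch of what that could look like, assuming the pool is named rpool and using /mnt as the alternate root:
Code:
zpool import -d /dev -N -R /mnt rpool   # -N: don't mount any datasets, -R: use /mnt as altroot
zpool scrub rpool                       # start the scrub
zpool status rpool                      # check scrub progress and result
zpool get all rpool                     # pool properties
zfs get all rpool/ROOT/pve-1            # dataset properties (root dataset as an example)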
 
Scrub found nothing:
Code:
zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h12m with 0 errors on Tue Jan  9 15:23:49 2018
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdc2    ONLINE       0     0     0
        sdf2    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sde2    ONLINE       0     0     0
        sdd2    ONLINE       0     0     0
    logs
      mirror-2  ONLINE       0     0     0
        sda2    ONLINE       0     0     0
        sdb2    ONLINE       0     0     0
    cache
      sda3      ONLINE       0     0     0
      sdb3      ONLINE       0     0     0

errors: No known data errors

zpool list
Code:
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  1.09T   112G  1000G         -    15%    10%  1.00x  ONLINE  /mnt

zpool properties:
Code:
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           1.09T                          -
rpool  capacity                       10%                            -
rpool  altroot                        /mnt                           local
rpool  health                         ONLINE                         -
rpool  guid                           10146041935939188713           -
rpool  version                        -                              default
rpool  bootfs                         rpool/ROOT/pve-1               local
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      none                           local
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupditto                     0                              default
rpool  dedupratio                     1.00x                          -
rpool  free                           1000G                          -
rpool  allocated                      112G                           -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  15%                            -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              enabled                        local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            enabled                        local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local

dataset properties:
(See attached file; too long to post in a code tag.)

This node was mainly used as a ZFS replication slave
 

Attachments

  • zfs.txt
    184.8 KB · Views: 41
Ok, after exporting the pool and rebooting, everything is fine. I'm not sure what might have caused this. Any idea? I have multiple servers to update and I don't want this to happen again.

Thanks
 
that sounds very strange indeed. maybe some kind of feature upgrade was still in progress and Grub does not like that (e.g., userobj_accounting runs as a kind of background job on the existing datasets when you activate it). did you reboot right after running "zpool upgrade"? or was this a 0.7 install from the beginning?
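to see whether any features are still pending you could check something like this (a sketch; note it only shows whether a feature was activated at all, not whether its background work has finished):
Code:
zpool upgrade                          # with no arguments, lists pools that don't have all features enabled
zpool get all rpool | grep feature@    # per-feature state: disabled / enabled / active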
 
It was a 0.7 pool from the beginning. What I did, in order, was:
1. Import pool with -d
2. Scrub pool (No error found)
3. Export Pool
4. Reboot
 

very strange. if it occurs again, can you try the following in the grub rescue shell
Code:
cat (hd0,gpt2)/ROOT/pve-1/@/boot/grub/grub.cfg

if that prints an error, set up grub to use a serial console, and repeat the command after running
Code:
set debug=zfs

and post the resulting dump.
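switching grub to a serial console can be done from the grub command line itself, roughly like this (a sketch assuming serial unit 0 at 115200 baud; the rescue shell may need the corresponding modules loaded first):
Code:
serial --unit=0 --speed=115200    # talk to the first serial port
terminal_input serial             # use it for input
terminal_output serial            # and for output
set debug=zfs                     # enable ZFS debug output
cat (hd0,gpt2)/ROOT/pve-1/@/boot/grub/grub.cfg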
 
I was able to boot into a grub shell on the same server and run the command above, which gives some errors, but the server boots fine, so I'm not sure if this is related. You'll find the output attached.
 

Attachments

  • grub.txt
    17.8 KB · Views: 91
I'll see if I find some time next week to investigate this - thanks for the dump!
 
In my case, I cannot run a scrub because it is resilvering one of the disks. I tried rebooting, but to no avail. What should I do?
 
Hi, I did the same steps as ca_maer, no errors on the scrub, but when I rebooted I ran into another error (see attached).
Do you have any ideas?
Thank you
 

Attachments

  • error servidor mar 23 .jpg
    17.3 KB · Views: 68
I am also in the same boat: JBOD disk configuration with an HBA in IT mode. After a graceful shutdown, we have a grub error "no such device". No patches were applied beforehand and the shutdown was clean; the ZFS root had been functioning for a long time with no issues. After hitting this, I wish I had kept my old config of RAID1 + ext4 for root and the remainder for ZFS. I am not sure how to fix this either. I may just tediously back up all my data (I can still access the rpool), reformat, and restore, switching to the ext4 config mentioned before. As the pool is rather large, this is going to be painful. If someone has a quick fix, it would be great.
 
I'd suggest leaving the zpool as it is and just doing a clean install to ext4. From there you can use the zpool as storage, which will give you access to the VM files again. After transferring the PVE configuration files from the rootfs inside the zpool, you might be able to use your stuff as it was before. I'm not sure how clean this is, but I did it this way and I still like the thought of having the VM files on a zpool.
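Roughly how that could look after the fresh ext4 install (a sketch; dataset names assumed to match the old rpool layout, double-check paths before copying anything):
Code:
zpool import -N -R /mnt rpool      # import the old pool without mounting anything, altroot /mnt
zfs mount rpool/ROOT/pve-1         # mount just the old root filesystem, it ends up under /mnt
ls /mnt/var/lib/pve-cluster/       # on a standard PVE install the pmxcfs database (config.db) lives here
# copy the configuration you need from the old root before reusing the pool as VM storage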
 
