I managed to temporarily fix it after some detours (mostly due to my very slow IPMI). My issue turned out to be related to
org.zfsonlinux:large_dnode = active.
This is an old HP BL460c G6 with a P410i (not an ideal setup for ZFS). I set up two HW RAID0 arrays on the RAID card (because the P410i does not allow SATA-AHCI/JBOD mode) and did a standard ZFS RAID1 PVE install via the ISO in Nov. 2020. It was fine for a while, until last week when I rebooted the server and it dropped into grub rescue with the same behavior described in this post.
Initially I thought this was an old-HP-specific issue and tried the USB boot approach, but that didn't work (although I suspect my slow IPMI is to blame). So I googled similar grub failures on ZFS and found some posts citing the root cause: dnodesize in the zpool not being set to legacy, which makes grub fail to read the ZFS partition once large_dnode becomes active, resulting in "unknown filesystem":
https://lucatnt.com/2020/05/grub-unknown-filesystem-on-zfs-based-proxmox/
https://forum.proxmox.com/threads/grub-probe-error-unknown-filesystem.52436/
https://www.reddit.com/r/zfs/comments/g9mtll/linux_zfs_root_issue_grub2_hates_dnodesizeauto/
Then I checked the zpool on my broken server:
# zpool get feature@large_dnode
NAME   PROPERTY             VALUE   SOURCE
rpool  feature@large_dnode  active  local
And on another working install:
root@good-one:~# zpool get feature@large_dnode
NAME   PROPERTY             VALUE    SOURCE
rpool  feature@large_dnode  enabled  local
The ZFS documentation for large_dnode states:
This feature becomes active once a dataset contains an object with a dnode larger than 512B, which occurs as a result of setting the dnodesize dataset property to a value other than legacy. The feature will return to being enabled once all filesystems that have ever contained a dnode larger than 512B are destroyed.
So it seems the PVE installer sets dnodesize=auto, not legacy. By default large_dnode=enabled; at some point between my previous reboot and this one, a dataset got an object with a dnode larger than 512B, large_dnode flipped to active, and grub failed at this reboot. However, I cannot find the offending dataset with zfs get -r dnodesize rpool: all my datasets are legacy except the snapshots, so I cannot simply destroy a dataset to fix the issue (still searching on this one).
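For reference, here is a minimal sketch of the kind of check I mean, assuming your pool is named rpool; the -t flag just widens the search to volumes and snapshots as well:
Bash:
# list dnodesize for every filesystem, volume and snapshot in the pool,
# hiding the entries that are already legacy
zfs get -r -t filesystem,volume,snapshot dnodesize rpool | grep -v legacy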
So, following this, I patched grub to ignore the large_dnode feature when reading the pool. Here are my steps if anyone wants to build a patched grub:
Bash:
# build dependencies for the Proxmox grub package
apt install git build-essential quilt debhelper patchutils flex bison po-debconf help2man texinfo gcc-8-multilib xfonts-unifont libfreetype6-dev libdevmapper-dev libsdl1.2-dev xorriso parted libfuse-dev ttf-dejavu-core liblzma-dev mtools pkg-config libefiboot-dev libefivar-dev
# Proxmox's grub packaging repo
git clone git://git.proxmox.com/git/zfs-grub.git
cd zfs-grub
# the ignore-large-dnode patch from the GRUB bug tracker
wget https://savannah.gnu.org/bugs/download.php?file_id=45313 -O pvepatches/ignore-large-dnode.patch
# register the patch so the build applies it
echo "ignore-large-dnode.patch" >> pvepatches/series
# build the .deb packages, skipping the test suite
DEB_BUILD_OPTIONS=nocheck make deb
Then, on the server with the issue, use apt list --installed | grep grub to see the list of packages you need to upload and replace, then run dpkg -i * and update-grub. After that, reboot should work.
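A minimal sketch of that install step, assuming the patched .debs were copied into an otherwise empty directory on the broken server (exact package names depend on your grub version):
Bash:
# check which grub packages are currently installed; these are the ones to replace
apt list --installed | grep grub
# install the patched packages that were copied into the current directory
dpkg -i *.deb
# regenerate the grub configuration with the patched tools
update-grub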
Note 1: make sure your rpool/ROOT/pve-1 shows dnodesize=legacy in the output of zfs get -r dnodesize rpool before applying the patch.
Note 2: This is a temporary fix. For a more permanent fix, either find a way to bring large_dnode back to enabled, or switch from grub to systemd-boot.
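If you at least want to stop new large dnodes from being created, here is a sketch of the idea (assuming the pool is named rpool; per the ZFS doc quoted above, this alone does not turn the feature back from active to enabled, which only happens once every dataset that ever held a >512B dnode is destroyed):
Bash:
# force new objects back to 512B dnodes; child datasets inherit this,
# but existing large dnodes (and snapshots holding them) are unaffected
zfs set dnodesize=legacy rpool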