GRUB error on reboot - device not found

I bet setting
GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=1"
in /etc/default/grub
and then running update-grub (or adding rootdelay=1 to the linux line of /boot/grub/grub.cfg)

would fix everything. The kernels in PVE 5.x try to load vmlinuz-*-pve faster than 4.x did, and faster than the devices might be ready. I hit this with an NVMe device.
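For reference, a minimal sketch of that change from a working (or chrooted) shell; your existing GRUB_CMDLINE_LINUX_DEFAULT contents may differ, so merge rather than overwrite:

Code:
nano /etc/default/grub                 # set GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=1"
update-grub                            # regenerate /boot/grub/grub.cfg from the new defaults
grep rootdelay /boot/grub/grub.cfg     # verify the parameter landed on the linux lines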

Edit: oops, this is for the OP, whose boot loader worked but whose grub couldn't find vmlinuz-*-pve. Cases with "no OS found" are a boot loader issue, often GPT plus legacy BIOS, most commonly seen as "I did a full clean install, but the reboot failed to find an OS". In those cases, turn off legacy boot and let the BIOS do the uEFI thing. (Sometimes it will then fail to find the cdrom image on the USB stick, but that is all for another thread.)
 
  • pveversion -v
    • => 4.x, cannot check at the moment since I booted the Proxmox Debug Console
  • zpool layout
    • => 2 drives as ZFS mirror
  • disks as seen by grub ('ls', and 'ls (hdX,gptY)' for all X and Y)
    • => all of them report "unknown filesystem" except (hd0,gpt2) and (hd1,gpt2)
      • booting from (hd0,gpt2) => checksum verification failed
  • variables set for grub ('set')
    • like the ones above from euant
  • is the pool importable when booting from a live-CD?
    • yes, backing up the image at the moment
  • any error messages
    • except for the grub message none
At the moment I am running a "zpool scrub rpool", which takes about 10h. Any other ideas what I can do to fix the boot issue?
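While it runs, the scrub progress can be checked with:

Code:
zpool status -v rpool    # shows scrub progress and any errors found so far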

if you import the pool, bind mount /dev, /proc and /sys into it and then chroot (rough sketch below), what do
  • grub-probe /
  • grub-probe -vvvv /
  • update-grub
  • grub-install /dev/sda
  • grub-install /dev/sdb
report?

You should also be able to collect 'pveversion -v' that way.
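A rough sketch of that setup from a live environment, assuming the pool imports as rpool and the root dataset is rpool/ROOT/pve-1:

Code:
zpool import -f -R /mnt rpool        # import with an alternate root so the datasets mount under /mnt
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt /bin/bash
# then run grub-probe /, update-grub, grub-install /dev/sdX and pveversion -v inside the chroot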
 
Thanks @fabian for your answer but I did a clean install yesterday. Do you have any hunch what the cause of the non-booting grub might be?
 
Activation of some zpool feature that grub does not support, a corrupt BIOS boot partition (or one overwritten with an old grub version), or a BIOS/disk controller lying about disk sizes. But the first and last (usually) lead to different error messages. A wrong/broken/corrupt grub stage1 in the BIOS boot partition should be fixed by re-running grub-install.
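For completeness, re-running it from the chrooted system might look like this (device names are examples; point grub-install at the whole disks, not the partitions):

Code:
grub-install /dev/sda      # reinstall the stage1/core image on the first mirror member
grub-install /dev/sdb      # ...and on the second, so either disk can boot
grub-install --version     # confirm which grub build you are installing from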
 
@fabian After the system upgrade from Proxmox 5.0 to 5.1, one of 4 nodes can't boot:
Code:
error: no such device xxxxxxx
error: unknown filesystem
Entering rescue mode ...

Updating grub (grub-probe, update-grub, grub-install) via chroot after booting from a USB stick doesn't help. This node has been running from a hand-made USB boot stick for one month.
  • pveversion -v
Code:
proxmox-ve: 5.1-26 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-26
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
  • zpool layout: zpool on HW RAID1
  • disks as seen by grub: (hd0) (hd0,gpt9) (hd0,gpt2) (hd0,gpt1)
  • variables -> cannot show (we're reinstalling the node and migrating to Ceph)
  • We can import/export/scrub the pool via the Proxmox installer without problem.
HW Info: HP microserver gen 10, AMD Opteron(tm) X3421 APU, RAM: 16 GB ECC, RAID: Fujitsu D2607-A1, HDDs x2 10TB - ST10000NM0016
 
> We can import/export/scrub the pool via the Proxmox installer without problem.

Sorry, my note above wasn't clear: use the emergency/rescue boot of the PVE install ISO, or your USB stick, to get the box up (or run "zpool import" and then "exit" to continue the stuck boot). Then:

Code:
zpool remove rpool sdc1 # simplify pool design, add back when auto importing is working
zpool remove rpool sdc2 # simplify pool design, add back when auto importing is working
zpool scrub rpool # check the pool for errors; not strictly required for booting, but not a bad idea anyway
zpool get bootfs rpool # verify it returns rpool/ROOT/pve-1
update-initramfs -u # to commit /etc/zfs/zpool.cache to /boot/initrd.img*
vi /etc/default/grub # add "rootdelay=1" in GRUB_CMDLINE_LINUX_DEFAULT
update-grub # to commit changes to /boot/grub/grub.cfg
grub-probe / # verify it returns zfs
grub-install /dev/sda
grub-install /dev/sdb
reboot # test it :)
 
Maybe it has something to do with the RAID controller of the HP microserver, even though we use the drives as SATA AHCI ones. I have never had this problem with other server hardware (e.g. Supermicro without a RAID controller) so far.
 
Seems likely, since that seems to be a common factor.
 
@Rob Loan I am sorry, my English is really poor.
* At first we tried to boot a kernel via grub rescue, but it doesn't work -> ls (hd0,gpt1), ls (hd0,gpt2) and ls (hd0,gpt9) all show "unknown filesystem"
* Second: we tried to repair grub via the PVE install ISO -> chroot -> mount /dev, /proc, /sys etc. -> update-initramfs, update-grub, grub-install -> it doesn't work, but the pool was OK
* After that we got the node up via USB stick and tried again to repair the grub loader -> it doesn't work, but the node runs and the pool is OK
 
I do not mean to hijack this thread, but there was no solution posted yet, and I seem to be suffering from the same problem.

The system I am using is a Dell PowerEdge R530 server.
After rebooting for a kernel upgrade it took around 1-2 minutes to reach the
Code:
grub>
prompt.
There was no error message, it just displayed this prompt instead of the regular menu.

Trying the recovery option of the PVE installer again took 1-2 minutes to execute, but it failed with
Code:
error: attempt to read or write outside of disk 'hd2'
Press any key to continue
(then, without pressing a key, after a couple seconds it re-displayed the boot menu of the installer)

After successfully booting from an external device (kernel and initrd on a USB stick) I executed the requested commands: https://pastebin.com/1zadrmQA

After rebooting again I get dropped to the "grub rescue>" prompt (note that before it was the plain "grub>" prompt), and running "insmod normal" fails again with the outside-of-disk error.

I am now only able to boot into the system with an external grub and an external kernel disk.
 
That error message indicates that your motherboard (or SCSI controller) lies about the disk size. You will need to boot from a small external boot device until PVE supports ZFS with UEFI boot (or play the lottery on every kernel and grub update and hope that the grub files, kernel and initrd end up within the wrong limits imposed by your hardware).
 
Hi All. Thanks very much to everyone for making and supporting proxmox. I'm still pretty new to proxmox and I'm very impressed with it, but I've encountered three problems in my first two weeks that make me begin to wonder if this is really the right solution for me.

The first problem (which occurred last week) was an apparent inability to change the node's ip address (see previous posts to this forum on that topic). I haven't figured out how to solve that, but it's moot for me now because the node is back in its original network.

The second problem (occurred this past weekend) was that the node could not perform automatic updates (I saw error messages in the web interface indicating this and solved it by changing the /etc/apt/sources.list.d/ file to point to the non-enterprise non-subscription sources).

But fixing the second problem seems to have created the third problem I've encountered (occurred today) which is the exact same problem as the original poster (euant) in this thread had a month ago. Their screenshot looks identical to mine except that (to be expected, I know) my volume ID or disk ID is different than theirs.

My hardware is an HP ProLiant ML10 v2 with 5 physical disks (each 4TB total space as advertised) connected to the HP Smart Array B120i RAID controller set to AHCI mode in the BIOS. The proxmox 5.1 install image worked beautifully from a USB stick and seems to have created a RAID array and logical volumes with ZFS all very much to my satisfaction and with minimal time and effort during installation. The install was truly a quick, simple, and problem-free process. But with the first upgrade, I can now no longer boot from either the hard drive or the install image (rescue mode) on the USB stick.

I managed to successfully reboot this server at least 8 times between original installation and encountering this third problem, but nightly automatic updates (apt-get update) were failing before because the sources were pointing only to the enterprise update archives, and I have not yet subscribed. So as soon as I fixed that problem (apt-get update failing), the node apparently updated itself, and in the process, broke itself so it no longer boots.

I'd like to try booting from lankaster's boot stick that euant used successfully, but I don't see an image to write to my USB stick?

And of course ideally, I'd like to again be able to boot from the hard drives.

Since there seem to be several people experiencing this same problem, is there anything I can do to help find the root cause so that someone can fix it?

Thank you fabian for asking euant for some details. For me, the output of "set" in the grub rescue shell is identical to that of euant:

Code:
> set
cmdpath=(hd0)
prefix=(hd0)/ROOT/pve-1@/boot/grub
root=hd0

And the output of "ls" in the grub rescue shell is also similar, but I have 5 identical physical disks, so:

Code:
> ls
(hdX)  (hdX,gpt9)  (hdX,gpt2)  (hdX,gpt1)...
where X=0,1,2,3,4

Not being able to boot into the node, I don't know how to give you my ZFS pool layout.

Thanks for any suggestions on how to resolve, and I'll be happy to try anything else to help troubleshoot the root cause.

-Kevin
 
GRUB2 gets confused. While legacy boot (grub-install /dev/sda) seems simple, it isn't when it doesn't work. A good read is http://www.rodsbooks.com/efi-bootloaders/

When I had an issue with an NVMe ZFS boot device, I switched to uEFI boot and used http://www.rodsbooks.com/efi-bootloaders/refit.html as a boot manager and then grub's boot loader. I can only hope that showing what I did might help your case.

I see three ways ZFS is partitioned:

Code:
root@pve1:/etc/pve/rob# fdisk -l
Disk /dev/nvme0n1: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 16A405B5-D719-41F0-81CB-2BEE62B7FF91

Device              Start        End    Sectors   Size Type
/dev/nvme0n1p1         34       2047       2014  1007K BIOS boot
/dev/nvme0n1p2       2048 1953508749 1953506702 931.5G Solaris /usr & Apple ZFS
/dev/nvme0n1p9 1953508750 1953525134      16385     8M Solaris reserved 1

root@pve2:~# fdisk -l
Disk /dev/nvme0n1: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000

root@pvebak:~# fdisk -l 
Disk /dev/sda: 55.9 GiB, 60022480896 bytes, 117231408 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 19148AE5-C2D0-4860-9E4B-E4EF88E4510B

Device         Start       End   Sectors  Size Type
/dev/sda1         34      2047      2014 1007K BIOS boot
/dev/sda2       2048 117214989 117212942 55.9G Solaris /usr & Apple ZFS
/dev/sda9  117214990 117231374     16385    8M Solaris reserved 1

Disk /dev/sdb: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 6B235178-354D-B741-8335-C2FF49FFA6AF

Device           Start         End     Sectors  Size Type
/dev/sdb1         2048 11721027583 11721025536  5.5T Solaris /usr & Apple ZFS
/dev/sdb9  11721027584 11721043967       16384    8M Solaris reserved 1

The top case, pve1, was converted from legacy to uEFI while the other two are booting legacy. Note that sdb on pvebak has no BIOS boot partition, but its first partition starts at sector 2048, so there is still room at the beginning of the disk for `grub-install /dev/sdb` even though there is no partition for it.

Make the legacy BIOS boot partition into a tiny uEFI system partition:

Code:
root@pve1# mkdosfs -F 32 -n EFI /dev/nvme0n1p1
root@pve1# mount /boot/efi
root@pve1# cd /boot/efi
# then install rEFInd

That got me a boot manager, but since I'm always loading Proxmox I wanted a hard-coded boot loader:

Code:
root@pve1# cp /boot/grub/x86_64-efi/grub.efi /boot/efi/EFI/proxmox/grubx64.efi
root@pve1# grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck --no-floppy
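As an optional sanity check (assuming the efibootmgr package is installed), you can confirm that the firmware actually got a "proxmox" boot entry:

Code:
root@pve1# efibootmgr -v    # list UEFI boot entries; a "proxmox" entry should point at \EFI\proxmox\grubx64.efi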

and it boots straight into Proxmox. For reference, this is the mess I have in the tiny FAT uEFI partition:

Code:
root@pve1:/boot/grub# find /boot/efi -print
/boot/efi
/boot/efi/EFI
/boot/efi/EFI/proxmox
/boot/efi/EFI/proxmox/grubx64.efi
/boot/efi/EFI/BOOT
/boot/efi/EFI/BOOT/BOOT.CSV
/boot/efi/EFI/BOOT/BOOTx64.EFI
/boot/efi/EFI/BOOT/refind.conf
/boot/efi/EFI/BOOT/icons
/boot/efi/EFI/BOOT/icons/arrow_left.png
/boot/efi/EFI/BOOT/icons/arrow_right.png
/boot/efi/EFI/BOOT/icons/boot_linux.png
/boot/efi/EFI/BOOT/icons/boot_win.png
/boot/efi/EFI/BOOT/icons/func_about.png
/boot/efi/EFI/BOOT/icons/func_csr_rotate.png
/boot/efi/EFI/BOOT/icons/func_exit.png
/boot/efi/EFI/BOOT/icons/func_firmware.png
/boot/efi/EFI/BOOT/icons/func_hidden.png
/boot/efi/EFI/BOOT/icons/func_reset.png
/boot/efi/EFI/BOOT/icons/func_shutdown.png
/boot/efi/EFI/BOOT/icons/mouse.png
/boot/efi/EFI/BOOT/icons/os_debian.png
/boot/efi/EFI/BOOT/icons/os_proxmox.png
/boot/efi/EFI/BOOT/icons/README
/boot/efi/EFI/BOOT/icons/tool_apple_rescue.png
/boot/efi/EFI/BOOT/icons/tool_fwupdate.png
/boot/efi/EFI/BOOT/icons/tool_memtest.png
/boot/efi/EFI/BOOT/icons/tool_mok_tool.png
/boot/efi/EFI/BOOT/icons/tool_netboot.png
/boot/efi/EFI/BOOT/icons/tool_part.png
/boot/efi/EFI/BOOT/icons/tool_rescue.png
/boot/efi/EFI/BOOT/icons/tool_shell.png
/boot/efi/EFI/BOOT/icons/tool_windows_rescue.png
/boot/efi/EFI/BOOT/icons/transparent.png
/boot/efi/EFI/BOOT/icons/vol_external.png
/boot/efi/EFI/BOOT/icons/vol_internal.png
/boot/efi/EFI/BOOT/icons/vol_net.png
/boot/efi/EFI/BOOT/icons/vol_optical.png
/boot/efi/EFI/BOOT/keys
/boot/efi/EFI/BOOT/keys/altlinux.cer
/boot/efi/EFI/BOOT/keys/canonical-uefi-ca.der
/boot/efi/EFI/BOOT/keys/centos.cer
/boot/efi/EFI/BOOT/keys/fedora-ca.cer
/boot/efi/EFI/BOOT/keys/microsoft-kekca-public.der
/boot/efi/EFI/BOOT/keys/microsoft-pca-public.der
/boot/efi/EFI/BOOT/keys/microsoft-uefica-public.der
/boot/efi/EFI/BOOT/keys/openSUSE-UEFI-CA-Certificate-4096.cer
/boot/efi/EFI/BOOT/keys/openSUSE-UEFI-CA-Certificate.cer
/boot/efi/EFI/BOOT/keys/refind.cer
/boot/efi/EFI/BOOT/keys/refind_local.cer
/boot/efi/EFI/BOOT/keys/refind_local.crt
/boot/efi/EFI/BOOT/keys/SLES-UEFI-CA-Certificate.cer

None of this is Proxmox's or even Linux's fault, but rather motherboard or disk controller BIOS errors.
 
1. Extract proxmoxusbboot.dd.gz with gunzip/7zip etc.
2. Write proxmoxusbboot.dd to the USB stick with Linux dd or Windows Win32 Disk Imager
3. Try to boot from the USB stick
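For steps 1 and 2 on Linux, the commands might look like this (/dev/sdX is a placeholder for your USB stick; double-check it, since dd overwrites the target):

Code:
gunzip proxmoxusbboot.dd.gz
dd if=proxmoxusbboot.dd of=/dev/sdX bs=4M status=progress conv=fsync   # /dev/sdX is hypothetical, adjust!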

Thanks for your several contributions to this thread, lankaster.

Where can I find your proxmoxusbboot.dd.gz ?

Once I can boot into my node, I'll be sure to follow your guidance on creating a bootable USB with the Proxmox kernels using your small script too, but at this point I can't even boot into the node, and it's the only machine I have with Proxmox installed on it.
 
Wow Rob, thanks so much for the detailed follow-up! I really appreciate it! I agree that Rod Smith's articles on booting are extremely helpful, and I've used rEFIt in the past with other machines.

But after having invested something like 30 hours creating and configuring virtual machines on my node, I'm not yet ready to modify my installed hard drives. All the partitioning on each of my 5 HDDs was done by the Proxmox installer, and I'm thinking that manually modifying one or more of them would most likely break the node even worse than just having the boot problem I have now.

I'm going to try booting into my node from a USB stick first so I can make backups of these virtual machines, but I'll definitely try your suggestion after I have backups.

Thanks again!

-Kevin
 
Is there any news on that? I'm having the same problem on an HP DL360 G7 with Proxmox 4.x (latest updates) and RAIDZ1 after a crash. The disks are configured as one RAID0 array each on a P410i controller and handed to ZFS like that (/dev/sda, /dev/sdb etc.). This had worked fine for a long time, but after the last crash and reboot it does not come up any more. It shows

Code:
error: no such device xxxxxxx
error: unknown filesystem
Entering rescue mode ...

I already booted with different rescue disks (PVE, Ubuntu) and the zpool is OK. All disks are fine, and all data seems to be in place. I also compared /etc/default/grub and /boot with working systems and they look fine. I reinstalled grub on all disks multiple times as well, without success. Is there any other chance to get this working again, or do I need to reinstall the whole system?
 
Interesting: using an older Proxmox 4.4 ISO and rescue boot does not work, it cannot find rpool. If I boot Ubuntu 17.10 and do a "zpool import rpool", everything works fine.
 
  • tried downgrading grub, without success
  • tried recreating /etc/zfs/zpool.cache and updating the initramfs (see the sketch below), without success
  • scrubbed rpool without any error
  • reinstalled grub on all devices afterwards
  • tried to boot from all disks by setting the boot disks
  • the error is still the same
Any other ideas except reinstalling?
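For anyone reading along, recreating the cache file and committing it to the initramfs is roughly:

Code:
zpool set cachefile=/etc/zfs/zpool.cache rpool    # regenerate the cache file for the imported pool
update-initramfs -u -k all                        # rebuild the initramfs so it picks up the fresh cache file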
 

Use a separate /boot partition that is not on ZFS.
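A very rough sketch of what that could look like, assuming /dev/sdc1 is a small spare partition you can dedicate to /boot (device names are placeholders; back everything up first):

Code:
mkfs.ext4 -L boot /dev/sdc1
mkdir /mnt/newboot
mount /dev/sdc1 /mnt/newboot
cp -a /boot/. /mnt/newboot/                     # copy the current kernels, initrds and grub files
echo "LABEL=boot /boot ext4 defaults 0 2" >> /etc/fstab
umount /mnt/newboot
mount /boot                                     # the ext4 partition now shadows the old /boot on ZFS
update-grub
grub-install /dev/sdc                           # reinstall grub so its core image reads /boot from ext4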
 
Thank you. Unfortunately I have no spare disk slots or space on the existing disks to set up a separate boot partition, so I'll back up the data now with zfs send. Does the recommendation to have a non-ZFS /boot partition mean that it is no longer recommended to boot from ZFS at all? Then I would take two disks out of the zpool and set them up as non-ZFS system disks.
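For the backup step, a full send of the pool to an external target might look something like this (the target path is a placeholder):

Code:
zfs snapshot -r rpool@backup                                          # recursive snapshot of every dataset
zfs send -R rpool@backup | gzip > /mnt/external/rpool-backup.zfs.gz   # replication stream written to a file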
 
