GRUB error on reboot - device not found

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
5,025
821
163
Thank you. Unfortunately I have no spare disk slots or space on existing disks to set up a separate boot partition. I'll back up the data now with zfs send. Does the recommendation to have a non-ZFS /boot partition mean that booting from ZFS is no longer recommended? Then I would take two disks out of the zpool and set them up as non-ZFS system disks.

there are broken systems where booting is not possible or not stable - most notably those involving hardware RAID controllers.
 

tomte76

New Member
Mar 6, 2015
20
0
1
ok. I'm wondering why it was possible to boot the machine at least 7 times after updates (according to our system documentation) since it was installed, but now it won't boot up again and seems to remain "unfixable". When I do a "set debug=all" in GRUB and then try e.g. an "ls (hd0)", I can see that GRUB is searching for ZFS. It probes for labels 4 times (vdev_disk_read_rootlabel) and then fails without further debug output. I am afraid to reboot the other servers to compare the behaviour, in case they also won't come up again.
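For reference, this is roughly what I am typing at the GRUB prompt - nothing exotic, just the built-in commands:

Code:
set debug=all
ls
ls (hd0)
set

"set" without arguments simply prints the current GRUB variables (prefix, root), which helps to see which device GRUB thinks it should boot from.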
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
5,025
821
163
ok. I'm wondering why it was possible to boot the machine at least 7 times after updates (according to our system documentation) since it was installed, but now it won't boot up again and seems to remain "unfixable". When I do a "set debug=all" in GRUB and then try e.g. an "ls (hd0)", I can see that GRUB is searching for ZFS. It probes for labels 4 times (vdev_disk_read_rootlabel) and then fails without further debug output. I am afraid to reboot the other servers to compare the behaviour, in case they also won't come up again.

I unfortunately haven't had the chance to debug that specific issue yet, as we don't have any affected hardware in our lab.
 

tomte76

New Member
Mar 6, 2015
20
0
1
I meanwhile zeroed all disks on the server and installed the latest BIOS, RAID controller firmware and disk firmware. Then I did a clean install with the latest Proxmox 5.1 ISO. The install works fine, but the system is not able to boot. Same error as before. Are you interested in debugging this issue? I could provide any information you want, or we could even discuss giving you iLO access so you can take a look.
 

tom

Proxmox Staff Member
Staff member
Aug 29, 2006
14,757
678
133
I meanwhile zeroed all disks on the server and installed the latest BIOS, RAID controller firmware and disk firmware. Then I did a clean install with the latest Proxmox 5.1 ISO. The install works fine, but the system is not able to boot. Same error as before. Are you interested in debugging this issue? I could provide any information you want, or we could even discuss giving you iLO access so you can take a look.

You are using an HP P410 RAID controller? I highly recommend not using a RAID controller with ZFS - just go with a simple HBA or, if available, connect your disks directly to the motherboard SATA connectors.

ZFS on top of RAID controllers is not a supported setup.
 

tomte76

New Member
Mar 6, 2015
20
0
1
HP P410i. The disks are each exported as a single-disk RAID0, because the P410i is not able to do JBOD. I now understand that this is not supported, but I am still wondering why it worked for a long time and only stopped working now. I need to check whether it is possible to connect the backplane to another HBA or to the internal connectors. The drives are 2.5-inch SAS 15K, 8 drives in each server. Maybe I'll just install a USB key or SD card to boot from.

I meanwhile tried installing from different ISO images, down to Proxmox 4.0, and I am not able to boot. This is very strange, as I can say for sure that I installed these boxes with the Proxmox 4.1 ISO years ago and it worked fine.
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
5,025
821
163
HP P410i. The disks are each exported as a single-disk RAID0, because the P410i is not able to do JBOD. I now understand that this is not supported, but I am still wondering why it worked for a long time and only stopped working now. I need to check whether it is possible to connect the backplane to another HBA or to the internal connectors. The drives are 2.5-inch SAS 15K, 8 drives in each server. Maybe I'll just install a USB key or SD card to boot from.

I meanwhile tried installing from different ISO images, down to Proxmox 4.0, and I am not able to boot. This is very strange, as I can say for sure that I installed these boxes with the Proxmox 4.1 ISO years ago and it worked fine.

with such issues, it is often up to chance whether GRUB finds all the data it needs. ZFS rewrites things a lot (it's a CoW filesystem, after all): if you are lucky, all the parts are readable and booting is no problem; if you are unlucky, something is missing and you trigger various stages of failure.
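if you want to check from a rescue system whether the on-disk ZFS labels that GRUB probes are actually intact, something along these lines should work (the device path is just an example):

Code:
# dump the ZFS vdev labels of one of the pool members
zdb -l /dev/sda3
# list pools that are visible/importable from the rescue environment
zpool import

if zdb shows all four labels but GRUB still fails, the problem is more likely on the controller/firmware side than in the pool itself.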
 

tomte76

New Member
Mar 6, 2015
20
0
1
Yes, I understand. Thank you. So the interest in debugging the issue is low. I did one last Proxmox 5.1 installation after wiping all disks in a separate system without a RAID controller, to ensure all blocks were wiped. Still the same behaviour. I'll attach two screenshots from GRUB with debug=all; maybe this helps.

Attachments: Screenshot from 2018-04-27 11-41-00.png, Screenshot from 2018-04-27 11-46-31.png

I'll try to find out why GRUB fails to detect ZFS, as it seems to probe the labels fine and then fails more or less silently. But I understand that we need to find another solution for production - a new HBA, booting from SD card, or buying new servers without RAID controllers for reliable operation.
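One thing I plan to try from a chroot of the installed system is to ask GRUB's userspace tools what they detect, roughly like this (device names are examples from this box):

Code:
# inside a chroot of the installed system
grub-probe --target=fs /boot        # filesystem type GRUB's tools detect for /boot
grub-probe --target=device /boot    # the device they map it to
grub-install /dev/sda               # reinstall the boot loader on the first disk
update-grub                         # regenerate the GRUB configuration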
 

tomte76

New Member
Mar 6, 2015
20
0
1
Additional research substantiated the assumption that the problem is related to the disk geometry. A closer look at the RAID parameters with HP SmartStart showed that the created RAID0 arrays and logical drives have a Sectors/Track setting of 32. I checked some plain SATA disks and found a setting of 63 Sectors/Track on all samples, so I recreated the arrays/logical drives with 63 Sectors/Track. The second point is the stripe size of 256KB on the RAID0. As it is not possible to disable striping entirely on a single-disk RAID0, I set it to 128KB, which is a recommended size I found for normal SATA disk layouts. I also globally disabled the "Array Accelerator", which is the P410i's cache magic, and I disabled the internal drive cache of the disks - as far as I know, ZFS will also do this on a bare drive if possible. "Magically" the size of the disk exposed to the OS changed by about 30MB and is now significantly closer to the size I see with a dumb SAS HBA. I assume the missing bytes are used by the RAID controller to store its array metadata.
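Roughly, the reconfiguration boiled down to invocations like the following (slot and drive addresses are examples from this box, and the exact option names may differ between hpssacli and firmware versions):

Code:
# show the current logical drive geometry (Sectors/Track, stripe size)
hpssacli ctrl slot=0 ld all show detail
# recreate a single-disk RAID0 logical drive with 63 sectors/track and 128KB stripe size
hpssacli ctrl slot=0 create type=ld drives=1I:1:1 raid=0 sectors=63 ss=128
# disable the "Array Accelerator" (controller cache) for the logical drive
hpssacli ctrl slot=0 ld 1 modify arrayaccelerator=disable
# disable the write cache of the physical drives
hpssacli ctrl slot=0 modify drivewritecache=disable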

Afterwards I installed Proxmox 5.1 and the system is able to boot up. I rebooted 3 times and then installed the latest updates. I was still able to boot after the installation, for another 10 reboots. I wrote some 100GB to the pool and created and deleted some VMs. I did some more reboots and everything works fine at the moment. Speed is not much lower than before with the "Array Accelerator" enabled. Now I'll zfs receive the backed-up data and restore the VMs.
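The backup and restore itself is plain zfs send/receive; roughly like this (dataset, snapshot and host names are examples):

Code:
# backup: recursive snapshot, streamed to a backup host
zfs snapshot -r rpool/data@backup
zfs send -R rpool/data@backup | ssh backuphost zfs receive -u backuppool/data
# restore after reinstalling: stream it back into a fresh dataset
ssh backuphost zfs send -R backuppool/data@backup | zfs receive -u rpool/restored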

It is still unclear why the system stopped booting up, and also why the reinstallation with the PVE 4.1 ISO used before did not result in a running system. I meanwhile found out that since the last server reboot, 2 disks had been replaced using hpssacli and zpool replace without rebooting the server. So maybe one of these disks got a different configuration regarding Sectors/Track, stripe size or something else.
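For context, the online replacement itself is straightforward once the new drive is exposed as a logical drive; the device names below are just placeholders:

Code:
# after exposing the new physical drive as a single-disk RAID0 via hpssacli,
# replace the old vdev in the pool and watch the resilver
zpool replace rpool /dev/disk/by-id/OLD-DISK /dev/disk/by-id/NEW-DISK
zpool status rpool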
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
5,025
821
163
Additional research substantiated the assumption that the problem is related to the disk geometry. A closer look at the RAID parameters with HP SmartStart showed that the created RAID0 arrays and logical drives have a Sectors/Track setting of 32. I checked some plain SATA disks and found a setting of 63 Sectors/Track on all samples, so I recreated the arrays/logical drives with 63 Sectors/Track. The second point is the stripe size of 256KB on the RAID0. As it is not possible to disable striping entirely on a single-disk RAID0, I set it to 128KB, which is a recommended size I found for normal SATA disk layouts. I also globally disabled the "Array Accelerator", which is the P410i's cache magic, and I disabled the internal drive cache of the disks - as far as I know, ZFS will also do this on a bare drive if possible. "Magically" the size of the disk exposed to the OS changed by about 30MB and is now significantly closer to the size I see with a dumb SAS HBA. I assume the missing bytes are used by the RAID controller to store its array metadata.

Afterwards I installed Proxmox 5.1 and the system is able to boot up. I rebooted 3 times and then installed the latest updates. I was still able to boot after the installation, for another 10 reboots. I wrote some 100GB to the pool and created and deleted some VMs. I did some more reboots and everything works fine at the moment. Speed is not much lower than before with the "Array Accelerator" enabled. Now I'll zfs receive the backed-up data and restore the VMs.

I hope it runs more stably now - but please be aware that there is still no guarantee at all that this won't break again in X days/weeks/months.

It is still unclear why the system stopped booting up, and also why the reinstallation with the PVE 4.1 ISO used before did not result in a running system. I meanwhile found out that since the last server reboot, 2 disks had been replaced using hpssacli and zpool replace without rebooting the server. So maybe one of these disks got a different configuration regarding Sectors/Track, stripe size or something else.

likely the RAID controller overwrote some area where ZFS had something important, or ZFS wrote to some area that the controller does not (correctly) expose while in boot mode.
 
Jan 13, 2021
1
0
1
39
We experienced the same error as in the initial post today.

The server was cleanly shut down via the Proxmox web GUI, and all VMs and containers were off beforehand. Trying to boot up again, we were thrown into the GRUB rescue shell with "device not found". The Proxmox rescue boot also did not succeed: "no such device: rpool".

The solution in our case was simple:

We used the ZFS fork of SystemRescueCd (https://github.com/nchevsky/systemrescue-zfs/releases/tag/v7.01+2.0.0) for the following commands:

Code:
mkdir /mnt/rescue
zpool import -f -R /mnt/rescue rpool
for dir in sys proc dev; do mount --bind /$dir /mnt/rescue/$dir; done
chroot /mnt/rescue
zpool status
exit
for dir in sys proc dev; do umount /mnt/rescue/$dir; done
zpool export rpool
Based on: https://rageek.wordpress.com/2015/07/06/zfs-on-linux-emergency-boot-cd/

Nothing else was done to rescue the system. Our best guess at the moment is that the zpool export was the key to the solution, as it might have fixed some inconsistencies that prevented Proxmox from booting.

Exporting a pool writes all unwritten data to the pool and removes all information about the pool from the source system.
Source: https://www.thegeekdiary.com/zfs-tutorials-creating-zfs-pools-and-file-systems/
 
