[SOLVED] Grub boot ZFS problem

Vladimir Bulgaru

Well-Known Member
Jun 1, 2019
Moscow, Russia
Hey guys!

I'm asking about the "Grub boot ZFS problem" described in the wiki: https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks#Grub_boot_ZFS_problem

I have a couple of Dell R620s. Some boot normally and some don't, even though the setup is identical. Has anybody managed to get to the root of this issue?

Details:
The server initialises all the devices and firmware and reaches the point where it would normally boot into the OS. In my case it hangs there for 15 minutes or so and then, most interestingly, boots normally (no errors or ZFS pool issues). The filesystem is ZFS in RAID10 on 4 SAS drives.

What I tested:
  1. Changed the RAID controller (my 4 drives are each in RAID0)
  2. Changed the hard drives
  3. Tested the Proxmox OS on an ext4 fs (no issues and boots normally)
  4. Updated the firmware (iDRAC)
  5. Did an iDRAC reset
  6. Restored the iDRAC to factory settings
  7. Added rootdelay to the GRUB kernel command line
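For anyone repeating step 7, here is a minimal sketch of what adding rootdelay amounts to. It assumes the usual Debian-style layout where the kernel command line is set via GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. The function name add_kernel_param and the value rootdelay=10 are illustrative choices, not something taken from this thread. The function only transforms text, so writing the file back and regenerating the GRUB config (update-grub on Debian-based systems) remains a manual step.

```python
import re


def add_kernel_param(grub_text: str, param: str = "rootdelay=10") -> str:
    """Return grub_text with param set in GRUB_CMDLINE_LINUX_DEFAULT.

    Hypothetical helper: it works on the text of /etc/default/grub and
    does not touch any file itself.
    """
    key = param.split("=", 1)[0]
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def _patch(match: re.Match) -> str:
        args = match.group(2).split()
        # Drop any earlier value of the same parameter to avoid duplicates.
        args = [a for a in args if a.split("=", 1)[0] != key]
        args.append(param)
        return match.group(1) + " ".join(args) + match.group(3)

    return pattern.sub(_patch, grub_text)


if __name__ == "__main__":
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet"\n'
    print(add_kernel_param(sample))
```

Note that in this thread rootdelay did not help, since the delay turned out to have a different cause.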
The problem description in the wiki states:
  • Symptoms: stuck at boot with an blinking prompt.
  • Reason: If you ZFS raid it could happen that your mainboard does not initial all your disks correctly and Grub will wait for all RAID disk members - and fails. It can happen with more than 2 disks in ZFS RAID configuration - we saw this on some boards with ZFS RAID-0/RAID-10
I was wondering if there are ways to mitigate the issue other than installing Proxmox on a separate drive with an ext4 filesystem.
 
Hi,
I have some R620s too, but with only two disks running ZFS behind a PERC (as RAID-0 constructs).
None of them has any boot trouble... but I see a similar effect on an R410 with SATA enabled in the BIOS. After disabling SATA in the BIOS, the boot went fine, without the 10 minutes of dead time.

Udo
 
Thank you @udo

To be honest, I'm not sure where to start digging. I have 4 Dell R620s. All have an identical setup and identical hardware (give or take). The drives are from the same batch. The RAID controllers are of the same type. The BIOS settings are identical, and so are the iDRAC settings. The weird part: I had the long-boot issue on machine #3. After I updated the Lifecycle Controller to the latest version, machine #3 started to boot quickly, while machine #1 STARTED booting slowly. Moreover, this happens with the ZFS filesystem only. Every machine works fine on ext4 (how so?)

I will try to get to the physical location in a couple of weeks and play around with resets and hard power-offs. I hope it's a weird hardware glitch, but I doubt it. Do you have any 12th-generation Dells running Proxmox on ZFS?
 
Ok, leaving this for posterity. No, the issue was not related to the BIOS, hardware, firmware, etc.
It turns out that an external HDD connected to the server via USB was causing the issue.
It still leaves a couple of questions unresolved:
  1. Why was it booting normally in the past?
  2. Why does it boot normally even now if I reinstall Proxmox on ext4?
My best guess is a combination of factors. I assume:
  1. The filesystem on the external drive got corrupted (degradation, write errors). This would explain why ZFS worked well with the external HDD in the past.
  2. Since ZFS has to assess all the drives, it may get stuck on the external drive. This would explain why Proxmox boots normally on ext4.
In any case, removing the external drive normalised the boot process.
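For anyone hitting the same symptom, it may be worth checking from a running system which block devices are attached over USB before pulling cables. Below is a rough sketch, assuming a Linux host where the entries in /sys/block resolve to device paths that contain a usbN component for USB-attached disks. The function names are hypothetical. This only lists candidates; it does not show that a given drive is the one stalling the boot.

```python
import os


def usb_block_devices(resolved_paths):
    """Return names of block devices whose sysfs path passes through a USB bus.

    resolved_paths maps a device name (e.g. "sdc") to its resolved sysfs path.
    """
    return sorted(
        name
        for name, path in resolved_paths.items()
        if any(part.startswith("usb") for part in path.split("/"))
    )


def read_sysfs_block(sys_block="/sys/block"):
    """Map each block device name to its resolved sysfs path.

    Returns an empty dict if sysfs is not available (e.g. not on Linux).
    """
    if not os.path.isdir(sys_block):
        return {}
    return {
        name: os.path.realpath(os.path.join(sys_block, name))
        for name in os.listdir(sys_block)
    }


if __name__ == "__main__":
    for name in usb_block_devices(read_sysfs_block()):
        print(f"/dev/{name} is attached via USB")
```

Any device it reports is a candidate to disconnect and then retest the boot time.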
 
