[SOLVED] Grub boot ZFS problem

Hey guys!

I'm asking about the "Grub boot ZFS problem" described here: https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks#Grub_boot_ZFS_problem

I have a couple of Dell R620s; some boot normally and some don't, even though the setup is identical. Has anybody managed to get to the root of this issue?

Details:
The server initialises all the devices and firmware and reaches the point where it would normally boot into the OS. In my case it then sits there for 15 minutes or so and, most interestingly, afterwards it boots normally (no errors, no ZFS pool issues). The filesystem is ZFS in RAID10 on 4 SAS drives.
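A side note for anyone debugging something similar: checks like the ones below (standard systemd/journal commands, nothing specific to my setup) can help tell whether the wait happens inside the booted system or earlier in the chain (GRUB/initramfs):

  # total boot time as seen by systemd (kernel + userspace on BIOS boots)
  systemd-analyze
  systemd-analyze blame | head

  # kernel log of the current boot - a large timestamp gap early on points
  # at the initramfs/pool-import stage rather than at userspace
  journalctl -b -k | head -n 60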

What I tested:
  1. Changed the RAID controller (my 4 drives are each exposed as a single-disk RAID0)
  2. Changed the hard drives
  3. Tested the Proxmox OS on an ext4 fs (no issues, boots normally)
  4. Updated the firmware (iDRAC)
  5. Did an iDRAC reset
  6. Restored the iDRAC to factory settings
  7. Added rootdelay to GRUB (see the snippet below)
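For reference, this is roughly how I added it - the usual Debian/Proxmox way of passing kernel parameters; the 10-second value is just what I tried, not a recommendation:

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=10"

  # regenerate the GRUB config and reboot
  update-grub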
The problem description in the wiki states:
  • Symptoms: stuck at boot with a blinking prompt.
  • Reason: with a ZFS RAID it can happen that your mainboard does not initialise all your disks correctly, and GRUB waits for all RAID disk members - and fails. It can happen with more than 2 disks in a ZFS RAID configuration - we saw this on some boards with ZFS RAID-0/RAID-10.
I was wondering if there are ways to mitigate the issue other than installing the Proxmox OS on a separate drive with an ext4 fs.
 

udo

Hi,
I have some R620s too, but only with two disks as ZFS below a PERC (RAID-0 construct).
None of them have any boot trouble... but I see a similar effect on an R410 with SATA enabled in the BIOS. After disabling SATA in the BIOS, the boot went fine, without the 10 minutes of dead time.

Udo
 
Thank you @udo

To be honest, I'm not sure where to start digging. I have 4 Dell R620s, all with an identical setup and (give or take) identical hardware. The drives are from the same batch, the RAID controllers are of the same type, and the BIOS and iDRAC settings are identical. The weird part is: I had the slow-boot issue on machine #3. After updating the Lifecycle Controller to the latest version, machine #3 started to boot quickly, while machine #1 STARTED booting slowly. Moreover, this is happening with the ZFS file system only. Every machine works fine on ext4 (how so?)

I will try to get to the physical location in a couple of weeks and play around with resets and hard power-offs. I hope it's a weird hardware glitch, but I doubt it. Do you have any 12th-gen Dells running Proxmox on ZFS?
 
OK, leaving this here for posterity. No, the issue was not related to an error with the BIOS, hardware, firmware, etc.
It turns out the external HDD connected to the server via USB was causing the issue.
That still leaves a couple of questions unresolved:
  1. Why was it booting normally in the past?
  2. Why does it still boot normally if I reinstall Proxmox on an ext4 fs?
My best guess is a combination of factors. I assume:
  1. the fs of the external drive got corrupted (degradation, write errors), which would explain why ZFS used to work fine with the external HDD attached
  2. since ZFS has to scan all the drives, it can get stuck on the external one (which is why Proxmox on ext4 boots normally)
In any case, removing the external drive brought the boot process back to normal.
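For anyone who runs into something similar, these are roughly the checks I'd do to confirm it (rpool is the default pool name of a Proxmox ZFS install; adjust if yours differs):

  # list every block device the machine sees, including USB-attached disks
  lsblk -o NAME,TRAN,SIZE,MODEL

  # the devices the root pool is actually built from
  zpool status rpool

  # ZFS-related messages from the current boot, with timestamps
  journalctl -b | grep -i -e zfs -e zpool

Unplugging the suspect drive and timing the next boot is still the quickest test.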
 
