[SOLVED] Grub boot ZFS problem

Discussion in 'Proxmox VE: Installation and configuration' started by Vladimir Bulgaru, Jun 16, 2019.

  1. Vladimir Bulgaru

    Joined:
    Jun 1, 2019
    Messages:
    109
    Likes Received:
    15
    Hey guys!

    Inquiring about the Grub boot ZFS problem: https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks#Grub_boot_ZFS_problem

    I have a couple of Dell R620s; some boot normally and some don't. The setup is identical. Has anybody managed to get to the root of this issue?

    Details:
    The server initialises all the devices and firmware and reaches the point where it would normally boot into the OS. In my case the server hangs there for 15 minutes or so and, most interestingly, it then boots normally (no errors, no ZFS pool issues). The filesystem is ZFS in RAID10 on 4 SAS drives.

    What I tested:
    1. Changed the RAID controller (my 4 drives are each in RAID0)
    2. Changed the hard drives
    3. Tested the Proxmox OS on an ext4 fs (no issues and boots normally)
    4. Updated the firmware (iDRAC)
    5. Did an iDRAC reset
    6. Restored the iDRAC to factory settings
    7. Added rootdelay to the GRUB kernel command line
    The problem description in the wiki matches my case.
    I was wondering if there are ways to mitigate the issue, other than installing the Proxmox OS on a separate drive with an ext4 fs.
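
    For reference, step 7 above (rootdelay) is the boot-time wait for the root device; a minimal sketch of how it is typically set on Debian/Proxmox follows. The 10-second value is just an example, not a recommendation from this thread:

    ```shell
    # Sketch: add a root device wait (in seconds) to the kernel command line.
    # Edit /etc/default/grub so the default line includes rootdelay, e.g.:
    #   GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=10"
    # A non-interactive way to make that edit (assumes the "quiet" default):
    sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="quiet"/GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=10"/' /etc/default/grub

    # Regenerate the GRUB configuration so the change takes effect on next boot:
    update-grub
    ```

    Note that this only delays root mounting; as the thread later shows, it does not help when the hang is caused by device probing itself.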
     
  2. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
    Hi,
    I have some R620s too, but with only two disks as ZFS below a PERC (RAID-0 construct).
    None of them have any boot trouble... but I see a similar effect on an R410 with SATA enabled in the BIOS. After disabling SATA in the BIOS, it boots fine without the 10-minute dead time.

    Udo
     
  3. Vladimir Bulgaru

    Joined:
    Jun 1, 2019
    Messages:
    109
    Likes Received:
    15
    Thank you @udo

    To be honest, I'm not sure where to start digging. I have 4 Dell R620s. All have an identical setup. All have identical hardware (give or take). The drives are from the same batch. The RAID controllers are of the same type. The BIOS settings are identical, and so are the iDRAC settings. The weird part: I had the long-boot issue on machine #3. After updating the Lifecycle Controller to the latest version, machine #3 started to boot quickly, while machine #1 STARTED booting slowly. Moreover, this happens with the ZFS filesystem only. Every machine works fine on ext4 (how so?)

    I will try to get to the physical location in a couple of weeks and play around with resets and hard power-offs. I hope it's a weird hardware glitch, but I doubt it. Do you have 12th-gen Dells with Proxmox on ZFS?
     
  4. Vladimir Bulgaru

    Joined:
    Jun 1, 2019
    Messages:
    109
    Likes Received:
    15
    Ok, leaving this for posterity. No, the issue was not related to a BIOS, hardware, or firmware error.
    It turns out the external HDD connected to the server via USB was causing the issue.
    It still leaves a couple of questions unresolved:
    1. why was it booting normally in the past?
    2. why, even now, does Proxmox boot normally if I reinstall it on an ext4 fs?
    My best guess - a combination of factors. I assume:
    1. the fs of the external drive got corrupted (degradation, write errors), which explains why ZFS worked fine alongside the external HDD in the past
    2. since ZFS probes all attached drives at boot, it can get stuck on the faulty external drive (which is why Proxmox on ext4 boots normally)
    In any case, removing the external drive normalised the boot process.
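
    If physically removing the drive is ever inconvenient, one possible mitigation (an assumption on my part, not something I tested) is to narrow where the ZFS-on-Linux initramfs scripts look for pool devices at import time, via ZPOOL_IMPORT_PATH in /etc/default/zfs, so that only stable device paths are scanned:

    ```shell
    # Sketch, untested: restrict boot-time pool-import scanning.
    # /etc/default/zfs (Debian/Proxmox, ZFS on Linux) accepts a
    # colon-separated list of directories to search for pool members:
    #   ZPOOL_IMPORT_PATH="/dev/disk/by-id"
    # /dev/disk/by-id still lists USB disks, so excluding a flaky USB
    # drive entirely may need a more selective list -- inspect with:
    ls -l /dev/disk/by-id /dev/disk/by-path

    # After editing /etc/default/zfs, rebuild the initramfs so the
    # setting is picked up at boot:
    update-initramfs -u
    ```

    Whether this avoids the hang depends on where the probe stalls; removing the faulty drive remains the sure fix.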
     