Hello!
I recently had two servers crash. Both had been used in a PVE v3 cluster for a few years without any issues, but after a clean wipe and fresh install of PVE v4, both have crashed, most likely under high I/O workloads. They have a HW RAID-6 with a ZFS RAID-0 install on top as the root. Last weekend one crashed and came up with a bare "grub rescue> " prompt, and while I was restoring some CTs onto the other one, it suddenly rebooted and came up with the same grub rescue prompt. I also saw another server crash while I was restoring another large CT onto it, but luckily it came up cleanly, and I have not attempted any further major I/O on it.
Some details on the crashes:
* nothing was written to the system logs or the hardware logs at the time of the crash - the logs simply pick up again with the boot process
* these were slightly out of date - pve-manager/4.4-5/c43015a5 (running kernel: 4.4.35-2-pve)
* I believe these two machines have never been updated - they were installed with the above version a few months ago and, after some brief testing, joined the PVE v4 cluster
After the BIOS, the machines immediately (no delay) drop to:
error: unknown filesystem.
Entering rescue mode...
grub rescue>
When I type 'ls' I see:
(hd0) (hd0,gpt9) (hd0,gpt2) (hd0,gpt1) (fd0)
(I'm assuming that fd0 is just the virtual media from the IPMI - no floppies here!)
When I type 'set', I get:
cmdpath=(hd0)
prefix=(hd0,gpt2)/ROOT/pve-1@/boot/grub
root=hd0,gpt2
I can type 'insmod zfs' and it returns right away, but I still don't get any valid output from 'ls (hd0,gpt2)' or 'insmod normal'. Each of those hangs for several seconds and then reports:
error: unknown filesystem
I was a little stuck finding a ZFS-capable rescue disk, but ultimately discovered that the v3.4 ISO worked with my IPMI keyboard. I also tried the Rescue mode on the v4.4 and v5.0 beta ISOs, but after hanging for several seconds, both reported:
error: failure reading sector 0x0 from `fd0'.
error: no such device: rpool.
ERROR: unable to find boot disk automatically.
I used the commands from https://forum.proxmox.com/threads/grub2-recovery-on-zfs-proxmox-ve-3-4.21306/ to import the zpool and ran a scrub on it (which reported no errors), but after I reinstalled and updated grub and regenerated the initramfs, I still got dumped back to the same "error: unknown filesystem" on boot.
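For reference, the sequence I ran from the v3.4 ISO looked roughly like the sketch below. This is a reconstruction, not a copy of the commands in the linked thread. The pool name rpool comes from the rescue-mode error above; the boot disk /dev/sda, the /mnt altroot and the run/rescue_steps helper names are assumptions for illustration. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Rough sketch of the recovery steps described above.
# ASSUMPTIONS: boot disk is /dev/sda, the live environment ships the ZFS
# tools, and the root dataset mounts at / (so it lands on $TARGET).
POOL="${POOL:-rpool}"      # pool name as shown in the rescue-mode error
DISK="${DISK:-/dev/sda}"   # ASSUMPTION: adjust to the real boot disk
TARGET="${TARGET:-/mnt}"   # altroot for the imported pool
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rescue_steps() {
    # Import the pool under an alternate root and check it
    # (the scrub runs in the background; status shows its progress).
    run zpool import -f -R "$TARGET" "$POOL"
    run zpool scrub "$POOL"
    run zpool status "$POOL"

    # Make the kernel filesystems available inside the chroot.
    run mount -t proc /proc "$TARGET/proc"
    run mount --bind /dev "$TARGET/dev"
    run mount --bind /sys "$TARGET/sys"

    # Reinstall grub, regenerate its config and rebuild the initramfs.
    run chroot "$TARGET" grub-install "$DISK"
    run chroot "$TARGET" update-grub
    run chroot "$TARGET" update-initramfs -u

    # Clean up so the pool can be imported normally on the next boot.
    run umount "$TARGET/sys" "$TARGET/dev" "$TARGET/proc"
    run zpool export "$POOL"
}

rescue_steps
```

All of these steps completed without errors for me, which is why the unchanged grub error on the next boot is so puzzling.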
I did some looking around and found some possible bugs, but didn't find a particular scenario that matched what I'm seeing...
Any help or suggestions?