premature end of file /initrd.img-*-pve

RhubarbBread

Member
Nov 4, 2023
Hello! I installed updates on my small homelab cluster today and ran into trouble on the second of three nodes. I'm hoping to get some advice on how to fix this without completely rebuilding the node.

All three nodes started at pve-manager/8.4.1/2a5fa54a8503f96d (running kernel: 6.8.12-9-pve) and I updated to the latest available packages through the UI.

The first node completed its upgrades just fine and is now running pve-manager/8.4.6/c5b55b84d1f84ea6 (running kernel: 6.8.12-13-pve).

After upgrading and rebooting the second node, I noticed it was taking much longer than expected to come back (~15 minutes?) and I wasn't getting any response from it over the network. So I made my way to the basement, hard-powered the node off, and hooked up a display and keyboard to begin diagnostics.

On boot, the GRUB menu came up, and when the default Proxmox Virtual Environment option started, I received this error message:

Code:
Loading Linux 6.8.12-13-pve ...
error: invalid cluster 0.
Loading initial ramdisk ...
error: you need to load the kernel first.

Press any key to continue...

Initially I was able to boot the previous kernel from the Advanced options menu in GRUB. Some googling on how to reinstall/repair the new kernel version led me to a combination of update-initramfs and update-grub commands, and later to proxmox-boot-tool refresh.
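
For reference, those repair commands look roughly like this when run as root on the affected node (adjust the kernel version to match yours):

Code:
# rebuild the initramfs for the new kernel (or use -k all for every installed kernel)
update-initramfs -u -k 6.8.12-13-pve
# regenerate the GRUB configuration
update-grub
# copy kernels/initrds to the ESPs managed by proxmox-boot-tool and update their boot entries
proxmox-boot-tool refresh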

I was able to boot into the new kernel once, but after rebooting to verify that the fix would stick, it failed to load again, and this time the old kernel was missing as well. That led me down a path of troubleshooting from the Proxmox installer in debug mode: fsck on the boot partition, mounting the ZFS rpool and chrooting into it to run additional commands, and so on.

At any rate, I am now stuck and would appreciate any advice for getting unstuck.

When I boot the node, I now receive this message:

Code:
Loading Linux 6.8.12-13-pve ...
Loading initial ramdisk ...
error: premature end of file /initrd.img-6.8.12-13-pve.

Press any key to continue...

I suspect something is off about my boot partition, but I'm a little out of my depth here, so any advice is appreciated.

A bit more about my setup, in case it proves helpful:
  • Three-node cluster of HP Z2 G4 desktops, identical hardware (except some ceph OSD differences, but we're not getting to that point so I doubt it's relevant)
  • ZFS mirror boot pool on NVMe drives
  • UEFI with Secure Boot has been in use since the cluster was installed; no migration of boot methods has been performed, as far as I remember.
 
I took another stab at fixing this today and got it resolved by doing this:
  1. If enabled, turn off Secure Boot enforcement in the server's UEFI/BIOS settings
  2. Boot from the Proxmox installer USB disk and enter Advanced > Terminal Installer (Debug Mode)
  3. After launching the debug-mode installer, press Ctrl+D to drop to the debug shell
  4. Mount the ZFS boot pool and ancillary file systems (a combined command listing follows these steps)
    1. zpool import -f -R /mnt rpool (-f forces import of a pool owned by another system and -R sets the mount point, rpool is the default name of the ZFS boot pool)
    2. mount -o rbind /proc /mnt/proc
    3. mount -o rbind /sys /mnt/sys
    4. mount -o rbind /dev /mnt/dev
    5. mount -o rbind /run /mnt/run
  5. chroot to the mounted file system and launch a bash shell: chroot /mnt /bin/bash
  6. Validate that the chroot was successful: ls /mnt should show pve at a minimum, plus any other directories that were present on the host.
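Putting steps 4 through 6 together, the session from the debug shell looks roughly like this (pool name and mount point as described above):

Code:
# import the pool under /mnt without touching the installer environment's own root
zpool import -f -R /mnt rpool
# expose the kernel's pseudo-filesystems inside the target root
mount -o rbind /proc /mnt/proc
mount -o rbind /sys /mnt/sys
mount -o rbind /dev /mnt/dev
mount -o rbind /run /mnt/run
# switch into the installed system
chroot /mnt /bin/bash
# sanity check: pve (and any other host directories) should be visible
ls /mnt
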
The EFI system partition (ESP) is where the initramfs and other bootloader files are stored. With a mirrored ZFS boot pool there are two of them, kept in sync: FAT file systems on a 512MB or 1GB partition on each of the pool's disks. In my case they are nvme0n1p2 and nvme1n1p2; for you they may be sda2 and sdb2. To locate the EFI partitions on your system (example commands follow this list):
  1. Run lsblk -o +FSTYPE and look for the vfat FSTYPE on a pair of partitions which are 512MB or 1GB in size. You should also find a zfs_member partition on the same disk.
  2. Run blkid /dev/<device> e.g. blkid /dev/nvme0n1p2 and note the UUID. Repeat for the second partition.
  3. Run cat /etc/kernel/proxmox-boot-uuids and verify that the same UUIDs are listed here.
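As an example of what that looks like on an NVMe system like mine (your device names and UUIDs will differ):

Code:
lsblk -o +FSTYPE                     # two ~512MB/1GB vfat partitions alongside the zfs_member partitions
blkid /dev/nvme0n1p2                 # note the UUID of the first ESP
blkid /dev/nvme1n1p2                 # note the UUID of the second ESP
cat /etc/kernel/proxmox-boot-uuids   # both UUIDs should be listed here
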
Next I reinitialized the two EFI partitions using proxmox-boot-tool (an optional check follows the list):
  1. proxmox-boot-tool reinit
  2. proxmox-boot-tool refresh
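Optionally, you can double-check before leaving the chroot; proxmox-boot-tool status should report the same UUIDs as /etc/kernel/proxmox-boot-uuids and show them as initialized:

Code:
proxmox-boot-tool status
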
Then I closed up the chroot and rebooted the system (combined commands follow the list):
  1. exit
  2. umount /mnt/proc
  3. umount /mnt/sys (if you get a message saying the device is busy, re-run the command with -l for a lazy unmount)
  4. umount /mnt/dev
  5. umount /mnt/run
  6. zpool export rpool
  7. Ctrl+Alt+Del to reboot the server immediately
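In one block, the cleanup looks roughly like this:

Code:
exit                                             # leave the chroot
umount -l /mnt/proc /mnt/sys /mnt/dev /mnt/run   # -l (lazy) helps if any of them report busy
zpool export rpool
# then Ctrl+Alt+Del to reboot
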
Boot into GRUB and select the default Proxmox option. You may get an (initramfs) prompt. If you do, run zpool import -f rpool and exit to mount the ZFS boot pool and continue booting. This should only need to be done during this boot.
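
At that prompt, the two commands are simply:

Code:
zpool import -f rpool
exit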

Once booted, run uname -r to verify the kernel version in use.
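
On this node the expected output is the new kernel:

Code:
uname -r
# expected output: 6.8.12-13-pve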

Reboot the server and verify that Proxmox loads without interaction.

Finally, if desired, reboot and re-enable Secure Boot enforcement in the system UEFI/BIOS settings.