Proxmox Install Fails to Boot After ZFS Boot Mirror Replacement

rbeard.js

Member
Aug 11, 2022
Hi there,
I have a boot mirror of SSD drives that were giving me disk errors. The wearout percentage was at 99% for both devices in the mirror, so I attempted to replace them. Here are the steps I took and the commands I ran:

Code:
sgdisk --zap-all /dev/sdk
sgdisk --zap-all /dev/sdl
wipefs -a /dev/sdk
wipefs -a /dev/sdl

sgdisk /dev/sdi -R=/dev/sdk
sgdisk /dev/sdj -R=/dev/sdl

sgdisk -G /dev/sdk
sgdisk -G /dev/sdl

zpool replace -f rpool  ata-CT240BX500SSD1_2244E680A27B-part3 /dev/sdk3
watch zpool status

zpool replace -f rpool  ata-CT240BX500SSD1_2244E6809BEA-part3 /dev/sdl3
watch zpool status

proxmox-boot-tool format /dev/sdl2
proxmox-boot-tool format /dev/sdk2

proxmox-boot-tool init /dev/sdl2
proxmox-boot-tool init /dev/sdk2

Everything looked right, but when I tried booting from the new disks, I got a "boot failed" message in the Dell BIOS.
I went back to the old drives and those boot, but I hit an initramfs prompt.

Did I do something wrong in this process?
How can I recover from here? Either getting the old disks to work to try again or fixing the new ones?
 
From the steps you shared, you did almost everything correctly for replacing the failed boot mirror in a Proxmox ZFS setup, but there are a couple of critical points that explain why your system doesn’t boot. Let me walk you through what likely went wrong and how you can recover.
First, cloning the partition tables with sgdisk -R and then randomizing GUIDs with sgdisk -G was the right approach. You also correctly used zpool replace so the new disks should now contain the ZFS rpool data. The issue is not ZFS itself, but the bootloader. Proxmox systems that use ZFS boot require that GRUB (on legacy BIOS systems) or systemd-boot (on UEFI systems) is properly installed on the EFI or BIOS boot partitions of the new drives. Running proxmox-boot-tool format and init was the right intention, but you need to confirm the target setup: if your server is BIOS-only you must ensure GRUB is installed into the MBR/boot partition of each disk; if it is UEFI, then the EFI System Partition must be correctly populated and visible in the Dell BIOS boot menu.
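If you are unsure which mode the host actually boots in, here is a quick check from the running system (a minimal sketch, assuming a standard Proxmox install with no custom paths):

Code:
# This directory only exists when the system was booted via UEFI; if it is missing, you are in legacy BIOS mode
ls /sys/firmware/efi

# Shows which ESPs proxmox-boot-tool manages and whether they are set up for grub or systemd-boot
proxmox-boot-tool status

# On UEFI systems, lists the boot entries the Dell firmware currently knows about
efibootmgr -v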
The “boot failed” message on the new SSDs means the firmware didn’t find a valid bootloader. The initramfs prompt on the old SSDs suggests either corruption of /etc/zfs/zpool.cache or a mismatch between devices and your initramfs after the disk errors.
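If the cache file turns out to be the problem, it can be regenerated once the pool is imported again; a minimal sketch, assuming you are back inside the installed system with rpool imported:

Code:
# Rewrite /etc/zfs/zpool.cache from the pool as it is currently imported
zpool set cachefile=/etc/zfs/zpool.cache rpool

# Rebuild the initramfs so early boot picks up the fresh cache file
update-initramfs -u -k all
proxmox-boot-tool refresh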
To fix this cleanly, here is what I recommend:
  1. Boot again from the old drives, even if they drop to initramfs. From the initramfs shell, import your rpool manually with zpool import -R /root rpool and then exit. If that doesn’t work, boot a Proxmox installation ISO in rescue mode, import the pool, and chroot into your system.
  2. Once inside your system with the pool imported, make sure /etc/kernel/proxmox-boot-uuids contains the new disks’ EFI partition UUIDs (a quick way to cross-check the UUIDs is sketched after this list). If not, run proxmox-boot-tool format /dev/sdk2 and proxmox-boot-tool format /dev/sdl2 again, then proxmox-boot-tool init /dev/sdk2 and /dev/sdl2.
  3. If you are using legacy BIOS boot (not UEFI), you must explicitly reinstall GRUB to each new disk with grub-install /dev/sdk and grub-install /dev/sdl, followed by update-grub.
  4. If you are using UEFI boot, enter your BIOS setup and make sure the EFI entries for the new disks exist. Sometimes Dell firmware does not automatically detect them; you may need to recreate the UEFI boot entry pointing to \EFI\proxmox\grubx64.efi on the new drives.
  5. Finally, rebuild the initramfs with update-initramfs -u -k all and rerun proxmox-boot-tool refresh so that the boot partitions are fully updated.
  6. From there, test a reboot. If the new SSDs still fail to boot, your fallback is to reattach the old ones, boot with the ISO in rescue mode, and properly reinstall the bootloader as above. The data on the pool should remain intact because ZFS is already resilient.
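For step 2, this is one way to compare what proxmox-boot-tool tracks against the ESPs on the new disks (a sketch only; adjust the device names if sdk/sdl have shifted since the replacement):

Code:
# UUIDs that proxmox-boot-tool will sync kernels and bootloaders to
cat /etc/kernel/proxmox-boot-uuids
proxmox-boot-tool status

# Filesystem UUIDs of the new disks' second partitions (the ESPs); these should appear in the file above
blkid -s UUID -o value /dev/sdk2 /dev/sdl2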

Can you confirm whether your Proxmox installation is using UEFI or legacy BIOS boot? That determines whether you need to fix GRUB in the MBR or systemd-boot in the EFI partition.
 


I do have Proxmox booting via UEFI.
I think I forgot the GRUB flag on the end of this command: # proxmox-boot-tool init <new disk's ESP> [grub]
I at least think this is the case, as there is no grubx64.efi file on the new drives.

I wasn't able to import the rpool on the old drives. I used your command above and I get an error saying the drives do not exist.

I loaded up a PBS ISO and booted into rescue mode, but I get an error there saying that it can't locate rpool, and then it auto-exits.
 
Thanks, that clarifies a lot. Since you are on UEFI boot, the missing grubx64.efi on the new drives is exactly the reason why the Dell BIOS shows “boot failed.” The proxmox-boot-tool init call without the grub argument only prepares the ESP for systemd-boot. On a UEFI + ZFS rpool setup, Proxmox by default still uses grub on ZFS as the bootloader. That’s why the new ESPs are missing the grubx64.efi loader. On top of that, if you can’t import rpool on the old drives anymore, that suggests the ZFS labels are damaged or the drives are flapping in and out due to wear. Let’s break this into recovery steps.
Code:
# Boot from a Proxmox VE ISO, choose "Rescue mode" or "Shell"

# First, check if your pool is visible
zpool import

# If rpool is listed, import it into /mnt
zpool import -R /mnt rpool

# If it does not show, try forcing it
zpool import -f -R /mnt rpool

# If still not visible, try readonly import (may allow recovery)
zpool import -o readonly=on -f -R /mnt rpool

# If none of the above shows rpool, check labels on each disk
zdb -l /dev/sdX3 # replace X with each disk letter (old and new); the rpool member labels live on partition 3

# Once rpool is imported into /mnt, chroot into it
mount --rbind /dev /mnt/dev
mount --rbind /proc /mnt/proc
mount --rbind /sys /mnt/sys
chroot /mnt /bin/bash

# Inside chroot: reinstall bootloader to the new ESPs
proxmox-boot-tool format /dev/sdk2
proxmox-boot-tool format /dev/sdl2
proxmox-boot-tool init /dev/sdk2 grub
proxmox-boot-tool init /dev/sdl2 grub
proxmox-boot-tool refresh

# Make sure initramfs is rebuilt so rpool devices are recognized
update-initramfs -u -k all

# Also update GRUB configuration
update-grub

# Create UEFI boot entries manually (important on some Dell servers)
efibootmgr -c -d /dev/sdk -p 2 -L "Proxmox-SDK" -l '\EFI\proxmox\grubx64.efi'
efibootmgr -c -d /dev/sdl -p 2 -L "Proxmox-SDL" -l '\EFI\proxmox\grubx64.efi'

# Exit chroot and unmount
exit
umount -R /mnt

# Reboot the system and select one of the Proxmox entries in the Dell BIOS menu
reboot


With this flow you first make sure rpool can be imported, then reinstall GRUB into the EFI partitions of both new SSDs, refresh the boot configuration, and explicitly add UEFI boot entries so the Dell firmware finds the loaders. This should restore a clean boot from the new mirror.
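Before rebooting, it is also worth verifying that the loaders really landed on the new ESPs and that the firmware entries exist; a short verification sketch (the mount point /tmp/esp is arbitrary):

Code:
# Both new ESPs should be listed and marked as initialized
proxmox-boot-tool status

# Mount one of the new ESPs read-only and confirm grubx64.efi is present under EFI/proxmox
mkdir -p /tmp/esp
mount -o ro /dev/sdk2 /tmp/esp
ls -R /tmp/esp/EFI
umount /tmp/esp

# The entries created with efibootmgr above should now show up here
efibootmgr -v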
 
So unless I'm missing something, when I choose rescue mode from the ISO it immediately says rpool not found and I have no options to do anything. Any key press bumps me back to the main screen.

Call me crazy, but I'm curious whether I can copy the missing grubx64.efi file from the old drives and paste it into the proper directory on the new drives. I assume there might be a reason this won't work, but considering I can't access the rpool on the old drives, could that work?