Server Unbootable After ZFS Mirror Failure — Missing ESP on Remaining Drive?

Coffeeri

Hi Proxmox Community,

I'm facing an issue with a Proxmox VE server that has become unbootable after one of the drives in the boot ZFS mirror failed. I've confirmed the remaining drive lacks an ESP partition and need advice on the best recovery path.

Setup:
  • Proxmox VE (Version 8.x) installed on a ZFS mirror (rpool) across two NVMe SSDs (intended for UEFI boot).
Problem:
  • One of the NVMe SSDs in the mirror failed completely.
  • The server no longer boots.
Goal:
  • Make the server bootable again using the single remaining functional NVMe SSD.
  • Once the replacement NVMe arrives, rebuild the mirror.
Troubleshooting Steps Taken:
  1. Booted from Proxmox VE Installation USB: Used the official installer USB stick.
  2. Selected Rescue Boot: Navigated to Advanced Options -> Rescue Boot.
  3. Rescue Boot Error: The rescue boot process failed to automatically find the boot disk (error: no such device: rpool. ERROR: unable to find boot disk automatically.).
  4. Entered Shell via TUI Installer: Successfully entered the basic rescue command-line shell.
  5. Checked Disks: Ran lsblk and sgdisk -p /dev/nvme0n1 to identify the remaining good NVMe SSD (nvme0n1) and its partitions.
The Core Issue - ESP Confirmed Missing

The output of sgdisk -p /dev/nvme0n1 confirms that there is no EFI System Partition (ESP) on the remaining working drive:

Disk /dev/nvme0n1: 1953525168 sectors, 931.5 GiB
Model: WD_BLACK SN850X 1000GB
[...]
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      1953507327   931.5 GiB  BF01  zfs-251b95c7537d3a65   <-- ZFS Data rpool
   9      1953507328      1953523711     8.0 MiB  BF07                         <-- Solaris reserved / bios_grub?

(No partition with code EF00 exists)

This means proxmox-boot-tool cannot be used as it has no target partition.
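
For anyone wanting to repeat the check, here is a minimal sketch that scans sgdisk -p output for a partition with type code EF00. The function name has_esp is made up for illustration, and it assumes the column layout shown above, with the type code in the sixth whitespace-separated field.

```shell
# has_esp: read `sgdisk -p` output on stdin and succeed (exit 0) only if
# some partition row carries the EFI System Partition type code EF00.
# Assumes the row layout: Number Start End Size Unit Code [Name].
has_esp() {
    awk '$1 ~ /^[0-9]+$/ && $6 == "EF00" { found = 1 } END { exit found ? 0 : 1 }'
}

# Example use on a live system (not executed here):
#   sgdisk -p /dev/nvme0n1 | has_esp && echo "ESP present" || echo "no ESP"
```

On the partition table above, this would report that no ESP exists, since the only type codes are BF01 and BF07.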

My Questions:
  1. Given the confirmed absence of an ESP, is there any supported method to make this drive bootable without a full reinstall, or is attempting to manually create/resize partitions too risky?
  2. Is the recommended (and safest) path forward now to:
  • Use the rescue shell to import rpool read-only and back up /etc/pve and any other critical data from the ZFS partition (nvme0n1p1).
  • Perform a clean Proxmox reinstall onto nvme0n1, allowing the installer to correctly partition the drive (including creating a new ESP).
  • Restore the backed-up configuration and data?
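
A rough sketch of what the backup step could look like from the rescue shell, assuming the ZFS tools are available there. The destination directory, the alternate root /mnt/rpool, and the function names are placeholders, not anything Proxmox prescribes. One thing to keep in mind: on a running system /etc/pve is a FUSE view provided by pmxcfs and backed by /var/lib/pve-cluster/config.db, so on an imported pool the database file is the thing to copy. The helper only prints the commands unless DRY_RUN=0 is set.

```shell
# Dry-run helper: prints each command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# backup_pve_config DEST: copy key configuration files off the pool to DEST.
backup_pve_config() {
    dest="$1"              # placeholder target, e.g. a mounted USB stick
    altroot="/mnt/rpool"   # placeholder alternate root for the import

    run mkdir -p "$dest"
    # -f: the pool was last used by another system; readonly=on: never write to it.
    run zpool import -f -o readonly=on -R "$altroot" rpool
    # The contents of /etc/pve live in this database file.
    run cp -a "$altroot/var/lib/pve-cluster/config.db" "$dest/"
    run cp -a "$altroot/etc/network/interfaces" "$dest/"
    run zpool export rpool
}
```

Calling backup_pve_config /mnt/usb/pve-backup first shows the planned commands. Running it again with DRY_RUN=0 would execute them.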
Any confirmation or alternative suggestions would be greatly appreciated.
Thank you!
 
  1. Given the confirmed absence of an ESP, is there any supported method to make this drive bootable without a full reinstall, or is attempting to manually create/resize partitions too risky?
Only if there is free unpartitioned space to create an additional ESP. If rescue boot works for you, you can add an ESP (even at the end of the physical drive) and use proxmox-boot-tool to format and initialize it. See the manual for details: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot

  2. Is the recommended (and safest) path forward now to:
  • Use the rescue shell to import rpool read-only and back up /etc/pve and any other critical data from the ZFS partition (nvme0n1p1).
  • Perform a clean Proxmox reinstall onto nvme0n1, allowing the installer to correctly partition the drive (including creating a new ESP).
  • Restore the backed-up configuration and data?
A new install is not necessary if the rescue boot works and boots you into the existing Proxmox installation. From there you can partition the new drive and make it part of the mirror. Make sure not to forget the ESP. See "Changing a failed bootable device" in the manual: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_zfs_change_failed_dev . After that you can add an ESP to the old drive (as mentioned above) or simply redo the entire old drive.
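
As a sketch of the ESP part only: once a partition has been set aside for it, the manual's sequence is proxmox-boot-tool format followed by proxmox-boot-tool init. The device path below is a placeholder for whatever ESP partition you create, and the wrapper functions are just for illustration. They print the commands unless DRY_RUN=0 is set.

```shell
# Dry-run helper: prints each command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# init_esp PARTITION: format an ESP and register it with proxmox-boot-tool.
init_esp() {
    esp="$1"   # placeholder, e.g. the ESP partition on the replacement drive
    run proxmox-boot-tool format "$esp"
    run proxmox-boot-tool init "$esp"
    # List the ESPs proxmox-boot-tool now keeps in sync.
    run proxmox-boot-tool status
}
```

For example, init_esp /dev/nvme1n1p2 would print the three commands for review. The partition number here is an assumption and depends on how the new drive is laid out.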
 
Hi @leesteken,

Thank you for your suggestions and the links to the documentation.

Regarding adding an ESP:

Only if there is free unpartitioned space to create an additional ESP.
I double-checked the partition table on the remaining drive (nvme0n1) using sgdisk -p /dev/nvme0n1. It confirmed that there is only 1.7 MiB of total free unpartitioned space on the drive.


Code:
root@proxmox:/# sgdisk -p /dev/nvme0n1
[ 591.524120] nvme0n1: p1 p9

Disk /dev/nvme0n1: 1953525168 sectors, 931.5 GiB
Model: WD_BLACK SN850X 1000GB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): [...]
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 1953525134
Partitions will be aligned on 2048-sector boundaries
Total free space is 3437 sectors (1.7 MiB)

Number Start (sector)    End (sector)  Size       Code  Name
   1            2048      1953507327   931.5 GiB  BF01  zfs-251b95c7537d3a65
   9      1953507328      1953523711     8.0 MiB  BF07

Unfortunately, this is far too small to create the needed ESP (which seems to require ~512 MiB for Proxmox). Creating the necessary space would require shrinking the main ZFS data partition (nvme0n1p1), which seems very risky and potentially destructive to the rpool data.
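
To put numbers on it, here is a quick calculation using the sector counts from the sgdisk output above. It assumes a 512 MiB ESP and 512-byte sectors. The function names are made up for this example.

```shell
SECTOR_BYTES=512   # logical sector size reported by sgdisk

# mib_to_sectors MIB: number of sectors needed for MIB mebibytes.
mib_to_sectors() { echo $(( $1 * 1024 * 1024 / SECTOR_BYTES )); }

# fits_esp AVAILABLE_SECTORS ESP_MIB: succeed if the space is large enough.
fits_esp() { [ "$1" -ge "$(mib_to_sectors "$2")" ]; }

free=3437                                # "Total free space" from sgdisk
p9=$(( 1953523711 - 1953507328 + 1 ))    # size of partition 9 in sectors
need=$(mib_to_sectors 512)

echo "needed: $need sectors, free: $free, free + partition 9: $(( free + p9 ))"
# prints: needed: 1048576 sectors, free: 3437, free + partition 9: 19821
```

So even if the 8 MiB partition 9 were given up, the available space would still be about fifty times too small.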

So it seems I am left with only a new install and transferring the data?

Thanks again for your input!