[SOLVED] Replace Failed Drive in ZFS

utkonos

The documentation is clear on the command needed to replace the failed drive after the new one is installed:

Bash:
zpool replace -f <pool> <old-device> <new-device>

However, I am unsure which exact device name to use. Both zpool list -v and zpool status show device names that are formatted differently from /dev/<device>.

Code:
# zpool list -v
NAME                                                  SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool                                                2.78T   148G  2.64T        -         -    10%     5%  1.00x  DEGRADED  -
  raidz1-0                                           2.78T   148G  2.64T        -         -    10%  5.18%      -  DEGRADED
    nvme-eui.e8238fa6bf530001001b448b462c33f4-part3   953G      -      -        -         -      -      -      -    ONLINE
    nvme-eui.e8238fa6bf530001001b448b4629b03c-part3   953G      -      -        -         -      -      -      -    ONLINE
    nvme-eui.e8238fa6bf530001001b448b462c32d9-part3      -      -      -        -         -      -      -      -   REMOVED

How can I find the new device name to use here? Or is this incorrect and a different pair of device names needs to be used?

Bash:
zpool replace -f rpool nvme-eui.e8238fa6bf530001001b448b462c32d9-part3 <new-device>

The PVE instance is booting from ZFS too, so I want to be 100% sure that what I do will repair the system to where it started from:
[attached screenshot]
 
Since this is the rpool (whose drives also contain the boot ESP), you probably want to partition the new drive first, add its second partition to proxmox-boot-tool (and remove the old ESP from it), and replace the removed disk (/dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b462c32d9-part3) with the third partition.
See the manual for more details: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_zfs_change_failed_dev
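
Roughly, the whole sequence from the manual would look like this (a sketch only; <old-id> and <new-id> stand in for the by-id names of the removed and the replacement NVMe disk, and the /dev/nvmeXn1 names have to match your system):

Bash:
# Copy the partition layout from a healthy pool member to the new disk, then randomize its GUIDs
sgdisk /dev/nvme0n1 -R /dev/nvme2n1
sgdisk -G /dev/nvme2n1

# Replace the removed vdev with partition 3 of the new disk
zpool replace -f rpool /dev/disk/by-id/<old-id>-part3 /dev/disk/by-id/<new-id>-part3

# Set up the new ESP on partition 2, then drop stale ESP entries
proxmox-boot-tool format /dev/disk/by-id/<new-id>-part2
proxmox-boot-tool init /dev/disk/by-id/<new-id>-part2
proxmox-boot-tool clean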
 
Here is what I have done.

Bash:
# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
67A1-0BFE is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
67A4-533A is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
WARN: /dev/disk/by-uuid/67A4-9AA7 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping

According to the documentation, this is the command to determine if the system is "using systemd-boot or GRUB through proxmox-boot-tool, or plain GRUB as bootloader". The documentation doesn't, however, say what it is that I am supposed to look for in the output of the command to determine which of these is being used. I am making an assumption that "System currently booted with uefi" means "systemd-boot" but I am not 100% sure of this. It would be helpful if the documentation were more specific about exactly what the output of that command would be in each of those three cases.
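
For completeness, here are the extra checks I can think of to cross-check the boot mode (my own guesses, not something the docs spell out):

Bash:
# UEFI vs. legacy BIOS: this directory only exists when booted via UEFI
ls /sys/firmware/efi

# If systemd-boot is installed, this prints its version and ESP details
bootctl status

# List the UEFI boot entries registered in NVRAM
efibootmgr -v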

The docs then say the first steps are the same for all three cases. I ran these commands with no problems:

Bash:
sgdisk /dev/nvme0n1 -R /dev/nvme2n1
sgdisk -G /dev/nvme2n1
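
A quick sanity check at this point would be to print the partition table of the new disk and compare it with a healthy member (my own addition, not from the docs):

Bash:
# The new disk should now show the same three partitions as the healthy members
sgdisk -p /dev/nvme2n1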

Then, to make sure that I get exactly the correct device, I ran the following two commands:

Bash:
zpool list -v
ls -l /dev/disk/by-id/

From the output of these two commands, I can determine the ID of the old, removed device as well as that of the newly installed device. This results in the following command:

Bash:
zpool replace -f rpool /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b462c32d9-part3 /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part3

With this, the most important part of the process completed successfully once resilvering was done:
scan: resilvered 41.4G in 00:02:04 with 0 errors on Tue May 28 18:15:30 2024
[attached screenshot]
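
A plain status check should now show all three raidz1 members ONLINE again (a sanity check, not part of the documented steps):

Bash:
zpool status rpool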

And the last steps in the documentation are the following:

Bash:
proxmox-boot-tool format /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2
proxmox-boot-tool init /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2

Unfortunately, the second of these two commands failed:

Bash:
# proxmox-boot-tool init /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
UUID="FD75-5705" SIZE="536870912" FSTYPE="vfat" PARTTYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b" PKNAME="nvme2n1" MOUNTPOINT=""
Mounting '/dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2' on '/var/tmp/espmounts/FD75-5705'.
Installing systemd-boot..
E: bootctl is not available - make sure systemd-boot is installed

I think the problem is that I don't know what to look for in the output of proxmox-boot-tool status because the documentation doesn't actually say what to look for.

It seems that I may have "GRUB through proxmox-boot-tool" but the output of the status command doesn't say anything to that effect that I can see.

Any help on this last step that is failing?
 
According to the documentation, this is the command to determine if the system is "using systemd-boot or GRUB through proxmox-boot-tool, or plain GRUB as bootloader". The documentation doesn't, however, say what it is that I am supposed to look for in the output of the command to determine which of these is being used. I am making an assumption that "System currently booted with uefi" means "systemd-boot" but I am not 100% sure of this.
ZFS rpool with UEFI is systemd-boot unless you use Secure Boot, but I think proxmox-boot-tool calls systemd-boot uefi (which I consider confusing). But I don't think it matters, as I don't know of any step where this makes a difference.
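
If you want to rule out the Secure Boot case, you can check its state directly (assuming mokutil is installed; the dmesg line should work either way):

Bash:
# Either of these shows whether Secure Boot is enabled
mokutil --sb-state
dmesg | grep -i 'secure boot'
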
It would be helpful if the documentation were more specific about exactly what the output of that command would be in each of those three cases.

The docs then say the first steps are the same for all three cases. I ran these commands with no problems:

Bash:
sgdisk /dev/nvme0n1 -R /dev/nvme2n1
sgdisk -G /dev/nvme2n1

Then, to make sure that I get exactly the correct device, I ran the following two commands:

Bash:
zpool list -v
ls -l /dev/disk/by-id/

From the output of these two commands, I can determine the ID of the old, removed device as well as that of the newly installed device. This results in the following command:

Bash:
zpool replace -f rpool /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b462c32d9-part3 /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part3

And the last steps in the documentation are the following:

Bash:
proxmox-boot-tool format /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2
proxmox-boot-tool init /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2
That all looks fine.
Unfortunately, the second of these two commands failed:

Bash:
# proxmox-boot-tool init /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
UUID="FD75-5705" SIZE="536870912" FSTYPE="vfat" PARTTYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b" PKNAME="nvme2n1" MOUNTPOINT=""
Mounting '/dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2' on '/var/tmp/espmounts/FD75-5705'.
Installing systemd-boot..
E: bootctl is not available - make sure systemd-boot is installed
That's unexpected. Fortunately, you still have two ESPs that should boot fine on the other two drives.
I think the problem is that I don't know what to look for in the output of proxmox-boot-tool status because the documentation doesn't actually say what to look for.

It seems that I may have "GRUB through proxmox-boot-tool" but the output of the status command doesn't say anything to that effect that I can see.

Any help on this last step that is failing?
Try running bootctl yourself (it should show some information about the boot loader). If the command does not exist, then run apt install systemd-boot yourself. Maybe that will help, or someone else might know.
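
Spelled out, that suggestion would be something like this (a sketch; the -part2 path is the new ESP from your earlier post):

Bash:
# Is bootctl present at all?
command -v bootctl

# If not, install systemd-boot and retry the init step
apt install systemd-boot
proxmox-boot-tool init /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2
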
The most important thing is to secure the data on the rpool by replacing the removed part3 of the RAIDZ1 with the new one. You can figure out the ESP/systemd-boot/bootctl problem later.
 
Yes, agreed about dealing with the boot problems later. The rpool is fixed and all three disks are online. The most important problem is past.

I made an assumption that I'm using grub and that appending grub to the second command would be the answer. The command did succeed, but resulted in a worse state:

Bash:
# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
67A1-0BFE is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
67A4-533A is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
WARN: /dev/disk/by-uuid/67A4-9AA7 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping
FD75-5705 is configured with: grub (versions: 6.5.13-5-pve, 6.8.4-3-pve)

Now I need to figure out how to undo what this just did. And then figure out the correct command.
 
I made an assumption that I'm using grub and that appending grub to the second command would be the answer. The command did succeed, but resulted in a worse condition:
Is the boot menu blue with text in the top-left (GRUB) or is it black with text in the middle of the screen (systemd-boot)?
Bash:
# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
67A1-0BFE is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
67A4-533A is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
WARN: /dev/disk/by-uuid/67A4-9AA7 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping
FD75-5705 is configured with: grub (versions: 6.5.13-5-pve, 6.8.4-3-pve)
You can always proxmox-boot-tool format it again and remove it from /etc/kernel/proxmox-boot-uuids (or run proxmox-boot-tool clean).
I'm quite convinced now that your Proxmox uses systemd-boot (compare cat /proc/cmdline with /etc/kernel/cmdline and /etc/default/grub for example).
Did you try bootctl and installing systemd-boot as I suggested?
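
For the cmdline comparison, something like this would do (if the running kernel's command line matches /etc/kernel/cmdline rather than the GRUB defaults, the system is booting through proxmox-boot-tool with systemd-boot):

Bash:
cat /proc/cmdline
cat /etc/kernel/cmdline
grep CMDLINE /etc/default/grub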
 
I think I am almost there. I did apt install systemd-boot and ran the following commands:

Bash:
proxmox-boot-tool format /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2 --force
proxmox-boot-tool init /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2

Now there are three uefi entries that look correct according to proxmox-boot-tool status. But there are also two warnings for entries that do not exist. I think one is from the dead drive and the other is a remnant of the mistaken grub init from before.

Bash:
# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
67A1-0BFE is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
67A4-533A is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
WARN: /dev/disk/by-uuid/67A4-9AA7 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping
94FB-CB1C is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
WARN: /dev/disk/by-uuid/FD75-5705 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping

How can I remove these two entries? This looks to be the last thing to clean up.
 
I think I am almost there. I did apt install systemd-boot and ran the following commands:
I guess that's the fix you needed.
Bash:
proxmox-boot-tool format /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2 --force
proxmox-boot-tool init /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b4977bb4f-part2

Now there are three uefi entries that look correct according to proxmox-boot-tool status. But there are also two warnings for entries that do not exist. I think one is from the dead drive and the other is a remnant of the mistaken grub init from before.
Yes. Each time you format, a new filesystem with a new ID is written.
Bash:
# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
67A1-0BFE is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
67A4-533A is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
WARN: /dev/disk/by-uuid/67A4-9AA7 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping
94FB-CB1C is configured with: uefi (versions: 6.5.13-5-pve, 6.8.4-3-pve)
WARN: /dev/disk/by-uuid/FD75-5705 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping

How can I remove these two entries? This looks to be the last thing to clean up.
proxmox-boot-tool clean (which you can find in the manual or by running proxmox-boot-tool help) or edit /etc/kernel/proxmox-boot-uuids manually (as I wrote before).
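
For reference, the cleanup would just be (a quick sketch):

Bash:
# Drop ESP UUIDs that no longer exist, then verify what is left
proxmox-boot-tool clean
proxmox-boot-tool status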
 
Yes, proxmox-boot-tool clean is exactly what was needed. Everything is fixed and cleaned up. It would be nice if the documentation explained that, in the output of proxmox-boot-tool status, uefi means systemd-boot via proxmox-boot-tool and grub means GRUB through proxmox-boot-tool. It would also be nice if the documentation mentioned the need to manually run apt install systemd-boot, or maybe there is a bug to fix there?

Anyway, this is solved, thanks for all the help.
 
