Replace degraded zfs drive with itself

Nov 29, 2024
Greetings,

We had a server with a dead SATA cable. It caused a lot of checksum errors on a drive, degrading the zpool.
After replacing the cable, I wanted to tell ZFS to resilver onto the same drive by formatting it.
Unfortunately, generating new partition IDs with sgdisk -G /dev/sdc gave me the same IDs, and I was unable to replace the faulty drive with itself.

Because it was a new server, I took a new drive and threw it in there.

Am I able to somehow resilver onto an already known drive, or to manipulate the IDs I get from sgdisk?
Is there some easier way to do that?

The drive is a boot drive for PVE, using systemd-boot.

I pretty much did this:

1. Removed the drive
2. gdisk /dev/sdd -> created a new GUID partition table on another system, clearing the drive
3. Inserted the drive and made sure it wasn't showing old partitions using lsblk
4. sgdisk /dev/sdd -R /dev/sdc
5. sgdisk -G /dev/sdc
6. zpool replace rpool /dev/disk/by-id/OldDeadPartitionID /dev/disk/by-id/NewHealthyPartitionID (<- those were identical)
I also tried -f.
 
After replacing the cable, I wanted to tell ZFS to resilver onto the same drive by formatting it.
Unnecessary. You don't need a resilver if it is the same disk. A scrub would have fixed that.
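
A minimal sketch of that same-disk case, assuming the pool is called rpool as in this thread and that the cable was the only fault: clear the error counters, scrub, then check the result. The run wrapper, the DRY_RUN variable and the function name are hypothetical helpers added for illustration; by default the commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Reset the error counters, re-verify all data, then show the result.
scrub_same_disk() {
  pool="$1"
  run zpool clear "$pool"       # clear the logged device errors
  run zpool scrub "$pool"       # re-read all data and repair from redundancy
  run zpool status -v "$pool"   # check scrub progress and remaining errors
}

scrub_same_disk rpool
```

Running it with DRY_RUN=0 as root would execute the commands for real. If the error counters climb again during the scrub, the cable or port is probably still at fault.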

Here is my snippet. Be careful to choose the correct disks!
Code:
Wipe the formerly "faulty" disk first:
wipefs -a /dev/sdX

The first steps of copying the partition table, reissuing GUIDs and replacing the ZFS partition are the same. To make the system bootable from the new disk, different steps are needed which depend on the bootloader in use.

zpool status
zpool offline rpool /dev/source # (=failed disk)
## shut down, then install the new disk or swap the disks
sgdisk /dev/source -R /dev/target
sgdisk --randomize-guids /dev/target

ls -l /dev/disk/by-id/*

zpool replace rpool /dev/source /dev/target
zpool status  #-> resilver

With proxmox-boot-tool:
get uuid from new disk:

blkid
proxmox-boot-tool format /dev/sdf2
proxmox-boot-tool init /dev/sdf2
proxmox-boot-tool refresh


proxmox-boot-tool status
proxmox-boot-tool clean
proxmox-boot-tool status    #double check
 
Unnecessary. You don't need a resilver if it is the same disk. A scrub would have fixed that.
I will try this the next time it happens. I'm sure I did that, but in another slot. All those cables looked a little mangled, so maybe it was instantly degraded again. The board was replaced last month because of a known production issue.


Does sgdisk --randomize-guids /dev/target always generate completely new GUIDs?
After getting the same ID again, I thought it was generated from the SSD manufacturer's IDs.
I'm sure I got the same ID again using sgdisk -G.
Maybe I didn't even clean it up properly without wipefs, and the system just picked up the old IDs.
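
One way to check this is to compare the two kinds of names that point at the same partition. As far as I know, the names under /dev/disk/by-id are built by udev from the drive's bus, model and serial number (plus a -partN suffix), so they would stay identical when the same drive goes back in, while sgdisk -G only changes the GPT GUIDs, which show up under /dev/disk/by-partuuid. The helper below is a hypothetical sketch, not part of any tool; /dev/sdX3 is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every symlink in directory $1
# that resolves to the same device node as $2.
links_to() {
  dir="$1"
  target="$(readlink -f "$2")"
  for link in "$dir"/*; do
    [ -L "$link" ] || continue
    if [ "$(readlink -f "$link")" = "$target" ]; then
      basename "$link"
    fi
  done
}

# Example calls (placeholder partition, adjust before use):
#   links_to /dev/disk/by-id /dev/sdX3         # model/serial based names
#   links_to /dev/disk/by-partuuid /dev/sdX3   # GPT partition GUID
```

Running the by-partuuid call before and after sgdisk -G should show whether the GUID really changed, even when the by-id name stays the same.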
 
Does sgdisk --randomize-guids /dev/target always generate completely new GUIDs?
It should. But I don't know how it is calculated.
Also, in my notes I have that zpool replace rpool /dev/source /dev/target should even work with the same name. Maybe setting the old device in the vdev offline did the trick.

Maybe I didn't even clean it up properly without wipefs, and the system just picked up the old IDs.
ZFS keeps redundant label information at the start and end of the device; that is a safety feature. On the other hand, ZFS behaves differently on Linux vs. FreeBSD.

So...did it work now?
 