ZFS: Correct procedure to replace faulted disks in a zpool with spares?

Pyromancer

Member
I have a pool configured as a raidz1 of 5 SSDs plus 2 spares, and in the most recent scrub two of the main pool SSDs were marked FAULTED. The spares have automatically taken over, so at the moment the pool still has full redundancy, and I want to replace the faulty drives. However, googling has produced conflicting information on how I should go about it.

zpool status output:

Code:
  pool: tank2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 129G in 0 days 05:37:58 with 0 errors on Thu May 13 00:01:03 2021
config:


        NAME         STATE     READ WRITE CKSUM
        tank2        DEGRADED     0     0     0
          raidz1-0   DEGRADED     0     0     0
            sdf      ONLINE       0     0     0
            spare-1  DEGRADED     0     0     0
              sdg    FAULTED    216     1     0  too many errors
              sdv    ONLINE       0     0     0
            sdh      ONLINE       0     0     0
            sdi      ONLINE       0     2     0
            spare-4  DEGRADED     0     0     0
              sdj    FAULTED    130     1     0  too many errors
              sdu    ONLINE       0     0     0
        spares
          sdu        INUSE     currently in use
          sdv        INUSE     currently in use


errors: No known data errors

I'm aware that we should try to move away from /dev/sdX and instead use the disk identifiers, to protect against drive letters changing if a disk is removed and the host rebooted, but I'd like to have a clean zpool before attempting that. In the meantime I have updated our drive map spreadsheet (which replicates the physical disk layout and already includes drive serial numbers) with the "Disk identifier" field shown for each disk by fdisk -l, just in case things go wrong and we need to rebuild a pool.
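For the record, I believe the stable names can be matched up to the current drive letters with something like this (I'm assuming the SATA drives show up with ata-* links under /dev/disk/by-id):

Code:
# map the current kernel names to their persistent by-id names
ls -l /dev/disk/by-id/ | grep -E 'sdg|sdj'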

I have spare identical drives with which to replace /dev/sdg and /dev/sdj, and as all are SATA3 SSDs (hence hot-swappable) I'm not expecting to have to reboot the host. It is a live PVE production environment, although all bar one of the VMs are currently on the other pool, tank1, which is based on much larger HDDs.

Some sources suggest that all I need to do (using the first one, /dev/sdg, for example) is:

1. Physically hot-swap the faulted disk for a new one.
2. zpool replace tank2 /dev/sdg

However, other sources suggest I first need to zpool detach tank2 /dev/sdg before doing the hot-swap, and/or offline it, and at least one source suggests that partition information needs to be cloned from one of the other disks in the pool.
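If it helps, my understanding of the two suggested sequences is roughly as follows (purely a sketch; /dev/sdX is just a placeholder for whatever name the new drive gets):

Code:
# Variant A: hot-swap the disk in the same slot, then resilver onto the new one
zpool replace tank2 sdg

# Variant B: promote the spare first, then add the new disk back as a spare
zpool detach tank2 sdg
zpool add tank2 spare /dev/sdX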

Can anyone clarify the correct steps to take to replace the faulted drives?

Aside: Interestingly, SMART has also flagged the drive at /dev/sdg, but it hasn't triggered for /dev/sdj, leaving me wondering whether I should simply clear that one, scrub again, and see if any faults return?
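If clearing is a sensible thing to try for sdj, I assume it would just be something along these lines:

Code:
zpool clear tank2 sdj    # reset the error counters for that device
zpool scrub tank2        # re-read everything and see whether the faults come back
zpool status tank2       # watch the scrub progress and the error counts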
 
I don't think that ZFS likes it when you move disks around without telling it that something has changed. I think that a zpool detach of the broken drive would work, as the pool is already using a spare. I have not had this situation before, so maybe wait for a more experienced person to post. Physically replace the drives only once they are no longer part of the pool.
I have had disks fail partially, and also completely, without SMART complaining. A few errors can happen (on power loss or a bad sector), like your sdi? With several tens of faults, I would physically reconnect or replace the power and data cables and run a SMART short and long self-test on the drive.
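Something like this, assuming smartmontools is installed (adjust the device name, and run the tests one at a time):

Code:
smartctl -t short /dev/sdg   # quick self-test, usually finishes within a few minutes
smartctl -t long /dev/sdg    # extended self-test, can take considerably longer
smartctl -a /dev/sdg         # review the self-test log and SMART attributes afterwards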
 

That makes sense: detach so the spare takes over fully, swap the drives, then re-add the new ones as spares.
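So for the first drive the plan would be something like this (the device name for the new drive is just a placeholder until I see what it enumerates as):

Code:
zpool detach tank2 sdg            # sdv leaves the spares list and becomes a permanent raidz1-0 member
# physically hot-swap the faulted SSD, then check dmesg / lsblk for the new device name
zpool add tank2 spare /dev/sdX    # add the replacement drive back into the pool as a spare
zpool status tank2                # confirm the vdev is healthy and the spare shows AVAIL again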

In this case there are no cables to reseat; it's a 16-drive backplane, so that should be OK. I'll bring the replaced drives back and bench-test them on a different system.

Thanks for the reply, I was getting hopelessly confused by the conflicting answers when googling this. Ta for the ServerFault link too.
 
