[TUTORIAL] Help with replacing 1 disk in ZFS pool

masteryoda

Active Member
Jun 28, 2020
I ran a ZFS health check and this morning I got this message:

The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.

Code:
impact: Fault tolerance of the pool may be compromised.
    eid: 154
  class: statechange
  state: FAULTED
   host: andromeda
   time: 2024-10-06 23:32:55-0400
  vpath: /dev/disk/by-id/ata-SPCC_Solid_State_Disk_AA230715S301KG05621-part1
  vphys: pci-0000:0b:00.0-sas-phy2-lun-0
  vguid: 0x61568A6815B38368
  devid: ata-SPCC_Solid_State_Disk_AA230715S301KG05621-part1
   pool: local-zfs (0x2D70F8286CA24DF2)

I guess I need to replace the failed disk. Is there a guide I can use to swap the faulty disk out? Will ZFS automatically recognize the new disk, or do I need to run a few commands?
 
Please give me the output of

Code:
zpool status
and
Code:
zpool list
 
Here you go, thank you for the prompt reply

Code:
zpool status
  pool: local-zfs
 state: ONLINE
  scan: resilvered 14.7M in 00:00:02 with 0 errors on Tue Oct  8 11:38:58 2024
config:


        NAME                                               STATE     READ WRITE CKSUM
        local-zfs                                          ONLINE       0     0     0
          raidz1-0                                         ONLINE       0     0     0
            ata-SPCC_Solid_State_Disk_AA230715S301KG05744  ONLINE       0     0     0
            ata-SPCC_Solid_State_Disk_AA230715S301KG05622  ONLINE       0     0     0
            ata-SPCC_Solid_State_Disk_AA230715S301KG05621  ONLINE       0     0     0
            ata-SPCC_Solid_State_Disk_AA230715S301KG05795  ONLINE       0     0     0


errors: No known data errors

Code:
zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
local-zfs  3.62T   399G  3.24T        -         -    16%    10%  1.00x    ONLINE  -

All the disks are the same brand and size. The S/N printed on the disk is different from what is shown above. How do I know which one is bad and needs to be replaced?
NVM, figured it out. I used CrystalDiskInfo to find that out.
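(For anyone else who hits this: the firmware-reported serial numbers can also be read directly on the Proxmox host, e.g. with lsblk, so each physical disk can be matched to the by-id name shown in zpool status:)

Code:
lsblk -o NAME,MODEL,SERIAL,SIZE
The SERIAL column matches the serial embedded in the /dev/disk/by-id/ name, even when the sticker on the drive shows something different.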

I also figured out how to swap the disk out, thanks to ChatGPT:

Step 1: Identify the Failed Disk
Check the ZFS Pool Status: First, verify the current status of your ZFS pool and identify the failed or degraded disk by running the following command:

Code:
zpool status

Look for lines indicating a degraded or faulted status like this:

Code:
  pool: mypool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     FAULTED      0     0     0  too many errors
            sdb     ONLINE       0     0     0
This output shows that sda is faulty and needs to be replaced.
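If the host has several pools or many disks, a quicker first check is zpool status -x, which only reports pools that currently have problems (a healthy system just prints "all pools are healthy"):

Code:
zpool status -x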

Step 2: Find the Stable Disk ID of the Failed Disk
In Proxmox, ZFS pool members are usually referenced by their stable IDs under /dev/disk/by-id/ (which encode the model and serial number) rather than by device names like /dev/sda, to avoid problems when device names change between reboots.

Find the ID of the failed disk: Use the following command to list the disk IDs of the disks in your ZFS pool:

Code:
ls -l /dev/disk/by-id/
You'll see a list of symbolic links that map the stable disk IDs to device names:

Code:
ata-WDC_WD10EFRX-68PJCN0_WD-WCC4J1X9V7X0 -> ../../sda
ata-WDC_WD10EFRX-68PJCN0_WD-WCC4J1X9V7X0-part1 -> ../../sda1
Identify which ID corresponds to the failed disk (sda in our example).
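In this thread's case the fault notification already contains the serial (AA230715S301KG05621), so the matching by-id link can be found with a quick grep, for example:

Code:
ls -l /dev/disk/by-id/ | grep AA230715S301KG05621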

Detach the Faulty Disk: Before physically replacing the disk, offline it in the ZFS pool:

Code:
zpool offline mypool /dev/disk/by-id/{ID-of-failed-disk}
Replace {ID-of-failed-disk} with the actual ID of the failed disk.
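For the pool in this thread, using the faulted SPCC disk from the notification above, that would look something like:

Code:
zpool offline local-zfs /dev/disk/by-id/ata-SPCC_Solid_State_Disk_AA230715S301KG05621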

Step 3: Physically Replace the Failed Disk
Shut down the server, or make sure the failed disk can be removed safely while the system is running (use hot-swap functionality if your hardware supports it).
Replace the failed disk with the new disk.
Boot the server back up if it was shut down.

Step 4: Identify the New Disk
Scan for the new disk: After replacing the disk, check the newly attached disk's ID using the following command:

Code:
ls -l /dev/disk/by-id/
Look for the new disk's identifier, which should differ from the old one. It will typically include the manufacturer, model, and serial number of the disk, such as ata-WDC_WD10EFRX.

Ensure the system recognizes the new disk: You can also verify that the system has detected the new disk by running:

Code:
fdisk -l
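If the disk was hot-swapped, recent kernel messages can also confirm that the new drive was detected, for example:

Code:
dmesg | grep -i -E 'ata|sd[a-z]' | tail -n 20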

Step 5: Attach the New Disk to the ZFS Pool
Replace the old disk with the new one in the ZFS pool using the new disk's ID. Run the following command:

Code:
zpool replace mypool /dev/disk/by-id/{ID-of-failed-disk} /dev/disk/by-id/{ID-of-new-disk}
Replace {ID-of-failed-disk} with the ID of the failed disk, and {ID-of-new-disk} with the ID of the newly installed disk.
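Using this thread's pool as an illustration (the new disk's ID below is only a placeholder, since it depends on the replacement drive's serial number):

Code:
# NEWSERIAL is a placeholder; use the actual by-id name of the replacement disk
zpool replace local-zfs ata-SPCC_Solid_State_Disk_AA230715S301KG05621 /dev/disk/by-id/ata-SPCC_Solid_State_Disk_NEWSERIAL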

Resilver the ZFS pool: ZFS will now start resilvering, copying the data that belongs on the replaced disk onto the new one. You can monitor the progress by running:

Code:
zpool status
The output should show the pool in a resilvering state:

Code:
  pool: mypool
 state: DEGRADED
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
config:

        NAME             STATE     READ WRITE CKSUM
        mypool           DEGRADED     0     0     0
          raidz2-0       DEGRADED     0     0     0
            replacing-0  DEGRADED     0     0     0
              sda        FAULTED      0     0     0  too many errors
              sdc        ONLINE       0     0     0  (resilvering)
            sdb          ONLINE       0     0     0
Wait for resilvering to complete: The resilvering process can take some time, depending on the size of your pool and the performance of your system.
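To keep an eye on the resilver without re-running the command by hand, you can refresh the status every few seconds, for example:

Code:
watch -n 5 zpool status mypool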

Step 6: Verify the Disk Replacement
After the resilvering is complete, verify that the ZFS pool is now in a healthy state by running:

Code:
zpool status
The output should show the pool as ONLINE:
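For example, continuing the hypothetical mypool layout used above, roughly:

Code:
  pool: mypool
 state: ONLINE
  scan: resilvered ... with 0 errors
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdb     ONLINE       0     0     0

errors: No known data errors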

Step 7: Clean Up
Once the resilver is complete, ZFS automatically detaches the old disk from the pool, so no extra removal step is normally needed. If the old device still appears under a replacing vdev in zpool status, you can detach it manually:

Code:
zpool detach mypool /dev/disk/by-id/{ID-of-failed-disk}
Now the failed disk has been replaced, and your ZFS pool should be back to normal operation!
 