[TUTORIAL] Help with replacing 1 disk in ZFS pool

masteryoda

I ran a ZFS health check and this morning I got this message:

The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.

Code:
impact: Fault tolerance of the pool may be compromised.
    eid: 154
  class: statechange
  state: FAULTED
   host: andromeda
   time: 2024-10-06 23:32:55-0400
  vpath: /dev/disk/by-id/ata-SPCC_Solid_State_Disk_AA230715S301KG05621-part1
  vphys: pci-0000:0b:00.0-sas-phy2-lun-0
  vguid: 0x61568A6815B38368
  devid: ata-SPCC_Solid_State_Disk_AA230715S301KG05621-part1
   pool: local-zfs (0x2D70F8286CA24DF2)

I guess I need to replace the failed disk. Is there a guide I can use to swap the faulty disk out? Will the zpool automatically recognize the new disk, or do I need to run a few commands?
 
Please give me the output of

Code:
zpool status
and
Code:
zpool list
 
Here you go. Thank you for the prompt reply!

Code:
zpool status
  pool: local-zfs
 state: ONLINE
  scan: resilvered 14.7M in 00:00:02 with 0 errors on Tue Oct  8 11:38:58 2024
config:


        NAME                                               STATE     READ WRITE CKSUM
        local-zfs                                          ONLINE       0     0     0
          raidz1-0                                         ONLINE       0     0     0
            ata-SPCC_Solid_State_Disk_AA230715S301KG05744  ONLINE       0     0     0
            ata-SPCC_Solid_State_Disk_AA230715S301KG05622  ONLINE       0     0     0
            ata-SPCC_Solid_State_Disk_AA230715S301KG05621  ONLINE       0     0     0
            ata-SPCC_Solid_State_Disk_AA230715S301KG05795  ONLINE       0     0     0


errors: No known data errors

Code:
zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
local-zfs  3.62T   399G  3.24T        -         -    16%    10%  1.00x    ONLINE  -

All the disks are the same brand and size. The S/N printed on each disk is different from what is shown above. How do I know which one is bad and needs to be replaced?
NVM, figured it out. I used CrystalDiskInfo to match the serial numbers.
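For anyone else looking: the serials can also be read directly on the Proxmox host, no Windows tool needed. A minimal sketch (smartmontools is assumed to be installed for the second command):

Code:
# list every disk with its model and serial number
lsblk -o NAME,MODEL,SERIAL
# or query a single disk; replace sdX with the device in question
smartctl -i /dev/sdX | grep -i serial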

I also figured out how to swap the disk out, thanks to ChatGPT:

Step 1: Identify the Failed Disk
Check the ZFS Pool Status: First, verify the current status of your ZFS pool and identify the failed or degraded disk by running the following command:

Code:
zpool status

Look for lines indicating a degraded or faulted status like this:

Code:
pool: mypool
 state: DEGRADED
 status: One or more devices are faulted in response to persistent errors.
 action: Replace the faulted device.
   NAME        STATE     READ WRITE CKSUM
   mypool      DEGRADED     0     0     0
     raidz2-0  DEGRADED     0     0     0
       sda     FAULTED      0     0     0  too many errors
       sdb     ONLINE       0     0     0
This output shows that sda is faulty and needs to be replaced.
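The alert email above also includes a vguid, which you can cross-check with zpool status -g; the -g flag prints vdev GUIDs instead of device names. Note the email shows the GUID in hex while zpool status -g prints it in decimal, so you may need to convert:

Code:
zpool status -g local-zfs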

Step 2: Find the Stable ID of the Failed Disk
In Proxmox, ZFS pools usually reference disks by their persistent /dev/disk/by-id/ names (built from the model and serial number) rather than kernel device names like /dev/sda, to avoid problems when devices get renamed across reboots.

Find the ID of the failed disk: Use the following command to list the stable IDs of the disks in your system:

Code:
ls -l /dev/disk/by-id/
You'll see a list of symbolic links that map the stable disk IDs to kernel device names:

Code:
ata-WDC_WD10EFRX-68PJCN0_WD-WCC4J1X9V7X0 -> ../../sda
ata-WDC_WD10EFRX-68PJCN0_WD-WCC4J1X9V7X0-part1 -> ../../sda1
Identify which ID corresponds to the failed disk (sda in our example).

Offline the faulty disk: Before physically replacing the disk, take it offline in the ZFS pool:

Code:
zpool offline mypool /dev/disk/by-id/{ID-of-failed-disk}
Replace {ID-of-failed-disk} with the actual ID of the failed disk.
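For the pool in this thread, that would look something like this (pool name and by-id name taken from the alert at the top):

Code:
zpool offline local-zfs ata-SPCC_Solid_State_Disk_AA230715S301KG05621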

Step 3: Physically Replace the Failed Disk
Shut down the server, or make sure the failed disk can be removed safely while it is running (use hot-swap functionality if your hardware supports it).
Replace the failed disk with the new disk.
Boot the server back up if it was shut down.
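If you hot-swapped the disk and it doesn't show up right away, you can rescan the SCSI/SAS bus; a minimal sketch, assuming the controller is host0 (check /sys/class/scsi_host/ for the actual host numbers on your system):

Code:
# trigger a full rescan (channel, target, LUN wildcards)
echo "- - -" > /sys/class/scsi_host/host0/scan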

Step 4: Identify the New Disk
Scan for the new disk: After replacing the disk, check the newly attached disk's ID using the following command:

Code:
ls -l /dev/disk/by-id/
Look for the new disk's ID, which should differ from the old one. The ID typically includes the manufacturer, model, and serial number of the disk, in the style of the ata-WDC_WD10EFRX entries shown above.

Ensure the system recognizes the new disk: You can also verify that the system has detected the new disk by running:

Code:
fdisk -l
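You can also glance at the kernel log to confirm the drive was detected (the exact messages depend on your controller):

Code:
# the newly attached disk should show up near the end of the kernel log
dmesg | tail -n 30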

Step 5: Attach the New Disk to the ZFS Pool
Replace the old disk with the new one in the ZFS pool, using both disks' stable IDs. Run the following command:

Code:
zpool replace mypool /dev/disk/by-id/{ID-of-failed-disk} /dev/disk/by-id/{ID-of-new-disk}
Replace {ID-of-failed-disk} with the ID of the failed disk, and {ID-of-new-disk} with the ID of the newly installed disk.
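For this thread's pool, the command would look something like this; the new disk's serial here is a made-up placeholder, so substitute the real by-id name you found in Step 4:

Code:
zpool replace local-zfs ata-SPCC_Solid_State_Disk_AA230715S301KG05621 ata-SPCC_Solid_State_Disk_NEWSERIAL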

Rebuild the ZFS pool: ZFS will now start resilvering, copying the data onto the new disk. You can monitor the progress by running:

Code:
zpool status
The output should show the pool in a resilvering state:

Code:
pool: mypool
 state: DEGRADED
 status: One or more devices is currently being resilvered.
 action: Wait for the resilver process to complete.
   NAME             STATE     READ WRITE CKSUM
   mypool           DEGRADED     0     0     0
     raidz2-0       DEGRADED     0     0     0
       replacing-0  DEGRADED     0     0     0
         sda        FAULTED      0     0     0  too many errors
         sdc        ONLINE       0     0     0  (resilvering)
       sdb          ONLINE       0     0     0
Wait for resilvering to complete: The resilvering process can take some time, depending on the size of your pool and the performance of your system.
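To keep an eye on progress without retyping the command, you can refresh it automatically with watch (every 30 seconds here):

Code:
watch -n 30 zpool status mypool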

Step 6: Verify the Disk Replacement
After the resilvering is complete, verify that the ZFS pool is now in a healthy state by running:

Code:
zpool status
The output should show the pool state as ONLINE again, with no known data errors.
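There is also a shortcut for this check: the -x flag makes zpool status report only pools that have problems, so a healthy system answers in one line:

Code:
zpool status -x
# expected output once everything is healthy:
# all pools are healthy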

Step 7: Clean Up
Once the resilver completes, zpool replace automatically detaches the old disk from the pool, so no extra removal step is normally needed. If the old device still shows up in zpool status, you can detach it manually:

Code:
zpool detach mypool /dev/disk/by-id/{ID-of-failed-disk}
Now the failed disk has been successfully replaced, and your ZFS pool should be back to normal operation!
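Optionally, reset the pool's error counters so the old faults don't linger in zpool status (a standard command; harmless if there is nothing to clear):

Code:
zpool clear mypool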
 
