I have a pool, configured as a raidz1, of 5 SSD drives plus 2 spares, and in the most recent scrub two of the main pool SSDs have been marked FAULTED. The spares have automatically taken over, so at the moment the pool still has full redundancy, and I want to replace the faulty drives. However, googling has produced conflicting information on how I should go about it.
zpool status output:
Code:
  pool: tank2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 129G in 0 days 05:37:58 with 0 errors on Thu May 13 00:01:03 2021
config:

        NAME         STATE     READ WRITE CKSUM
        tank2        DEGRADED     0     0     0
          raidz1-0   DEGRADED     0     0     0
            sdf      ONLINE       0     0     0
            spare-1  DEGRADED     0     0     0
              sdg    FAULTED    216     1     0  too many errors
              sdv    ONLINE       0     0     0
            sdh      ONLINE       0     0     0
            sdi      ONLINE       0     2     0
            spare-4  DEGRADED     0     0     0
              sdj    FAULTED    130     1     0  too many errors
              sdu    ONLINE       0     0     0
        spares
          sdu        INUSE     currently in use
          sdv        INUSE     currently in use

errors: No known data errors
I'm aware that we should try to move away from /dev/sdX and instead use the disk identifiers, to protect against drive letters changing if a disk is removed and the host rebooted; however, I'd like to have a clean zpool before attempting that. In the meantime I have updated our drive map spreadsheet (which replicates the physical disk layout and already includes drive serial numbers) with the "Disk identifier" field shown for each disk by fdisk -l, just in case things go wrong and we need to rebuild a pool.
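For what it's worth, my understanding is that the eventual switch to by-id names could be done with an export and re-import once the pool is healthy again, something like the following (untested on my part, so please correct me if this is wrong):
Code:
# Untested - my understanding of how to switch the recorded device names to by-id.
# The pool has to be exported briefly, so not something to do mid-resilver.
zpool export tank2
zpool import -d /dev/disk/by-id tank2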
I have spare identical drives with which to replace /dev/sdg and /dev/sdj, and as all are SATA3 SSDs (hence hot-swappable) I'm not expecting to have to reboot the host, which is a live PVE production environment. All bar one of the VMs are currently on the other pool, tank1, which is based on much larger HDDs.
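Before pulling anything on a live host I plan to double-check which physical drives really are sdg and sdj by matching serial numbers against our drive map, along these lines:
Code:
# Confirm serial numbers before physically removing a drive
lsblk -o NAME,SERIAL,MODEL
smartctl -i /dev/sdg | grep -i serial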
Some sources suggest that all I need to do (using the first one, /dev/sdg, as the example) is the two steps below; my guess at the corresponding commands follows the list:
1. Physically hot-swap the faulted disk for a new one.
2. zpool replace tank2 /dev/sdg
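My reading of that first approach as actual commands is roughly the following (assuming the new disk comes up as /dev/sdg again, which is exactly the part I'd like confirmed):
Code:
# (optionally?) take the faulted disk offline before pulling it
zpool offline tank2 sdg
# ...physically hot-swap the disk, then resilver onto the new one in the same slot
zpool replace tank2 sdg
# watch the resilver; I believe the in-use spare should drop back to AVAIL when it finishes
zpool status tank2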
However, other sources suggest I need to first run zpool detach tank2 /dev/sdg before doing the hot-swap, and/or offline it, and at least one source suggests that partition information needs to be cloned from one of the other disks in the pool.
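For completeness, here's my attempt at turning that second suggestion into commands, which is where I suspect my confusion lies (sdf as the partition-table source is purely illustrative, and I'm not even sure the sgdisk step applies to non-boot disks):
Code:
# take the faulted disk out of the pool before pulling it
zpool offline tank2 sdg    # or, per some sources: zpool detach tank2 sdg
# ...physically hot-swap the disk...
# clone the partition layout from a healthy member and randomise its GUIDs (if this is really needed)
sgdisk /dev/sdf -R /dev/sdg
sgdisk -G /dev/sdg
# then replace as before
zpool replace tank2 sdg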
Can anyone clarify the correct steps to take to replace the faulted drives?
Aside: Interestingly, SMART has also flagged errors for the drive at /dev/sdg, but nothing has been triggered for /dev/sdj, leading me to wonder if I should simply clear that one and scrub again to see whether any faults return?
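If clearing and re-scrubbing is the sensible test for /dev/sdj, I assume it would be something like this (again, just my reading of the man pages):
Code:
# check what the drive itself reports first
smartctl -a /dev/sdj
# if that looks clean, clear the ZFS error counters for the device and scrub again
zpool clear tank2 sdj
zpool scrub tank2
zpool status tank2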