Replacing a faulty HDD that is part of a ZFS data volume

edwin2024

Member
Mar 12, 2024

I have used VMware for years, but since the takeover by Broadcom I am finding my way over to Proxmox.
The first step is running an older HPE ML30 Gen10 (non-hot-swap) for private use.
The ML30 is built on:
64 GB internal memory
1x NVMe (HPE VR000480KXLX) 480 GB for PVE with periodic backup
2x SSD (Seagate_IronWolf_ZA2000NM10002) in a ZFS volume for the VMs' OS
2x HDD (ST6000NT001-3M1101) in a ZFS volume only for data use
Running one Linux LXC and three Windows VMs

PVE
prompt:~# pveversion
pve-manager/9.1.6/71482d1833ded40a (running kernel: 6.17.13-1-pve)
Running on the NVME

ZFS
Memory capped with
options zfs zfs_arc_max=8589934592
No minimum set
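For reference, 8589934592 is simply 8 GiB written out in bytes. A minimal sketch of the arithmetic follows. The helper name arc_max_line is made up for illustration, and the file such a line normally lives in (commonly /etc/modprobe.d/zfs.conf) is an assumption, since it is not named above.

```shell
# Hypothetical helper: print the modprobe option line for an ARC cap given in GiB.
arc_max_line() {
    gib=$1
    echo "options zfs zfs_arc_max=$((gib * 1024 * 1024 * 1024))"
}

arc_max_line 8   # prints: options zfs zfs_arc_max=8589934592
```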

One of the HDDs (ST6000NT001) is definitely failing, with 2215 reallocated sectors and 212 pending sectors, and needs to be replaced.
I ordered a new HDD and am waiting for delivery. But then, how do I replace the disk (Linux newbie)?
I searched the forum and the internet in general and found bits and pieces, so I asked Claude for advice (I know, never trust AI).

Claude produced the following manual based on the information I sent. I would sincerely appreciate it if someone would check it.
Thanks in advance, edwin.

1. Verify Current Pool and Backup Status
Confirm pool health and check the last backup timestamp:
zpool status
pvesm status
Expected: VMDATA-HDD-ZFS shows ONLINE with both mirror drives present and errors: No known data errors.
!! Perform and verify a backup of PVE, the LXC, and the VMs before continuing.
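The health check in this step could also be scripted, roughly like this. It is only a sketch: pool_is_online is a made-up helper name, and the zpool call is skipped when the tool is not installed.

```shell
# Hypothetical helper: succeeds only when the reported health is ONLINE.
pool_is_online() {
    [ "$1" = "ONLINE" ]
}

POOL="VMDATA-HDD-ZFS"   # pool name as used in this post
if command -v zpool >/dev/null 2>&1; then
    # -H drops the header line, -o health prints only the health column
    health=$(zpool list -H -o health "$POOL")
    if pool_is_online "$health"; then
        echo "$POOL is ONLINE"
    else
        echo "$POOL reports: $health" >&2
    fi
fi
```

This only looks at the pool state. It says nothing about whether the backup is recent or restorable, so that still needs a manual check.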

2. Gracefully Shut Down All VMs and Containers
The ML30 Gen10 does not support hot-swap. All VMs and containers must be stopped before powering down the server.
# List all running VMs
qm list
# Shutdown each running VM (repeat for each VMID)
qm shutdown <VMID>
# List all running containers
pct list
# Shutdown each running container (repeat for each CTID)
pct shutdown <CTID>
Wait until all VMs and containers report stopped before proceeding.
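Instead of repeating the shutdown for every ID by hand, the two lists could be looped over. A sketch, assuming the usual column layout where the status is the third column of qm list and the second column of pct list. The helper names running_vmids and running_ctids are made up.

```shell
# Hypothetical helpers: read "qm list" / "pct list" output on stdin and
# print the IDs of guests whose status column says "running".
running_vmids() { awk 'NR > 1 && $3 == "running" { print $1 }'; }
running_ctids() { awk 'NR > 1 && $2 == "running" { print $1 }'; }

# Only runs on a host that actually has the Proxmox tools installed.
if command -v qm >/dev/null 2>&1; then
    for id in $(qm list | running_vmids); do
        qm shutdown "$id"
    done
fi
if command -v pct >/dev/null 2>&1; then
    for id in $(pct list | running_ctids); do
        pct shutdown "$id"
    done
fi
```

Afterwards, qm list and pct list should show every guest as stopped before moving on.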

3. Power Down the Server
Once all VMs and containers are stopped, shut down the server:
shutdown -h now
Wait for the server to fully power off before opening the chassis.
!! DO NOT pull the drive while the server is running — the ML30 Gen10 does not support hot-swap and doing so risks data loss or hardware damage. !!

4. Physically Replace the Drive
With the server powered off:
1. Locate sdc — the Seagate ST6000NT001 with S/N xxxxxx
2. Remove the failing drive from its bay
3. Insert the new 6TB (or larger) replacement drive
4. Power the server back on

5. Identify the New Drive
After the server has booted, identify the device name assigned to the new drive:
lsblk -o NAME,SIZE,TYPE,ROTA,TRAN,MODEL
The new drive appears as a new sdX with no partitions. Use this device name in the commands below
(example: /dev/sde).
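Since sdX names can change between boots, it may help to look up the stable name of the new drive as well. The pool in step 7 already uses such a name (ata-ST6000NT001-...). A sketch that lists the /dev/disk/by-id links pointing at a given device. The helper name by_id_links is made up, and /dev/sde is only the placeholder from above.

```shell
# Hypothetical helper: list the symlinks in a directory (default
# /dev/disk/by-id) that resolve to the given device node.
by_id_links() {
    dev=$(readlink -f "$1")
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -e "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            echo "$link"
        fi
    done
}

# Example call, to be adjusted to the real device name:
# by_id_links /dev/sde
```

One of the printed paths could then be used in place of /dev/sdX in the later steps.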

6. Wipe the New Drive
Clear (just to make sure) any old partition tables or ZFS labels to prevent import conflicts:
wipefs -a /dev/sdX
Replace sdX with your actual new drive device name.
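Because wipefs -a is destructive, a quick sanity check that the target really is the blank drive seems worthwhile. A sketch: count_partitions is a made-up helper, and NEW_DISK is a variable that has to be set by hand, so nothing runs by accident.

```shell
# Hypothetical helper: count "part" rows in "lsblk -n -o TYPE <disk>" output.
count_partitions() { awk '$1 == "part" { n++ } END { print n + 0 }'; }

# NEW_DISK is deliberately unset here; the check only runs once it is set,
# e.g. NEW_DISK=/dev/sdX with the real device name.
if [ -n "${NEW_DISK:-}" ]; then
    parts=$(lsblk -n -o TYPE "$NEW_DISK" | count_partitions)
    if [ "$parts" -eq 0 ]; then
        echo "$NEW_DISK has no partitions, looks like the blank drive"
    else
        echo "$NEW_DISK has $parts partition(s), double-check before wiping" >&2
    fi
fi
```

The sketch only reports what it finds. It does not call wipefs itself.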

7. Replace the Failing Drive in ZFS
Issue the ZFS replace command using the exact serial-based identifier from your pool:
zpool replace VMDATA-HDD-ZFS ata-ST6000NT001-3M1101_WX00WVSL /dev/sdX
ZFS immediately begins resilvering — copying all data from the healthy sda to the new drive. The pool stays fully online and accessible during this process.
!! If sdc shows as UNAVAIL or REMOVED in zpool status, use this alternative command instead: !!
zpool replace VMDATA-HDD-ZFS sdc /dev/sdX

8. Monitor Resilver Progress
Watch the resilver in real time:
watch -n 5 zpool status
With ~3.9 TB of data to copy, expect 10–14 hours. Do NOT power off the server during resilvering — doing so will restart it from scratch.
!! On success, the scan line will read something like: resilvered X.XG in HH:MM:SS with 0 errors
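The same check can be done without staring at watch. A sketch that tests the scan line once: resilver_finished is a made-up helper, and the pattern assumes the scan line wording quoted above.

```shell
# Hypothetical helper: read "zpool status" output on stdin and succeed once
# the scan line reports a finished resilver with 0 errors.
resilver_finished() { grep -Eq 'scan: resilvered .* with 0 errors'; }

POOL="VMDATA-HDD-ZFS"   # pool name as used in this post
if command -v zpool >/dev/null 2>&1; then
    if zpool status "$POOL" | resilver_finished; then
        echo "resilver on $POOL finished with 0 errors"
    else
        echo "resilver on $POOL not finished yet, or finished with errors"
    fi
fi
```

A negative result does not distinguish between "still running" and "finished with errors", so the full zpool status output is still the place to look in that case.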

9. Run a Full Scrub After Resilver Completes
Once resilvering is complete, run a scrub to verify full data integrity across both drives:
zpool scrub VMDATA-HDD-ZFS
Monitor scrub progress:
watch -n 10 zpool status
A clean scrub with 0 errors confirms your mirror is fully healthy and protected again.
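To pull just the result out of the status output, something like the following could work. It is a sketch: scrub_summary is a made-up helper, and the pattern assumes a scan line of the form "scrub repaired <amount> in <time> with <n> errors".

```shell
# Hypothetical helper: turn the scrub scan line from "zpool status" (stdin)
# into "repaired=<amount> errors=<n>". Prints nothing if no such line exists.
scrub_summary() {
    sed -n 's/.*scan: scrub repaired \([^ ]*\) in .* with \([0-9][0-9]*\) errors.*/repaired=\1 errors=\2/p'
}

POOL="VMDATA-HDD-ZFS"   # pool name as used in this post
if command -v zpool >/dev/null 2>&1; then
    zpool status "$POOL" | scrub_summary
fi
```

An empty result means the scrub is still running or has not been started.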

10. Optional: Upgrade Pool Features ( !! NOT SURE IF I SHOULD DO THIS !! )
The pool reports some supported features are not yet enabled. After a successful scrub you may optionally upgrade:
zpool upgrade VMDATA-HDD-ZFS
Note: After upgrading, the pool may not be importable on older ZFS versions. Only do this if you will not need to import it on older systems.
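Before deciding, it might help to see which features the upgrade would actually turn on. A sketch that filters the pool properties for disabled feature flags. The helper name disabled_features is made up.

```shell
# Hypothetical helper: read tab-separated "property<TAB>value" lines on stdin
# and print the feature flags whose value is "disabled".
disabled_features() {
    awk -F '\t' '$1 ~ /^feature@/ && $2 == "disabled" { print $1 }'
}

POOL="VMDATA-HDD-ZFS"   # pool name as used in this post
if command -v zpool >/dev/null 2>&1; then
    # -H gives script-friendly, tab-separated output without a header
    zpool get -H -o property,value all "$POOL" | disabled_features
fi
```

This only lists the flags and changes nothing on the pool.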
 
First, this seems to be for a boot Device.
Well, it includes the additional (possibly important!) steps for a boot device. Of course you don't need to execute those for a pure data pool ;-)

Second. I've got the feeling the Guide is missing step perhaps assuming everyone knows
There are just far too many optional steps to include all of them in the official reference documentation. For example, you want to "6. Wipe the New Drive", which is optional (and not required for new drives) in the same way as the boot aspect, which is obviously not important for you.

Third: I just wanted to make sure you know that documentation as you did not mention if you already found/read it.

Pro tip: it is a mirror, right? I really prefer to add the new drive as a third device to the mirror. If you have a supported free slot, place the new drive there. If not, there are USB adapters that allow this for temporary maintenance. This increases redundancy during the resilvering. If you remove the problematic drive at the beginning of the task ("4."), you are "degraded" immediately and can only hope that the one remaining half of the original mirror can be read completely without error. Only after the rebuild has finished successfully do I remove the problematic device...
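In commands, the attach-first approach would look roughly like the output of this sketch. It only prints the commands and does not run them. plan_mirror_swap is a made-up helper, and the three disk identifiers are placeholders to be replaced with the real names from zpool status and /dev/disk/by-id.

```shell
# Hypothetical helper: print (not run) the attach/detach commands for the
# "third mirror member" approach.
plan_mirror_swap() {
    pool=$1; healthy=$2; failing=$3; new=$4
    echo "zpool attach $pool $healthy $new"
    echo "# wait for the resilver to finish with 0 errors, then:"
    echo "zpool detach $pool $failing"
}

# Placeholder identifiers, not real device names:
plan_mirror_swap VMDATA-HDD-ZFS HEALTHY_DISK_ID FAILING_DISK_ID /dev/disk/by-id/NEW_DISK_ID
```

The attach turns the two-way mirror into a three-way mirror, and the detach shrinks it back once the new drive holds a full copy.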

Good luck :-)
 
If you're asking whether the Claude-provided steps are correct: they are, but I would make an adjustment to step 7. Specify the destination drive by its UUID/blkid, whichever matches how you originally defined your pool, not by drive letter.

I didn't actually see a question asked.