Replacement faulty HDD which is part of a ZFS data volume

edwin2024 · Jun 8, 2026


For years I have been using VMware, but since the takeover by Broadcom I'm finding my way over to Proxmox.
First step is running an older HPE ML30 Gen 10 (non-hot-swap) for private use.
The ML30 is built with:
64 GB internal memory
1x NVMe (HPE VR000480KXLX) 480 GB for PVE with periodic backup
2x SSD (Seagate_IronWolf_ZA2000NM10002) in a ZFS volume for the VMs' OS
2x HDD (ST6000NT001-3M1101) in a ZFS volume for data use only
Running one Linux LXC and 3 Windows VMs

PVE
prompt:~# pveversion
pve-manager/9.1.6/71482d1833ded40a (running kernel: 6.17.13-1-pve)
Running on the NVME

ZFS
Memory capped with
options zfs zfs_arc_max=8589934592
No minimum set
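
For reference, 8589934592 bytes is exactly 8 GiB. A minimal sketch of how such a line could be derived; the helper names are hypothetical, and /etc/modprobe.d/zfs.conf is only assumed to be where the option lives, since the post does not say:

```shell
# Convert a size in GiB to the byte value that zfs_arc_max expects.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the module option line for an ARC cap given in GiB.
# (Assumed destination: /etc/modprobe.d/zfs.conf)
arc_max_line() {
    echo "options zfs zfs_arc_max=$(gib_to_bytes "$1")"
}

arc_max_line 8
```

On a running host the active value can usually be read from /sys/module/zfs/parameters/zfs_arc_max.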

One of the HDDs (ST6000NT001) is definitely failing, with 2215 reallocated sectors and 212 pending sectors, and needs to be replaced.
I ordered a new HDD and am waiting for delivery. But then, how do I replace the disk (Linux newbie)?
I searched the forum and the internet in general and found bits and pieces, so I asked Claude for advice (I know, never trust AI).

Claude produced the following manual based on the information I sent. I would sincerely appreciate it if someone would check it.
Thanks in advance, edwin.

1. Verify Current Pool and Backup Status
Confirm pool health and check the last backup timestamp:
zpool status
pvesm status
Expected: VMDATA-HDD-ZFS shows ONLINE with both mirror drives present and errors: No known data errors.
!! Perform, and verify, a backup of PVE, the LXC and the VMs before continuing
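
The "Expected" check in this step could be scripted roughly as follows. This is only a sketch: pool_is_healthy is a hypothetical helper that reads zpool status text on stdin, and it assumes the output of a single pool is piped in.

```shell
# Succeeds only if the `zpool status` text on stdin reports an ONLINE pool
# with no known data errors. Feed it the status of ONE pool at a time.
pool_is_healthy() {
    status=$(cat)
    printf '%s\n' "$status" | grep -q 'state: ONLINE' &&
        printf '%s\n' "$status" | grep -q 'errors: No known data errors'
}
```

Possible use on the host: zpool status VMDATA-HDD-ZFS | pool_is_healthy && echo "pool looks healthy"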

2. Gracefully Shut Down All VMs and Containers
The ML30 Gen10 does not support hot-swap. All VMs and containers must be stopped before powering down the server.
# List all running VMs
qm list
# Shutdown each running VM (repeat for each VMID)
qm shutdown <VMID>
# List all running containers
pct list
# Shutdown each running container (repeat for each CTID)
pct shutdown <CTID>
Wait until all VMs and containers report stopped before proceeding.
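
Instead of repeating the shutdown per ID, the list output could be filtered in a loop. A rough sketch with hypothetical helper names; it assumes the usual column layout (STATUS in the third column of qm list, Status in the second column of pct list), which is worth checking on your own host first:

```shell
# Print the IDs of running VMs from `qm list` output on stdin.
running_vms() {
    awk 'NR > 1 && $3 == "running" { print $1 }'
}

# Print the IDs of running containers from `pct list` output on stdin.
running_cts() {
    awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Ask every running guest to shut down cleanly (defined here, not called).
shutdown_all_guests() {
    qm list | running_vms | while read -r id; do
        qm shutdown "$id" </dev/null
    done
    pct list | running_cts | while read -r id; do
        pct shutdown "$id" </dev/null
    done
}
```

Calling shutdown_all_guests on the host runs both loops; qm list and pct list afterwards should show everything as stopped.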

3. Power Down the Server
Once all VMs and containers are stopped, shut down the server:
shutdown -h now
Wait for the server to fully power off before opening the chassis.
!! DO NOT pull the drive while the server is running — the ML30 Gen10 does not support hot-swap and doing so risks data loss or hardware damage. !!

4. Physically Replace the Drive
With the server powered off:
1. Locate sdc — the Seagate ST6000NT001 with S/N xxxxxx
2. Remove the failing drive from its bay
3. Insert the new 6TB (or larger) replacement drive
4. Power the server back on

5. Identify the New Drive
After the server has booted, identify the device name assigned to the new drive:
lsblk -o NAME,SIZE,TYPE,ROTA,TRAN,MODEL
The new drive appears as a new sdX with no partitions. Use this device name in the commands below
(example: /dev/sde).

6. Wipe the New Drive
Clear (just to make sure) any old partition tables or ZFS labels to prevent import conflicts:
wipefs -a /dev/sdX
Replace sdX with your actual new drive device name.

7. Replace the Failing Drive in ZFS
Issue the ZFS replace command using the exact serial-based identifier from your pool:
zpool replace VMDATA-HDD-ZFS ata-ST6000NT001-3M1101_WX00WVSL /dev/sdX
ZFS immediately begins resilvering — copying all data from the healthy sda to the new drive. The pool stays fully online and accessible during this process.
!! If sdc shows as UNAVAIL or REMOVED in zpool status, use this alternative command instead: !!
zpool replace VMDATA-HDD-ZFS sdc /dev/sdX

8. Monitor Resilver Progress
Watch the resilver in real time:
watch -n 5 zpool status
With ~3.9 TB of data to copy, expect 10–14 hours. Do NOT power off the server during resilvering — doing so will restart it from scratch.
!! Resilvering should end with something like this: SUCCESS: The scan line will read: resilvered X.XG in HH:MM:SS with 0 errors

9. Run a Full Scrub After Resilver Completes
Once resilvering is complete, run a scrub to verify full data integrity across both drives:
zpool scrub VMDATA-HDD-ZFS
Monitor scrub progress:
watch -n 10 zpool status
A clean scrub with 0 errors confirms your mirror is fully healthy and protected again.
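
The "0 errors" check for both the resilver and the scrub could be pulled out of the scan line like this. A sketch only: scan_error_count is a hypothetical helper, and it assumes the scan line ends in "with N errors on <date>" as in the example above.

```shell
# Print the error count from the "scan:" line of `zpool status` text on stdin.
# Prints nothing while a resilver or scrub is still in progress.
scan_error_count() {
    sed -n 's/.*scan:.* with \([0-9][0-9]*\) errors.*/\1/p'
}
```

Possible use on the host: zpool status VMDATA-HDD-ZFS | scan_error_count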

10. Optional: Upgrade Pool Features ( !! NOT SURE IF I SHOULD DO THIS !! )
The pool reports some supported features are not yet enabled. After a successful scrub you may optionally upgrade:
zpool upgrade VMDATA-HDD-ZFS
Note: After upgrading, the pool may not be importable on older ZFS versions. Only do this if you will not need to import it on older systems.

UdoB · Jun 8, 2026

Official documentation: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_zfs_change_failed_dev

edwin2024 · Jun 8, 2026

UdoB said:
Official documentation: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_zfs_change_failed_dev

First, this seems to be for a boot device. Mine isn't one.
Second, I have the feeling the guide is missing steps, perhaps assuming everyone knows them. I don't.

UdoB · Jun 8, 2026

edwin2024 said:
First, this seems to be for a boot Device.

Well, it includes the additional (possibly important!) steps for a boot device. Of course you don't need to execute those for a pure data pool ;-)

edwin2024 said:
Second, I have the feeling the guide is missing steps, perhaps assuming everyone knows them.

There are simply far too many optional steps to include all of them in the official reference documentation. For example, you want to "6. Wipe the New Drive", which is optional (and not required for new drives) in the same way as the boot aspect, which is obviously not important for you.

Third: I just wanted to make sure you know that documentation as you did not mention if you already found/read it.

Pro tip: it is a mirror, right? I really prefer to add the new drive as a third device to the mirror. If you have a supported free slot place the new drive there. If not: there are USB-adapters allowing to do so - for temporary maintenance. This allows to increase redundancy during the resilvering. If you remove the problematic drive at the beginning of the task ("4.") you are "degraded" immediately and only hoping that the only active half of the original mirror can be read completely without error. Only after that rebuild has successfully finished I remove the problematic device...

Good luck

alexskysilk · Jun 8, 2026

If you're asking whether the Claude-provided steps are correct: they are, but I would make an adjustment to step 7. Specify the destination drive by its UUID/blkid, whichever is relevant to how you defined your pool originally, not by drive letter.

I didn't actually see a question asked.
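
To act on the by-id suggestion, the stable name of the new drive has to be looked up first. A minimal sketch, assuming the pool was built from /dev/disk/by-id names; byid_for_dev is a hypothetical helper:

```shell
# Print every by-id link that resolves to the given whole-disk device.
# $1 = device node (e.g. /dev/sdX), $2 = directory to search
# (defaults to /dev/disk/by-id).
byid_for_dev() {
    dev=$(readlink -f "$1")
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            echo "$link"
        fi
    done
}
```

On the host, byid_for_dev /dev/sdX would typically list an ata-... and a wwn-... link for the disk. One of those paths can then be given to zpool replace as the new device in place of /dev/sdX.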

Onslow · Jun 8, 2026

UdoB said:
Pro tip: it is a mirror, right? I really prefer to add the new drive as a third device to the mirror. [...] This allows to increase redundancy during the resilvering.

This is a smart move!

Onslow · Jun 8, 2026

edwin2024 said:
4. Power the server back on

After this point I would add this:

If possible, stop the VMs and LXCs.

Or before powering down the server disable autostart of VMs and LXCs.

To reduce the load of the host and the risk.
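
Autostart is the "onboot" option of each guest. A small sketch that only prints the commands so they can be reviewed first; onboot_off_cmds is a hypothetical helper and the IDs are made-up examples:

```shell
# Print (not run) the commands that turn off start-at-boot.
# $1 = "qm" for VMs or "pct" for containers, remaining args = guest IDs.
onboot_off_cmds() {
    tool=$1
    shift
    for id in "$@"; do
        echo "$tool set $id --onboot 0"
    done
}

onboot_off_cmds qm 101 102 103   # example VM IDs
onboot_off_cmds pct 200          # example container ID
```

Piping the output to sh executes the commands; setting --onboot 1 afterwards restores autostart once the resilver is done.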

edwin2024 · Jun 8, 2026

Onslow said:
After this point I would add this:

If possible, stop the VMs and LXCs.

Or before powering down the server disable autostart of VMs and LXCs.

To reduce the load of the host and the risk.

Disabling autostart is a good one.

edwin2024 · Jun 8, 2026

alexskysilk said:
If you're asking whether the Claude-provided steps are correct: they are, but I would make an adjustment to step 7. Specify the destination drive by its UUID/blkid, whichever is relevant to how you defined your pool originally, not by drive letter.

I didn't actually see a question asked.

Will look into that, thanks.

edwin2024 · Jun 8, 2026

UdoB said:
Well, it includes the additional (possibly important!) steps for a boot device. Of course you don't need to execute those for a pure data pool ;-)

There are simply far too many optional steps to include all of them in the official reference documentation. For example, you want to "6. Wipe the New Drive", which is optional (and not required for new drives) in the same way as the boot aspect, which is obviously not important for you.

Third: I just wanted to make sure you know that documentation as you did not mention if you already found/read it.

Pro tip: it is a mirror, right? I really prefer to add the new drive as a third device to the mirror. If you have a supported free slot place the new drive there. If not: there are USB-adapters allowing to do so - for temporary maintenance. This allows to increase redundancy during the resilvering. If you remove the problematic drive at the beginning of the task ("4.") you are "degraded" immediately and only hoping that the only active half of the original mirror can be read completely without error. Only after that rebuild has successfully finished I remove the problematic device...

Good luck

For step 6, you're right. A new drive should be clean.
For the documentation: yes, I found that, but it didn't look complete to me. Hence the summary and the request to validate the steps in it.
For the 3rd drive: I'm not really seeing a 3rd drive in a mirror. Besides that, before I start I will make a full backup of every VM etc. Second, there is no bay left to put a drive in, so I would have to follow a slow USB route. But I will look into this, because I would never have guessed you could add a 3rd drive to a redundant ZFS volume.

UdoB · Jun 9, 2026

edwin2024 said:
I'm not really seeing a 3rd drive in a mirror.

It would be a temporary extension. Basically it is exactly the same what "replace" does if the old drive is still connected, just more "manually".

Code:

$ man zpool-replace

DESCRIPTION
     Replaces device with new-device.  This is equivalent to attaching new-device, waiting for it to resilver, and then detaching device.

The point is: do not remove the problematic drive at the beginning of this maintenance task, but at the end - when the new drive already contains all relevant data.

Your problematic drive is not completely dead, right? During resilvering the new drive it can deliver valid data if the other drive gets into trouble. Yes, this is not really probable, but it is a "free" feature.
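
The attach, wait, detach sequence described above could look roughly like this. It is a sketch that only prints the commands; mirror_swap_cmds is a hypothetical helper and the device names in the example call are placeholders, not the real serials:

```shell
# Print (not run) the manual equivalent of `zpool replace` for a mirror.
# $1 = pool, $2 = healthy existing device, $3 = failing device, $4 = new device
mirror_swap_cmds() {
    pool=$1
    healthy=$2
    failing=$3
    new=$4
    # 1. Add the new disk as a third side of the mirror (starts the resilver).
    echo "zpool attach $pool $healthy $new"
    # 2. Check progress; continue only once the resilver reports 0 errors.
    echo "zpool status $pool"
    # 3. Drop the failing disk from the mirror once the new one is complete.
    echo "zpool detach $pool $failing"
}

mirror_swap_cmds VMDATA-HDD-ZFS ata-GOOD_DISK ata-FAILING_DISK ata-NEW_DISK
```

The detach line should only be run after the resilver has finished, since until then the failing drive is still a usable source of data.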

edwin2024 · Jun 9, 2026

UdoB said:
It would be a temporary extension. Basically it is exactly the same what "replace" does if the old drive is still connected, just more "manually".

Code:

$ man zpool-replace

DESCRIPTION
     Replaces device with new-device.  This is equivalent to attaching new-device, waiting for it to resilver, and then detaching device.

The point is: do not remove the problematic drive at the beginning of this maintenance task, but at the end - when the new drive already contains all relevant data.

Your problematic drive is not completely dead, right? During resilvering the new drive it can deliver valid data if the other drive gets into trouble. Yes, this is not really probable, but it is a "free" feature.

Thanks for explaining. The problematic drive is still functional, and even SMART says OK, but there is a worrying and growing number (2200+) of reallocated sectors. The ZFS pool is also online and not degraded.

edwin2024 · Jun 18, 2026

Totally missed the automatic scrub => drive went offline.
So I followed the starting post and added all the good stuff you all mentioned, except for adding a 3rd drive to the pool (I did not have a drive large enough).
Did all the backups to a NAS and included the PVE config files.
In the end all went fine (apart from an empty CMOS battery => all BIOS settings gone) and the ZFS pool is online again with all data on it. No data lost, no backups needed.
Thanks everyone for thinking along with me!

guletz · Jun 19, 2026

Hi,

You missed this important step:
1. Test your new HDD:
- run in parallel: smartctl -t long and a badblocks test, for at least 3 days

Without this, you may waste your time!

Good luck / Bafta !

edwin2024 · Jun 21, 2026

Hm, good point. For now the pool is online, and I'll keep an eye on the SMART parameters and run a regular smartctl test.
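
Keeping an eye on the two counters mentioned at the start of the thread could be done with a small filter. A sketch only: smart_counters is a hypothetical helper, and it assumes the usual ATA attribute table of smartctl -A, where the raw value is the tenth column:

```shell
# Print the raw reallocated and pending sector counts from
# `smartctl -A` output on stdin, one NAME=VALUE pair per line.
smart_counters() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
        print $2 "=" $10
    }'
}
```

Possible use on the host: smartctl -A /dev/sdX | smart_counters, alongside an occasional smartctl -t long /dev/sdX. Values that keep growing between runs are the warning sign.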
