ProxMox booting and zfs HD replacement

RainerM

May 12, 2024
Hi all,
I am having trouble replacing a HD that ProxMox passes through to an OpenMediaVault VM (VM500), which manages the HDs with zfs.

When ProxMox (8.4.19) boots, on the console I get the message:
-->
Port3: ST12000VN0007-2GS116
S.M.A.R.T Status Bad, Backup and Replace
Press F1 to continue
<--
(That's how I first noticed the problem.)

After pressing F1 boot continues without showing any problems.

On ProxMox: Datacenter ==> svProx1 ==> Disks:
/dev/sdd ... ST12000VN0007-2GS116 ... with Serial ZJV5JAHP <== Shows S.M.A.R.T FAILED!
But no details about the failure are shown and the S.M.A.R.T. values look more or less normal.

Any idea how to find the detailed reason for the disk failure?
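
One way to get more detail than the GUI's FAILED flag is to query the drive with smartctl on the host. The sketch below assumes smartmontools is installed on svProx1 and that the disk is still /dev/sdd; the function name smart_report and the RUN dry-run variable are illustrative only, not part of any tool.

```shell
# Sketch: collect the SMART details behind the "FAILED" flag.
# Assumptions: smartmontools is installed; /dev/sdd is the disk from the post.
# smart_report and RUN are hypothetical helper names.
smart_report() {
    dev="${1:-/dev/sdd}"
    # With RUN=echo the commands are only printed (dry run).
    ${RUN:-} smartctl -H "$dev"        # overall health self-assessment
    ${RUN:-} smartctl -A "$dev"        # attribute table; see the WHEN_FAILED column
    ${RUN:-} smartctl -l error "$dev"  # errors logged by the drive itself
}
```

Calling `RUN=echo smart_report /dev/sdd` only prints the three commands. In the `smartctl -A` output, an attribute marked FAILING_NOW or In_the_past in the WHEN_FAILED column is usually what trips the overall status.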



FYI: In what follows, svProx1 is the name of the ProxMox server and sv4000 is the name of the OpenMediaVault (VM500) server.

On svProx1, five HDs are handed over to OpenMediaVault, which manages them exclusively with zfs.
ProxMox hands the five disks over with:
qm set 500 -scsi1 /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV54B6L
qm set 500 -scsi2 /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV205SW
qm set 500 -scsi3 /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV56CPE
qm set 500 -scsi4 /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV5JAHP <== This HD is faulty now
qm set 500 -scsi5 /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV5LWM0

On OpenMediaVault (sv4000) all these HDs were included in one zfs pool 'ZFSRaidZ2':
sv4000 => ZFS => Add pool ZFSRaidZ2 (RAIDZ2 pool)
Name: ZFSRaidZ2
Devices: (all disks)
Mount Point: /ZFSRaidZ2

###--- First idea. I already tried to replace the faulty HD:

I simply swapped the faulty HD for a new one, hoping zfs would resilver the disk automatically.
But the OpenMediaVault VM (VM500) no longer started and only displayed:
Error: start failed: QEMU Exited with code 1
Details: (kvm: -drive file=/dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV5JAHP,if=none,id=drive-scsi4,format=raw,cache=none,aio=io_uring,detect-zeroes=on: Could not open '/dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV5JAHP': No such file or directory
TASK ERROR: start failed: QEMU exited with code 1)

I then swapped the faulty disk back in and booted again.
Everything booted, but my first attempt to replace the disk had failed.
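
The error above names the exact path QEMU could not open: the handover entry in the VM configuration embeds the old disk's serial number, so that path disappears when the disk is swapped. A minimal sketch for checking this before starting the VM, assuming the configuration lives in /etc/pve/qemu-server/500.conf (the file listed further down in this post); the function name missing_passthrough is made up for illustration.

```shell
# Sketch: list scsiN entries of a VM config whose /dev/... path does not exist.
# missing_passthrough is a hypothetical helper, not a Proxmox command.
missing_passthrough() {
    conf="$1"
    # Keep only lines such as "scsi4: /dev/disk/by-id/...,size=11176G"
    # and reduce them to "<slot> <path>".
    sed -n 's|^\(scsi[0-9]*\): \(/dev/[^,]*\).*|\1 \2|p' "$conf" |
    while read -r slot path; do
        [ -e "$path" ] || echo "$slot $path"
    done
}
```

`missing_passthrough /etc/pve/qemu-server/500.conf` prints nothing when every handed-over disk is present; after a physical swap it would be expected to print the scsi4 line.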



###--- Second idea. My new plan to replace the HD:
According to the Oracle documentation, replacing a disk only requires this command:
zpool replace ZFSRaidZ2 scsi-0QEMU_QEMU_HARDDISK_drive-scsi4

Since I'm not fully familiar with zfs, I would rather ask for advice before damaging the zfs pool.

I think that would be the correct way if the zfs pool were managed directly by ProxMox.

Is it also the correct way to replace a HD on a ProxMox server where the HDs are handed over to OpenMediaVault, which manages them with zfs?
That would be the easiest way if it works, but I'm afraid more manual steps are needed.
Could I damage the zfs pool that way?
Any suggestions?
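
If the new disk is handed over in the same scsi4 slot, the guest should see it under the same by-id name as before, and the single-device form of zpool replace is the documented case for a disk that was physically swapped in place. A cautious sketch under that assumption; replace_in_place and RUN are illustrative names, and whether the OpenMediaVault UI expects anything further is not covered here.

```shell
# Sketch: replace a pool member with the new disk that took over its name.
# Assumption: the new disk appears in the guest under the old by-id name.
# replace_in_place and RUN are hypothetical helper names.
replace_in_place() {
    pool="$1"; dev="$2"
    # With RUN=echo the commands are only printed (dry run).
    ${RUN:-} zpool replace "$pool" "$dev" || return 1  # starts the resilver
    ${RUN:-} zpool status "$pool"                      # shows resilver progress
}
```

A dry run such as `RUN=echo replace_in_place ZFSRaidZ2 scsi-0QEMU_QEMU_HARDDISK_drive-scsi4` shows what would be executed on sv4000 without touching the pool.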



Here is what I did to identify the faulty zfs HD on svProx1 and sv4000:

# Identify the HD on svProx1 by the serial number shown in the svProx1 UI.
root@svProx1:~# lsblk -o NAME,FSTYPE,FSVER,LABEL,FSAVAIL,PARTUUID,PTUUID,SERIAL
NAME FSTYPE FSVER LABEL FSAVAIL PARTUUID PTUUID SERIAL
sdd 9be172dd-2d12-1440-a227-48b5e8af1123 ZJV5JAHP <== defective disk
├─sdd1 zfs_member 5000 ZFSRaidZ2 5faa4d32-79a5-864b-9ebf-261e0d890e32 9be172dd-2d12-1440-a227-48b5e8af1123 <== PARTUUID sdd1
└─sdd9 6af2a07c-c072-2a48-8a39-203b99406825 9be172dd-2d12-1440-a227-48b5e8af1123 <== PARTUUID sdd9

# Identify the device name on sv4000 by PTUUID.
root@sv4000:~# lsblk -o NAME,FSTYPE,FSVER,LABEL,FSAVAIL,PARTUUID,PTUUID
NAME FSTYPE FSVER LABEL FSAVAIL PARTUUID PTUUID
sde 9be172dd-2d12-1440-a227-48b5e8af1123
├─sde1 zfs_member 5000 ZFSRaidZ2 5faa4d32-79a5-864b-9ebf-261e0d890e32 9be172dd-2d12-1440-a227-48b5e8af1123 <== PARTUUID of sde1
└─sde9 6af2a07c-c072-2a48-8a39-203b99406825 9be172dd-2d12-1440-a227-48b5e8af1123 <== PARTUUID of sde9

On sv4000 the defective HD appears as sde.

# On sv4000, identify the disk ID used by zfs from the device name
root@sv4000:~# ls -l /dev/disk/by-id | grep 'sde[19]'
lrwxrwxrwx 1 root root 10 Jun 2 15:17 scsi-0QEMU_QEMU_HARDDISK_drive-scsi4-part1 -> ../../sde1 <== Name: scsi-0QEMU_QEMU_HARDDISK_drive-scsi4 of sde1
lrwxrwxrwx 1 root root 10 Jun 2 15:17 scsi-0QEMU_QEMU_HARDDISK_drive-scsi4-part9 -> ../../sde9 <== Name: scsi-0QEMU_QEMU_HARDDISK_drive-scsi4 of sde9

The scsi disk ID is scsi4.
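
The mapping from a host serial number to the guest device name can also be read straight from the VM configuration, because the guest name seen above follows the pattern scsi-0QEMU_QEMU_HARDDISK_drive-<slot>. The sketch below assumes that pattern holds (it is taken from the listing in this post and presumably only applies when no custom serial is set on the disk); guest_id_for_serial is a hypothetical helper name.

```shell
# Sketch: derive the guest by-id name from the host serial via the VM config.
# Assumption: guest names look like scsi-0QEMU_QEMU_HARDDISK_drive-<slot>,
# as in the ls output above. guest_id_for_serial is a made-up name.
guest_id_for_serial() {
    conf="$1"; serial="$2"
    # Find the scsiN slot whose by-id path ends in _<serial>.
    slot=$(sed -n "s|^\(scsi[0-9]*\): /dev/disk/by-id/[^,]*_${serial},.*|\1|p" "$conf")
    if [ -n "$slot" ]; then
        echo "scsi-0QEMU_QEMU_HARDDISK_drive-$slot"
    else
        return 1
    fi
}
```

On svProx1 this would be called as `guest_id_for_serial /etc/pve/qemu-server/500.conf ZJV5JAHP`.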

# Show the zfs status on sv4000 (sadly, the HDs' internal serial numbers are not shown)
ZPOOL_SCRIPTS_AS_ROOT=1 zpool status -c serial
pool: ZFSRaidZ2
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: resilvered 538M in 00:26:06 with 0 errors on Mon May 11 12:40:13 2026
config:

NAME STATE READ WRITE CKSUM serial
ZFSRaidZ2 ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
scsi-0QEMU_QEMU_HARDDISK_drive-scsi1 ONLINE 0 0 0 -
scsi-0QEMU_QEMU_HARDDISK_drive-scsi2 ONLINE 0 0 0 -
scsi-0QEMU_QEMU_HARDDISK_drive-scsi3 ONLINE 0 0 0 -
scsi-0QEMU_QEMU_HARDDISK_drive-scsi4 ONLINE 0 0 0 -
scsi-0QEMU_QEMU_HARDDISK_drive-scsi5 ONLINE 0 0 0 -

errors: No known data errors
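
Since the serial column only shows "-", the by-id device names are the only handle inside the guest. A small sketch that reduces zpool status output to one line per disk and flags anything that is not ONLINE or has non-zero error counters; pool_members is an illustrative name, and the filter assumes member names start with "scsi-" as in the output above.

```shell
# Sketch: summarize pool members from "zpool status" output read on stdin.
# Assumption: member names start with "scsi-"; pool_members is a made-up name.
pool_members() {
    awk '$1 ~ /^scsi-/ {
        # Columns: NAME STATE READ WRITE CKSUM
        flag = ($2 != "ONLINE" || $3 + $4 + $5 > 0) ? "CHECK" : "ok"
        print $1, $2, flag
    }'
}
```

Typical use on sv4000 would be `zpool status ZFSRaidZ2 | pool_members`.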

Any suggestions ?
Any better ways ?



###--- Third idea. Replace the faulty HD by manual steps:

# On svProx1, identify the scsi disk handed over by qm by checking the VM500 configuration
root@svProx1: more /etc/pve/qemu-server/500.conf
===>>>
#http%3A//172.16.1.4/#/login
agent: 1
boot: order=scsi0;ide2;net0
cores: 4
cpu: x86-64-v2-AES
ide2: none,media=cdrom
memory: 16384
meta: creation-qemu=8.1.5,ctime=1717161677
name: sv4000
net0: virtio=BC:24:11:30:8E:AB,bridge=vmbr0,firewall=1
net1: virtio=BC:24:11:55:18:BF,bridge=vmbr1,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-500-disk-0,iothread=1,size=32G
scsi1: /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV54B6L,size=11176G
scsi2: /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV205SW,size=11176G
scsi3: /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV56CPE,size=11176G
scsi4: /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV5JAHP,size=11176G <= Faulty disk identified by serial number
scsi5: /dev/disk/by-id/ata-ST12000VN0007-2GS116_ZJV5LWM0,size=11176G
scsihw: virtio-scsi-single
smbios1: uuid=d242a868-6726-40fb-a9bc-ebd0a8da5dcb
sockets: 1
vmgenid: 86da8a55-11a5-40b2-b93f-4ec31902b01a
<<<===

# Replace HD by these steps:
# First step: Detach HD from sv4000 zfs
root@sv4000: zpool detach ZFSRaidZ2 scsi-0QEMU_QEMU_HARDDISK_drive-scsi4

# Then remove the HD from the svProx1 handover to sv4000 (with force if necessary)
root@svProx1: qm unlink 500 --idlist scsi4 [--force 1]

# ==> shut down svProx1
# ==> swap the ZJV5JAHP disk for the new disk
# ==> boot and identify the new disk by serial number as in the steps above
# ==> hand the new HD over to sv4000

# Hand the new disk over to sv4000 again with qm
root@svProx1: qm set 500 -scsi4 /dev/disk/by-id/ata-... (new HD)

# Attach the newly handed over disk to 'ZFSRaidZ2'
root@sv4000: zpool attach ZFSRaidZ2 scsi-... new HD
# DO NOT USE add - root@sv4000: zpool add ZFSRaidZ2 scsi-... (new HD) - would add the disk to the pool as a separate top-level vdev.
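
One point worth checking against the zpool man pages before running the steps above: zpool detach and zpool attach are described there for mirror vdevs, while members of a raidz2 vdev are normally exchanged with zpool replace. A shorter variant would therefore be to stop the VM, repoint the scsi4 entry at the new disk on the host, start the VM, and run zpool replace in the guest. The host-side part could be sketched as below; repoint_passthrough and RUN are hypothetical names, and the assumption is that qm set on an already used slot overwrites that entry.

```shell
# Sketch: repoint an existing handover slot at a new disk, host side only.
# Assumption: "qm set <vmid> -scsiN <path>" overwrites the existing scsiN entry.
# repoint_passthrough and RUN are hypothetical helper names.
repoint_passthrough() {
    vmid="$1"; slot="$2"; newpath="$3"
    # Refuse to write a path that does not exist (the cause of the QEMU start error).
    if [ ! -e "$newpath" ]; then
        echo "not found: $newpath" >&2
        return 1
    fi
    # With RUN=echo the command is only printed (dry run).
    ${RUN:-} qm set "$vmid" "-$slot" "$newpath"
}
```

The third argument is the full /dev/disk/by-id path of the new disk, which is only known once the disk is installed and identified by its serial number.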

Are any steps missing?
Is there any risk of damaging the ZFSRaidZ2 pool?
Any suggestions or better ways?

Thanks
Rainer