[SOLVED] Help with replacing failed drive, zpool replace not working

surfing_IT · New Member · Jul 29, 2020
Hello everyone,
My ZFS data store had a drive fail this weekend and I'm having trouble replacing it. I've attached screenshots to document the setup.

About:
ext4 SSD for boot
10x 8TB in raidz2 + SSD for cache & log
/dev/sdc is the drive being replaced (it had bad sectors, meh)

Steps:
I followed this guide: https://dannyda.com/2020/05/16/how-...al-disk-from-proxmox-pve-for-zfs-pool-easily/
But zpool replace tank /dev/sdc fails with the message: cannot replace /dev/sdc with /dev/sdc: no such device in pool
This is where I am stuck, and replacing a drive should not be this hard!

The new/replacement drive is in the same slot (/dev/sdc) but always shows as unavailable in zpool status -v and in the ZFS GUI.
PVE adds GPT and ZFS automatically in the Disks GUI and shows the partitions added correctly, but the disk is not added to the pool and no resilver starts.
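The "no such device in pool" error usually means the pool tracks its member disks by the names shown in zpool status, not by the short /dev/sdX kernel name. A way to map /dev/sdc to its stable by-id name (the device name here is just the one from this thread) might be:

```shell
# Map the kernel name /dev/sdc to its stable /dev/disk/by-id/ name;
# the pool refers to member disks by these by-id names.
ls -l /dev/disk/by-id/ | grep 'sdc$'

# List the exact vdev names the pool currently knows about:
zpool status -v tank
```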

Thanks!!

Please see attached screen shots.
 

Attachments

  • caspar_-_Proxmox_Virtual_Environment-2.jpg
  • caspar_-_Proxmox_Virtual_Environment-3.jpg
  • caspar_-_Proxmox_Virtual_Environment-4.jpg
  • caspar_-_Proxmox_Virtual_Environment.jpg
The zpool status output is included in the screenshots; see caspar_-_Proxmox_Virtual_Environment.jpg above.
 
Thank you. I think I got it now. I ran:
Code:
zpool replace tank /dev/disk/by-id/ata-ST8000DM004-2CX188_ZCT1J3AZ-part1 /dev/disk/by-id/ata-ST8000VN004-2M2101_WKD1PMS8-part1
And zpool status -v now shows 'replacing-2' and 'please wait for the resilver to complete'.
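To watch the replacement, zpool status accepts a refresh interval, and newer OpenZFS releases (2.0+) can block until the resilver finishes; the pool name is the one from this thread:

```shell
# Refresh the pool status every 5 seconds while the resilver runs:
zpool status -v tank 5

# On OpenZFS 2.0+, block until the resilver activity completes:
zpool wait -t resilver tank
```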
 
SOLVED?
To replace a drive:
  1. Find the failed drive (write down its serial number; if no hot-swap: shut down the host, physically replace the drive, start the host)
  2. Wipe the partitions on the new drive via cfdisk
  3. In a shell, run:
  • zpool status -v (to get the drive IDs)
  • zpool replace -f pool-name /dev/disk/by-id/old-disk-id /dev/disk/by-id/new-disk-id
  • zpool status -v (to confirm the resilver is in progress)
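The numbered steps can be sketched as a dry-run script. The pool name matches this thread, the by-id disk names are hypothetical placeholders, and the `run` wrapper only prints each command, so nothing destructive executes until you remove it on the real host:

```shell
#!/bin/sh
# Dry-run sketch of the replacement steps above. POOL matches this
# thread; the by-id disk names are hypothetical placeholders.
set -eu

POOL="tank"
OLD="/dev/disk/by-id/ata-OLD_DISK_SERIAL"   # failed disk (placeholder)
NEW="/dev/disk/by-id/ata-NEW_DISK_SERIAL"   # replacement disk (placeholder)

# Print each command instead of executing it; drop this wrapper on a real host.
run() { printf '+ %s\n' "$*"; }

run zpool status -v "$POOL"                  # 1. note the failed disk's by-id name
run wipefs -a "$NEW"                         # 2. clear old partition data
                                             #    (non-interactive alternative to cfdisk)
run zpool replace -f "$POOL" "$OLD" "$NEW"   # 3. start the replacement
run zpool status -v "$POOL"                  #    confirm the resilver is in progress
```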
 
SOLVED?
To replace a drive:
  1. Find the failed drive (write down its serial number; if no hot-swap: shut down the host, physically replace the drive, start the host)
  2. Wipe the partitions on the new drive via cfdisk
  3. In a shell, run:
  • zpool status -v (to get the drive IDs)
  • zpool replace -f pool-name /dev/disk/by-id/old-disk-id /dev/disk/by-id/new-disk-id
  • zpool status -v (to confirm the resilver is in progress)

That is how it is described in the reference docs, yes. The device names depend on how your zpool was created and what naming scheme it uses.
 
Hi.

When a disk is replaced, is the partition table automatically copied as well?
In a test I did, I could not boot the server from the replaced disk.
The resilver didn't report any errors.

Hello
Andrew.
 
No. If you boot from that pool you need to partition the new disk yourself first, copy over the bootloader, and then use the third partition of that disk with "zpool replace". See the paragraph "Changing a failed bootable device" here for PVE 6.4 or higher.
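For reference, the procedure from that section of the PVE admin guide (6.4+, systems booting via proxmox-boot-tool) looks roughly like this, with placeholder device names you must substitute:

```shell
# Replacing a failed bootable device (PVE 6.4+ with proxmox-boot-tool).
# /dev/sdX = healthy bootable disk, /dev/sdY = new disk (placeholders).
sgdisk /dev/sdX -R /dev/sdY     # copy the partition table to the new disk
sgdisk -G /dev/sdY              # randomize the new disk's partition GUIDs

# ZFS lives on the third partition:
zpool replace -f rpool <old-zfs-partition> /dev/sdY3

# Set up the new ESP (second partition) so the new disk is bootable:
proxmox-boot-tool format /dev/sdY2
proxmox-boot-tool init /dev/sdY2
```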
 
Hi Dunuin.
I replaced a failed disk in a raid 1 zfs pool.
This was the old situation :

Code:
root@proxmox1:~# zpool status -v SSD_4_5_2TB
  pool: SSD_4_5_2TB
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:35:58 with 0 errors on Sun Oct 10 00:59:59 2021
config:

        NAME                                             STATE     READ WRITE CKSUM
        SSD_4_5_2TB                                      DEGRADED     0     0     0
          mirror-0                                       DEGRADED     0     0     0
            ata-Samsung_SSD_860_PRO_2TB_S42KNX0R701089J  ONLINE       0     0     0
            ata-Samsung_SSD_860_PRO_2TB_S42KNX0R701206P  DEGRADED     0     0    16  too many errors

errors: No known data errors



I successfully copied the partitions with these commands:


Code:
sgdisk /dev/sde -R /dev/sdf
sgdisk -G /dev/sdf

Then I ran this command:
Code:
zpool replace SSD_4_5_2TB /dev/disk/by-id/ata-Samsung_SSD_860_PRO_2TB_S42KNX0R701206P-part1 /dev/disk/by-id/ata-Samsung_SSD_860_PRO_2TB_S42KNX0R401526V-part1

But now when I run zpool status I see this:
Code:
root@proxmox1:~# zpool status
  pool: SSD_4_5_2TB
 state: ONLINE
  scan: resilvered 893G in 00:36:22 with 0 errors on Mon Jan  3 19:02:01 2022
config:

        NAME                                                   STATE     READ WRITE CKSUM
        SSD_4_5_2TB                                            ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            ata-Samsung_SSD_860_PRO_2TB_S42KNX0R701089J        ONLINE       0     0     0
            ata-Samsung_SSD_860_PRO_2TB_S42KNX0R401526V-part1  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 01:59:13 with 0 errors on Sun Dec 12 02:23:14 2021
config:

        NAME                                             STATE     READ WRITE CKSUM
        rpool                                            ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            ata-TOSHIBA_MG04ACA200NY_9887Y0KNF7EE-part3  ONLINE       0     0     0
            ata-TOSHIBA_MG04ACA200NY_98DYK2W2F7EE-part3  ONLINE       0     0     0

errors: No known data errors



Do you think I made a mistake in the replacement procedure?
 
You only need to do that if you boot from that pool. You have a separate "rpool" that you appear to boot from, so a normal "zpool replace" is fine for your "SSD_4_5_2TB" pool. But if a disk of the "rpool" fails, you would need the additional steps: copying the partition table, the bootloader, and so on.
 
Thanks Dunuin.
I thought I had made a mistake.
The zpool status command shows the second disk with a -part1 suffix.
Is this right?

Thanks so much.
 
Better would be without the partition suffix, so both disks are referenced the same way... but it should work.
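If you did want both mirror members referenced the same way, one option (sketched here with the serials from this thread) would be to detach the partition-named member and re-attach the whole disk; be aware this drops the mirror's redundancy until the attach's resilver completes, and may need -f since the disk already carries a ZFS label:

```shell
# Detach the member that was added as -part1, then re-attach the whole disk.
# Redundancy is reduced until the resilver after `zpool attach` completes.
zpool detach SSD_4_5_2TB ata-Samsung_SSD_860_PRO_2TB_S42KNX0R401526V-part1
zpool attach SSD_4_5_2TB ata-Samsung_SSD_860_PRO_2TB_S42KNX0R701089J \
    /dev/disk/by-id/ata-Samsung_SSD_860_PRO_2TB_S42KNX0R401526V
```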
 
