rpool DEGRADED ('label is missing or invalid' - 'part of active pool' error when trying to replace)

bea · Nov 9, 2023

Hello,

I have a PBS 2.1-5 which shows:

Code:

root@pbs:/# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 41.5G in 01:44:41 with 0 errors on Fri Aug 11 16:49:39 2023
config:

        NAME                                             STATE     READ WRITE CKSUM
        rpool                                            DEGRADED     0     0     0
          mirror-0                                       DEGRADED     0     0     0
            15157359014898742671                         UNAVAIL      0     0     0  was /dev/disk/by-id/ata-MT-2TB_9111012102018-part3
            ata-SanDisk_SDSSDH3_2T00_2143A9440310-part3  ONLINE       0     0     0

errors: No known data errors

Code:

root@pbs:/dev/disk/by-id# ls -hal             
total 0 
drwxr-xr-x 2 root root 200 Oct 25 21:57 . 
drwxr-xr-x 7 root root 140 Oct 25 21:57 .. 
lrwxrwxrwx 1 root root   9 Oct 25 21:57 ata-SanDisk_SDSSDH3_2T00_2143A9440310 -> ../../sda 
lrwxrwxrwx 1 root root  10 Oct 25 21:57 ata-SanDisk_SDSSDH3_2T00_2143A9440310-part1 -> ../../sda1 
lrwxrwxrwx 1 root root  10 Oct 25 21:57 ata-SanDisk_SDSSDH3_2T00_2143A9440310-part2 -> ../../sda2 
lrwxrwxrwx 1 root root  10 Oct 25 21:57 ata-SanDisk_SDSSDH3_2T00_2143A9440310-part3 -> ../../sda3 
lrwxrwxrwx 1 root root   9 Oct 25 21:57 wwn-0x5001b444a734cb75 -> ../../sda 
lrwxrwxrwx 1 root root  10 Oct 25 21:57 wwn-0x5001b444a734cb75-part1 -> ../../sda1 
lrwxrwxrwx 1 root root  10 Oct 25 21:57 wwn-0x5001b444a734cb75-part2 -> ../../sda2 
lrwxrwxrwx 1 root root  10 Oct 25 21:57 wwn-0x5001b444a734cb75-part3 -> ../../sda3

Code:

root@pbs:/dev/disk/by-id# smartctl -H -d ata /dev/disk/by-id/wwn-0x5001b444a734cb75-part3 
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.19-6-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Then I read in other posts that I might have to replace the name. I tried, but got this error:

Code:

root@pbs:/dev/disk/by-id# zpool replace -f rpool 15157359014898742671 wwn-0x5001b444a734cb75-part3 
invalid vdev specification 
the following errors must be manually repaired: 
/dev/disk/by-id/wwn-0x5001b444a734cb75-part3 is part of active pool 'rpool'

What other steps should I take?
Do I have to physically replace the disk?

Thank you!

Dunuin · Nov 9, 2023

First, "wwn-0x5001b444a734cb75-part3" and "ata-SanDisk_SDSSDH3_2T00_2143A9440310-part3" are the same disk. So your second disk isn't recognized anymore and probably completely dead.
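One way to confirm that two by-id names refer to the same partition is to resolve both symlinks and compare the results. A minimal sketch; `same_device` is a made-up helper name, not an existing tool:

```shell
# Succeed if both paths resolve to the same underlying file or device node.
same_device() {
    [ "$(readlink -f -- "$1")" = "$(readlink -f -- "$2")" ]
}

# Example with the two names from the first post (only meaningful on that system):
# same_device /dev/disk/by-id/wwn-0x5001b444a734cb75-part3 \
#             /dev/disk/by-id/ata-SanDisk_SDSSDH3_2T00_2143A9440310-part3 \
#     && echo "same partition"
```

In the listing above both names are symlinks to ../../sda3, so the check would succeed there.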
Second, you can't just do a "zpool replace -f rpool 15157359014898742671 wwn-0x5001b444a734cb75-part3" as you are booting from that mirror. You will first have to clone the partition table and sync the bootloader from the healthy disk to the new disk, similar to what's described in the chapter "Changing a failed bootable device": https://pve.proxmox.com/wiki/ZFS_on_Linux#_zfs_administration
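As a rough sketch of the order of steps in that chapter, the helper below only prints the commands instead of running them, so the plan can be reviewed first. `print_replace_plan` is a made-up name, and it assumes the default layout where partition 2 is the ESP and partition 3 is the ZFS partition, with disks given as /dev/disk/by-id paths. Check both assumptions against your own system before running anything.

```shell
# Dry run: print the replacement steps, do not execute them.
print_replace_plan() {
    healthy_disk=$1   # by-id path of the disk that still works and boots
    new_disk=$2       # by-id path of the new, empty disk
    pool=$3           # pool name, e.g. rpool
    failed_vdev=$4    # name or GUID of the UNAVAIL vdev from zpool status

    # Copy the partition table from the healthy disk to the new disk.
    echo "sgdisk $healthy_disk -R $new_disk"
    # Give the new disk its own random GUIDs.
    echo "sgdisk -G $new_disk"
    # Replace the FAILED vdev (not the healthy one) with the new ZFS partition.
    echo "zpool replace -f $pool $failed_vdev ${new_disk}-part3"
    # Make the new disk bootable (assumes partition 2 is the ESP).
    echo "proxmox-boot-tool format ${new_disk}-part2"
    echo "proxmox-boot-tool init ${new_disk}-part2"
}

# Usage (placeholders, fill in from ls -la /dev/disk/by-id and zpool status):
# print_replace_plan /dev/disk/by-id/HEALTHY /dev/disk/by-id/NEW rpool FAILED_VDEV
```

Note that the third argument of zpool replace is the failed device, and the healthy disk only appears as the source of the partition table.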

bea · Dec 7, 2023

Thank you very much.

I replaced the disk and then followed the instructions in the documentation without an issue, but please see in the attached picture what I get now.

What shall I do?

Thank you!

Dunuin · Dec 7, 2023

Looks like you replaced the wrong disk. So the healthy disk got replaced with the new disk?

bea · Dec 7, 2023

But if I had done so, I guess I would not see the system keep working perfectly, would I?
And I guess I would not see one device shown as online, would I?

Dunuin · Dec 7, 2023

bea said:
But if I had done so, I guess I would not see the system keep working perfectly, would I?
And I guess I would not see one device shown as online, would I?

Replacing the wrong disk would result in exactly what you see, with the system still running.
To be sure, check your pool's log: zpool history rpool
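The history can be long, so it may help to filter it down to the commands that change the mirror's members. A small sketch; `replace_history` is a made-up helper name:

```shell
# Keep only history lines for commands that change which devices are in a vdev.
replace_history() {
    grep -E 'zpool (replace|attach|detach) '
}

# Usage:
# zpool history rpool | replace_history
```

Each matching line shows which device was named first (the one being replaced) and which second (the replacement).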

bea · Dec 7, 2023

I replaced the disk, did the two `sgdisk` commands and had to do:

Code:

zpool replace -f <pool> <old zfs partition> <new zfs partition>

So, as I had replaced the non-working disk, the system physically contained only two disks: the old working (ONLINE) device and the new empty device. So I understood <old zfs partition> as the old working device, and did the following:

Code:

zpool replace -f rpool ata-SanDisk_SDSSDH3_2T00_2143A9440310-part3 ata-MT-2TB_0013603004006-part3

Was that wrong? How could I fix it now?

Dunuin · Dec 7, 2023

bea said:
<old zfs partition> as the old working device

No, that's the old failed vdev you want to replace with the new one.
You want to clone the partition table of the old healthy disk to the new disk and write the bootloader to the new disk but replace the ZFS partition of the old failed disk with the one of the new disk.

bea said:
How could I fix it now?

Replace it again, but this time replace the failed vdev with the ZFS partition of your old but working disk. I'm not sure whether you will have to write the bootloader or clone the partition table again. The fdisk -l output of your old but working disk could help.
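The name to use for the failed vdev is whatever zpool status shows in the NAME column of the UNAVAIL row. In the first post that is the numeric GUID 15157359014898742671. A small sketch for pulling it out of the output; `failed_vdevs` is a made-up helper name:

```shell
# Print the NAME column of every vdev whose STATE is UNAVAIL or FAULTED.
# Reads zpool status output on stdin.
failed_vdevs() {
    awk '$2 == "UNAVAIL" || $2 == "FAULTED" { print $1 }'
}

# Usage:
# zpool status rpool | failed_vdevs
```

The printed value goes in the position of the old device in zpool replace. The ONLINE device must not be used there.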

bea · Dec 7, 2023

Thank you for your help.

Code:

## fdisk -l
Disk /dev/sda: 1.86 TiB, 2048408248320 bytes, 4000797360 sectors
Disk model: MT-2TB        
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F0013308-F61B-4BEF-A330-4CF53EEF46D9

Device       Start        End    Sectors  Size Type
/dev/sda1       34       2047       2014 1007K BIOS boot
/dev/sda2     2048    1050623    1048576  512M EFI System
/dev/sda3  1050624 3907029134 3905978511  1.8T Solaris /usr & Apple ZFS


Disk /dev/sdb: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: SanDisk SDSSDH3
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 07898F6B-5F61-458E-9CE7-EFE713F000E2

Device       Start        End    Sectors  Size Type
/dev/sdb1       34       2047       2014 1007K BIOS boot
/dev/sdb2     2048    1050623    1048576  512M EFI System
/dev/sdb3  1050624 3907029134 3905978511  1.8T Solaris /usr & Apple ZFS

I pasted this thing here but I don't know what I should conclude from it regarding bootloader and partition table.

And I don't know where to get the string of characters I should use for the old failed vdev on the replace command.

Would you please help me with the commands I should use?

Thank you

Dunuin · Dec 7, 2023

What's the output of ls -la /dev/disk/by-id and proxmox-boot-tool status?
And "SanDisk SDSSDH3" is the old healthy disk? What is that "MT-2TB" SSD? According to your first post the failed SSD was a "MT-2TB". And your latest screenshot shows the new replaced SSD is a "MT-2TB" too. Did you buy the same (probably crappy noname consumer) SSD twice that already failed previously or did you actually replace the healthy old SSD with the same failed old SSD?

By the way, it's highly recommended to use only enterprise-grade SSDs with ZFS, as ZFS doesn't perform well on consumer SSDs without power-loss protection and wears them out quite fast.

bea · Dec 7, 2023

Code:

:~# ls -la /dev/disk/by-id
total 0
drwxr-xr-x 2 root root 360 Dec  7 14:02 .
drwxr-xr-x 7 root root 140 Dec  7 14:02 ..
lrwxrwxrwx 1 root root   9 Dec  7 14:02 ata-MT-2TB_0013603004006 -> ../../sda
lrwxrwxrwx 1 root root  10 Dec  7 14:02 ata-MT-2TB_0013603004006-part1 -> ../../sda1
lrwxrwxrwx 1 root root  10 Dec  7 14:02 ata-MT-2TB_0013603004006-part2 -> ../../sda2
lrwxrwxrwx 1 root root  10 Dec  7 14:02 ata-MT-2TB_0013603004006-part3 -> ../../sda3
lrwxrwxrwx 1 root root   9 Dec  7 14:02 ata-SanDisk_SDSSDH3_2T00_2143A9440310 -> ../../sdb
lrwxrwxrwx 1 root root  10 Dec  7 14:02 ata-SanDisk_SDSSDH3_2T00_2143A9440310-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  10 Dec  7 14:02 ata-SanDisk_SDSSDH3_2T00_2143A9440310-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  10 Dec  7 14:02 ata-SanDisk_SDSSDH3_2T00_2143A9440310-part3 -> ../../sdb3
lrwxrwxrwx 1 root root   9 Dec  7 14:02 wwn-0x5000000000005105 -> ../../sda
lrwxrwxrwx 1 root root  10 Dec  7 14:02 wwn-0x5000000000005105-part1 -> ../../sda1
lrwxrwxrwx 1 root root  10 Dec  7 14:02 wwn-0x5000000000005105-part2 -> ../../sda2
lrwxrwxrwx 1 root root  10 Dec  7 14:02 wwn-0x5000000000005105-part3 -> ../../sda3
lrwxrwxrwx 1 root root   9 Dec  7 14:02 wwn-0x5001b444a734cb75 -> ../../sdb
lrwxrwxrwx 1 root root  10 Dec  7 14:02 wwn-0x5001b444a734cb75-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  10 Dec  7 14:02 wwn-0x5001b444a734cb75-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  10 Dec  7 14:02 wwn-0x5001b444a734cb75-part3 -> ../../sdb3

Code:

~# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
A0DA-28DA is configured with: uefi (versions: 5.13.19-6-pve, 5.15.131-1-pve, 5.15.131-2-pve)

Dunuin said:
And "SanDisk SDSSDH3" is the old healthy disk and "MT-2TB" the new one?

Yes, I think so.

Dunuin · Dec 8, 2023

And the output of lsblk?

According to your first post the failed SSD was a "MT-2TB". And your latest screenshot shows the new replaced SSD is a "MT-2TB" too. Did you buy the same (probably crappy noname consumer) SSD twice that already failed previously or did you actually replace the healthy old SSD with the same failed old SSD?

bea · Dec 8, 2023

Code:

:~# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0  1.9T  0 disk
|-sda1   8:1    0 1007K  0 part
|-sda2   8:2    0  512M  0 part
`-sda3   8:3    0  1.8T  0 part
sdb      8:16   0  1.8T  0 disk
|-sdb1   8:17   0 1007K  0 part
|-sdb2   8:18   0  512M  0 part
`-sdb3   8:19   0  1.8T  0 part

Dunuin said:
Did you buy the same (probably crappy noname consumer) SSD twice that already failed previously

Not sure, but it is likely.

Dunuin said:
or did you actually replace the healthy old SSD with the same failed old SSD?

Nope, I did not do so.

Dunuin · Dec 8, 2023

Dunuin said:
And the output of lsblk?

Sorry I meant blkid and not lsblk for getting the UUID of the ESPs.

bea · Dec 8, 2023

Code:

:~# blkid
/dev/sda3: LABEL="rpool" UUID="9956866810370661066" UUID_SUB="12835679230063345824" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="0c3e20c8-5d00-4cad-a276-ff80d10e68bf"
/dev/sdb2: UUID="A0DA-28DA" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="57ac766f-36cb-4912-a985-5f53d951009c"
/dev/sdb3: LABEL="rpool" UUID="9956866810370661066" UUID_SUB="8871495102062371041" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="637e136d-1a6b-47e7-9395-a53584c68e24"
/dev/sda1: PARTUUID="ed3b22f1-30c3-4958-99da-6edbbda2a6a1"
/dev/sda2: PARTUUID="cb1858ac-42b1-481a-8a21-bf9ebfa8ee0a"
/dev/sdb1: PARTUUID="85499764-c5d2-493a-8f70-4e9fd4fca384"

Dunuin · Dec 8, 2023

So according to "proxmox-boot-tool status" you are still booting from sdb2 (the Sandisk SSD) with no other bootloader available. But your sdb3 isn't part of the rpool anymore, as you replaced sdb3 with sda3 (the hopefully new MT-2TB SSD), which is now your only available vdev.
Didn't you format and init sda2 with the proxmox-boot-tool when replacing partition 3?
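The link between the two outputs is the filesystem UUID: proxmox-boot-tool status lists A0DA-28DA, and blkid shows that UUID on /dev/sdb2. A small sketch for looking that up; `esp_device_for_uuid` is a made-up helper name that only filters blkid output:

```shell
# Print the device whose blkid line carries the given filesystem UUID.
# $1 = UUID as printed by proxmox-boot-tool status; blkid output on stdin.
esp_device_for_uuid() {
    awk -v want="UUID=\"$1\"" '
        {
            for (i = 2; i <= NF; i++)
                if ($i == want) {      # exact match, so PARTUUID=... is ignored
                    sub(/:$/, "", $1)  # strip the trailing colon from the device
                    print $1
                }
        }'
}

# Usage:
# blkid | esp_device_for_uuid A0DA-28DA
```

With the blkid output posted above this prints /dev/sdb2, and sda2 has no filesystem UUID at all, which matches it never having been formatted as an ESP.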

This is pretty screwed up.

I personally would do this, but maybe one of the staff could also have a look before making things worse:

Code:

# Make new MT-2TB bootable:
proxmox-boot-tool format /dev/sda2
proxmox-boot-tool init /dev/sda2
# verify that sda2 (new MT-2TB) and sdb2 (Sandisk) are bootable:
proxmox-boot-tool status
# clone GPT and randomize UUIDs from new MT-2TB to Sandisk:
sgdisk /dev/sda -R /dev/sdb
sgdisk -G /dev/sdb
# replace the failed/old MT-2TB with the Sandisk:
zpool replace -f rpool 15157359014898742671 /dev/disk/by-id/ata-SanDisk_SDSSDH3_2T00_2143A9440310-part3
# Make Sandisk bootable:
proxmox-boot-tool format /dev/sdb2
proxmox-boot-tool init /dev/sdb2
# check pool status and bootloader status:
zpool status -v
proxmox-boot-tool status

bea · Dec 8, 2023

Dunuin said:
Didn't you format and init sda2 with the proxmox-boot-tool when replacing partition 3?

I did not. I am checking the instructions now and noticing I did not run proxmox-boot-tool format or init in any way.

bea · Dec 8, 2023

And I did run proxmox-boot-tool clean after the replace.

bea · Dec 9, 2023

Ok. I'm going to try the proposed commands. First, this is where I am:

Code:

:~# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 683G in 01:24:30 with 0 errors on Wed Dec  6 20:07:08 2023
config:

        NAME                                STATE     READ WRITE CKSUM
        rpool                               DEGRADED     0     0     0
          mirror-0                          DEGRADED     0     0     0
            15157359014898742671            UNAVAIL      0     0     0  was /dev/disk/by-id/ata-MT-2TB_9111012102018-part3
            ata-MT-2TB_0013603004006-part3  ONLINE       0     0     0

errors: No known data errors

And now I execute the first commands:

Code:

:~# proxmox-boot-tool format /dev/sda2
UUID="" SIZE="536870912" FSTYPE="" PARTTYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b" PKNAME="sda" MOUNTPOINT=""
Formatting '/dev/sda2' as vfat..
mkfs.fat 4.2 (2021-01-31)
Done.

Code:

:~# proxmox-boot-tool init /dev/sda2
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
UUID="" SIZE="536870912" FSTYPE="" PARTTYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b" PKNAME="sda" MOUNTPOINT=""
E: '/dev/sda2' has wrong filesystem (!= vfat).

Code:

:~# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
A0DA-28DA is configured with: uefi (versions: 5.13.19-6-pve, 5.15.131-1-pve, 5.15.131-2-pve)

Apparently something went wrong, what should I do?

bea · Dec 9, 2023

bea said:
E: '/dev/sda2' has wrong filesystem (!= vfat).

Rebooted and problem gone!
I could successfully execute the next commands, now it's resilvering after the replace. I'll post back when it's done.
