One or more devices could not be used because the label is missing or invalid.

pharpe
Member, Jul 15, 2020
I have a single Proxmox server with a ZFS pool across 3 drives. Today I discovered that the pool is reporting an error.
Code:
root@pharpe:~# zpool status -x storage
  pool: storage
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub in progress since Sat Jun 18 08:54:35 2022
        1.44G scanned at 148M/s, 2.21M issued at 226K/s, 12.4T total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME                                   STATE     READ WRITE CKSUM
        storage                                DEGRADED     0     0     0
          raidz1-0                             DEGRADED     0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKRAPUC  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_1SGGK0DZ  ONLINE       0     0     0
            14087233867307650821               UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1

errors: No known data errors

When I run smartctl, the disk passes:
Code:
root@pharpe:~# smartctl -H /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.189-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

I've tried rebooting to see if the drive will resilver, but it doesn't. Could someone help point me in the right direction for my next troubleshooting step?
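One small thing worth noting (not raised in the thread): the smartctl command above was pointed at the partition path (`…-part1`), while SMART data belongs to the whole disk. A minimal sketch of stripping the partition suffix from a by-id path, so the whole-disk node can be queried instead:

```shell
# The -part1 suffix names a partition; SMART attributes live on the disk itself.
part=/dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1
disk=${part%-part*}    # shell parameter expansion: drop the trailing "-partN"
echo "$disk"
# One could then run, as root: smartctl -a "$disk"
```

`smartctl -a` (rather than just `-H`) would also show the raw attribute table, which is often more informative than the overall PASSED/FAILED verdict.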
 

pharpe
Quote:
"Hi,
another thread [1] mentions that this might be a bug. You can try the fix they describe there: [2] (but I would recommend using the /dev/disk/by-id path, since the /dev/sd* names are not stable :))

[1] https://forum.proxmox.com/threads/zfs-faulted-drive-looking-for-help.59417/post-294047
[2] https://github.com/openzfs/zfs/issues/2076#issuecomment-652773410"
Thanks. I saw that thread but my disks are already using disk/by-id, not /dev/sd*, so I don't think it was applicable.
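For anyone following along: the by-id names are just udev-managed symlinks pointing at the unstable /dev/sdX nodes, which is why they survive reboots and port swaps. A small self-contained illustration (the /dev layout is simulated in a temp directory; on a real system one would simply run `ls -l /dev/disk/by-id`):

```shell
# Simulate the /dev/disk/by-id layout to show how a stable name maps to a node.
tmp=$(mktemp -d)
mkdir -p "$tmp/by-id"
touch "$tmp/sdb"    # stand-in for the kernel device node /dev/sdb
ln -s ../sdb "$tmp/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC"
mapping=$(for link in "$tmp"/by-id/*; do
  # resolve each symlink to the node it currently points at
  printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
done)
echo "$mapping"
rm -rf "$tmp"
```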
 

pharpe
I was able to force it to resilver using:
Code:
zpool replace storage /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1
It resilvered with no errors but now I have this:

Code:
root@pharpe:~# zpool status -x storage
pool 'storage' is healthy

Code:
root@pharpe:~# zpool status -v storage
  pool: storage
 state: DEGRADED
  scan: resilvered 1.50T in 12:07:28 with 0 errors on Mon Jun 20 02:26:11 2022
config:

        NAME                                   STATE     READ WRITE CKSUM
        storage                                DEGRADED     0     0     0
          raidz1-0                             DEGRADED     0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKRAPUC  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_1SGGK0DZ  ONLINE       0     0     0
            replacing-2                        DEGRADED   121 61.1K     0
              14087233867307650821             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1/old
              16298796475004352883             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1

errors: No known data errors

So I thought maybe I needed to clear the errors, so I ran:

Code:
root@pharpe:~# zpool clear storage
root@pharpe:~# zpool status -x storage
pool 'storage' is healthy
root@pharpe:~# zpool status -v storage
  pool: storage
 state: DEGRADED
  scan: scrub in progress since Mon Jun 20 07:26:37 2022
        4.07G scanned at 298M/s, 2.17M issued at 159K/s, 12.4T total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME                                   STATE     READ WRITE CKSUM
        storage                                DEGRADED     0     0     0
          raidz1-0                             DEGRADED     0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKRAPUC  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_1SGGK0DZ  ONLINE       0     0     0
            replacing-2                        DEGRADED     0 1.58K     0
              14087233867307650821             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1/old
              16298796475004352883             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1

errors: No known data errors

Why would zpool status -x report the pool as healthy, while zpool status -v shows it as DEGRADED yet with no known data errors?
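It appears (not confirmed in this thread) that `-x` keys off data errors rather than vdev state in cases like this, so a scripted health check is safer parsing the `state:` field of the full output than trusting `zpool status -x`. A minimal sketch, with the relevant lines of the output above hard-coded as sample input:

```shell
# Sample of the zpool status output from this thread (abbreviated).
status='  pool: storage
 state: DEGRADED
errors: No known data errors'

# Pull out the fields a monitoring script should actually look at.
state=$(printf '%s\n' "$status" | awk '/^ *state:/ {print $2}')
errors=$(printf '%s\n' "$status" | awk -F': *' '/^errors:/ {print $2}')
echo "state=$state errors=$errors"
# → state=DEGRADED errors=No known data errors
```

Checking `state=DEGRADED` here would catch the problem even while `-x` claims the pool is healthy.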
 

pharpe
More info: I opened the case, swapped the SATA and power cables with another drive, and plugged it into a different SATA port on the motherboard. After rebooting, zpool status -v still shows the same degraded state.

Then I tried swapping the 7SGZJ0UC with another known good 8 TB drive, to see if I could resilver onto it as a replacement. However, without the 7SGZJ0UC drive in, I cannot see the pool at all. Makes me think it must have been working after all?
Code:
root@pharpe:~# zpool status -v storage
cannot open 'storage': no such pool

I then put the 7SGZJ0UC drive into a Windows 11 machine and ran the WD Diagnostics on it and it says it's healthy.

 

pharpe
I put everything back the way it originally was, but I still don't see the pool. The drives are recognized, though.
Code:
root@pharpe:~# zpool status -x storage
cannot open 'storage': no such pool

root@pharpe:~# lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda        8:0    0 119.2G  0 disk
└─sda1     8:1    0 119.2G  0 part
sdb        8:16   0   7.3T  0 disk
├─sdb1     8:17   0   7.3T  0 part
└─sdb9     8:25   0     8M  0 part
sdc        8:32   0 447.1G  0 disk
├─sdc1     8:33   0  1007K  0 part
├─sdc2     8:34   0   512M  0 part
└─sdc3     8:35   0 446.6G  0 part
sdd        8:48   0   7.3T  0 disk
├─sdd1     8:49   0   7.3T  0 part
└─sdd9     8:57   0     8M  0 part
sde        8:64   0   7.3T  0 disk
├─sde1     8:65   0   7.3T  0 part
└─sde9     8:73   0     8M  0 part

Did I lose everything?
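One step not tried in the thread: a pool that no longer shows up in `zpool status` may simply not be imported, and `zpool import` scans devices for pool labels before anything is declared lost. A command sketch, printed rather than executed here since it needs the real disks attached:

```shell
# These are the commands one would run on the affected host (sketch only).
cmds=$(cat <<'EOF'
zpool import -d /dev/disk/by-id           # scan by-id devices for importable pools
zpool import -d /dev/disk/by-id storage   # import the pool if its labels are found
EOF
)
echo "$cmds"
```

If the first command lists `storage` at all, the labels are still readable and the data may be recoverable.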
 

guletz
Famous Member, Apr 19, 2017
Brasov, Romania
Hi,

I had a situation more or less like yours (a degraded raid50 pool, but without the "label is missing" message), involving 2 new NL-SAS HDDs (a disk showing UNAVAIL in zpool status, like yours).
- smartctl told me both disks were OK (as in your case), and I could also see all the partitions on both disks
- I rebooted several times
- I ran at least 3-4 smartctl long tests successfully, but the pool stayed degraded even after several scrubs (smartctl test, then scrub, and so on in that order)
- I destroyed and recreated the pool once, but after 4-8 hours the unavailable disk hit me again in zpool status ;)
- none of the scrubs reported any errors during the process
- then I took a 2-day pause and did nothing
- during all this time the pool hosted 2 VMs and received at least 4 replications from another node

Then I destroyed the pool a second time... recreated the same pool (moved the VM data back...), and surprise, surprise... no problem at all. All of this happened 2 weeks ago (PMX 7.x with all updates).

As a conclusion: a successful SMART test does not mean your HDD is OK, because it is basically a read-only test (a Read/Verify pass over the entire disk). My guess is that some sectors had problems and were remapped during that period with good sectors from the disk's reserve.

I have seen the same situation a few times in the past.

Good luck / Bafta!
 

pharpe
So I figured out what it was. After nuking my raidz, I ran extended tests on all the drives: no errors. I tried creating a new pool and starting over, but I kept getting an error with one of the drives. I switched it with another known good drive; same issue. So I decided to look in the BIOS for anything odd, and I found something in the SATA config. It was set to AHCI, but there was a separate line for SATA ports 4/5, and that was set to IDE. I changed it to "Use SATA" and rebooted. All disks are working perfectly now. Apparently that setting was overriding SATA ports 4 and 5 and forcing IDE instead of AHCI. It had been set this way for years, but I guess the new version from the Proxmox upgrade is not as forgiving about the mismatch.
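One way to verify a change like this took effect is to check which driver claimed each SATA host: on Linux, `/sys/class/scsi_host/host*/proc_name` reads `ahci` for ports running in AHCI mode. A simulated sketch (the /sys layout is mocked in a temp directory; `pata_acpi` is just one example of a non-AHCI driver name):

```shell
# Mock two SATA hosts: one claimed by ahci, one by a legacy IDE-mode driver.
tmp=$(mktemp -d)
mkdir -p "$tmp/host0" "$tmp/host4"
echo ahci      > "$tmp/host0/proc_name"
echo pata_acpi > "$tmp/host4/proc_name"   # hypothetical: a port left in IDE mode
report=$(for h in "$tmp"/host*; do
  printf '%s: %s\n' "${h##*/}" "$(cat "$h/proc_name")"
done)
echo "$report"
rm -rf "$tmp"
```

On a real box, any host not reporting `ahci` would point back at a BIOS SATA-mode setting like the one found here.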

What really sucks is that if I had figured that out first, I could have saved my data. While troubleshooting I tried swapping SATA ports and cables, ended up putting another of the raid drives on ports 4/5, and that took out the pool.
 

guletz
Quote:
"What really sucks is if I would have figured that out first than I could have saved my data. But when troubleshooting I tried swapping sata ports and cables I ended up putting another of the raid drives on 4/5 and that took out the container"
Hi,

This is the path for anyone! We all make mistakes, but that is how we start learning...! Next time I am sure you will not make the same mistake.

Good luck / Bafta!
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get your own in 60 seconds.

Buy now!