One or more devices could not be used because the label is missing or invalid.

pharpe
Member, Jul 15, 2020
I have a single Proxmox server with a ZFS pool across 3 drives. Today I discovered that the pool is reporting an error.
Code:
root@pharpe:~# zpool status -x storage
  pool: storage
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub in progress since Sat Jun 18 08:54:35 2022
        1.44G scanned at 148M/s, 2.21M issued at 226K/s, 12.4T total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME                                   STATE     READ WRITE CKSUM
        storage                                DEGRADED     0     0     0
          raidz1-0                             DEGRADED     0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKRAPUC  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_1SGGK0DZ  ONLINE       0     0     0
            14087233867307650821               UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1

errors: No known data errors

When I run smartctl, the disk passes:
Code:
root@pharpe:~# smartctl -H /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.189-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

I've tried rebooting to see if the drive will resilver, but it doesn't. Could someone help point me in the right direction for my next troubleshooting step?
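One small thing worth noting (not raised in the thread): the smartctl command above was pointed at the partition path (`…-part1`), while SMART data belongs to the whole disk. A minimal sketch of stripping the partition suffix from a by-id path, so the whole-disk node can be queried instead:

```shell
# The -part1 suffix names a partition; SMART attributes live on the disk itself.
part=/dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1
disk=${part%-part*}    # shell parameter expansion: drop the trailing "-partN"
echo "$disk"
# One could then run, as root: smartctl -a "$disk"
```

`smartctl -a` (rather than just `-H`) would also show the raw attribute table, which is often more informative than the overall PASSED/FAILED verdict.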
 

pharpe
Quote:
"Hi,
another thread [1] mentions that this might be a bug. You can try the fix they describe there: [2] (but I would recommend using the /dev/disk/by-id path, since the /dev/sd* names are not stable :))

[1] https://forum.proxmox.com/threads/zfs-faulted-drive-looking-for-help.59417/post-294047
[2] https://github.com/openzfs/zfs/issues/2076#issuecomment-652773410"
Thanks. I saw that thread but my disks are already using disk/by-id, not /dev/sd*, so I don't think it was applicable.
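For anyone following along: the by-id names are just udev-managed symlinks pointing at the unstable /dev/sdX nodes, which is why they survive reboots and port swaps. A small self-contained illustration (the /dev layout is simulated in a temp directory; on a real system one would simply run `ls -l /dev/disk/by-id`):

```shell
# Simulate the /dev/disk/by-id layout to show how a stable name maps to a node.
tmp=$(mktemp -d)
mkdir -p "$tmp/by-id"
touch "$tmp/sdb"    # stand-in for the kernel device node /dev/sdb
ln -s ../sdb "$tmp/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC"
mapping=$(for link in "$tmp"/by-id/*; do
  # resolve each symlink to the node it currently points at
  printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
done)
echo "$mapping"
rm -rf "$tmp"
```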
 

pharpe
I was able to force it to resilver using:
Code:
zpool replace storage /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1
It resilvered with no errors but now I have this:

Code:
root@pharpe:~# zpool status -x storage
pool 'storage' is healthy

Code:
root@pharpe:~# zpool status -v storage
  pool: storage
 state: DEGRADED
  scan: resilvered 1.50T in 12:07:28 with 0 errors on Mon Jun 20 02:26:11 2022
config:

        NAME                                   STATE     READ WRITE CKSUM
        storage                                DEGRADED     0     0     0
          raidz1-0                             DEGRADED     0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKRAPUC  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_1SGGK0DZ  ONLINE       0     0     0
            replacing-2                        DEGRADED   121 61.1K     0
              14087233867307650821             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1/old
              16298796475004352883             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1

errors: No known data errors

So I thought maybe I needed to clear the errors, so I ran:

Code:
root@pharpe:~# zpool clear storage
root@pharpe:~# zpool status -x storage
pool 'storage' is healthy
root@pharpe:~# zpool status -v storage
  pool: storage
 state: DEGRADED
  scan: scrub in progress since Mon Jun 20 07:26:37 2022
        4.07G scanned at 298M/s, 2.17M issued at 159K/s, 12.4T total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME                                   STATE     READ WRITE CKSUM
        storage                                DEGRADED     0     0     0
          raidz1-0                             DEGRADED     0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_7JKRAPUC  ONLINE       0     0     0
            ata-WDC_WD80EMAZ-00WJTA0_1SGGK0DZ  ONLINE       0     0     0
            replacing-2                        DEGRADED     0 1.58K     0
              14087233867307650821             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1/old
              16298796475004352883             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7SGZJ0UC-part1

errors: No known data errors

Why would zpool status -x report the pool as healthy, while zpool status -v shows it as DEGRADED yet with no known data errors?
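It appears (not confirmed in this thread) that `-x` keys off data errors rather than vdev state in cases like this, so a scripted health check is safer parsing the `state:` field of the full output than trusting `zpool status -x`. A minimal sketch, with the relevant lines of the output above hard-coded as sample input:

```shell
# Sample of the zpool status output from this thread (abbreviated).
status='  pool: storage
 state: DEGRADED
errors: No known data errors'

# Pull out the fields a monitoring script should actually look at.
state=$(printf '%s\n' "$status" | awk '/^ *state:/ {print $2}')
errors=$(printf '%s\n' "$status" | awk -F': *' '/^errors:/ {print $2}')
echo "state=$state errors=$errors"
# → state=DEGRADED errors=No known data errors
```

Checking `state=DEGRADED` here would catch the problem even while `-x` claims the pool is healthy.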
 

pharpe
More info: I opened the case, swapped the SATA and power cables with another drive, and plugged it into a different SATA port on the motherboard. After rebooting, zpool status -v still shows the same degraded state.

Then I tried swapping the 7SGZJ0UC with another known good 8 TB drive, to see if I could resilver onto it as a replacement. However, without the 7SGZJ0UC drive in, I cannot see the pool at all. Makes me think it must have been working after all?
Code:
root@pharpe:~# zpool status -v storage
cannot open 'storage': no such pool

I then put the 7SGZJ0UC drive into a Windows 11 machine and ran the WD Diagnostics on it and it says it's healthy.

 

pharpe
I put everything back the way it originally was, but I still don't see the pool. The drives are recognized, though.
Code:
root@pharpe:~# zpool status -x storage
cannot open 'storage': no such pool

root@pharpe:~# lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda        8:0    0 119.2G  0 disk
└─sda1     8:1    0 119.2G  0 part
sdb        8:16   0   7.3T  0 disk
├─sdb1     8:17   0   7.3T  0 part
└─sdb9     8:25   0     8M  0 part
sdc        8:32   0 447.1G  0 disk
├─sdc1     8:33   0  1007K  0 part
├─sdc2     8:34   0   512M  0 part
└─sdc3     8:35   0 446.6G  0 part
sdd        8:48   0   7.3T  0 disk
├─sdd1     8:49   0   7.3T  0 part
└─sdd9     8:57   0     8M  0 part
sde        8:64   0   7.3T  0 disk
├─sde1     8:65   0   7.3T  0 part
└─sde9     8:73   0     8M  0 part

Did I lose everything?
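One step not tried in the thread: a pool that no longer shows up in `zpool status` may simply not be imported, and `zpool import` scans devices for pool labels before anything is declared lost. A command sketch, printed rather than executed here since it needs the real disks attached:

```shell
# These are the commands one would run on the affected host (sketch only).
cmds=$(cat <<'EOF'
zpool import -d /dev/disk/by-id           # scan by-id devices for importable pools
zpool import -d /dev/disk/by-id storage   # import the pool if its labels are found
EOF
)
echo "$cmds"
```

If the first command lists `storage` at all, the labels are still readable and the data may be recoverable.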
 

guletz
Famous Member, Apr 19, 2017
Brasov, Romania
Hi,

I had a situation more or less like yours (a degraded raid50 pool, but without the "label is missing" message), involving 2 new NL-SAS HDDs (a disk showing UNAVAIL in zpool status, like yours).
- smartctl told me both disks were OK (as in your case), and I could also see all the partitions on both disks
- I rebooted several times
- I ran at least 3-4 smartctl long tests successfully, but the pool stayed degraded even after several scrubs (smartctl test, then scrub, and so on in that order)
- I destroyed and recreated the pool once, but after 4-8 hours the unavailable disk hit me again in zpool status ;)
- none of the scrubs reported any errors during the process
- then I took a 2-day pause and did nothing
- during all this time the pool hosted 2 VMs and received at least 4 replications from another node

Then I destroyed the pool a second time... recreated the same pool (moved the VM data back...), and surprise, surprise... no problem at all. All of this happened 2 weeks ago (PMX 7.x with all updates).

As a conclusion: a successful SMART test does not mean your HDD is OK, because it is basically a read-only test (a Read/Verify pass over the entire disk). My guess is that some sectors had problems and were remapped during that period with good sectors from the disk's reserve.

I have seen the same situation a few times in the past.

Good luck / Bafta!
 

pharpe
So I figured out what it was. After nuking my raidz, I ran extended tests on all the drives: no errors. I tried creating a new pool and starting over, but I kept getting an error with one of the drives. I switched it with another known good drive; same issue. So I decided to look in the BIOS for anything odd, and I found something in the SATA config. It was set to AHCI, but there was a separate line for SATA ports 4/5, and that was set to IDE. I changed it to "Use SATA" and rebooted. All disks are working perfectly now. Apparently that setting was overriding SATA ports 4 and 5 and forcing IDE instead of AHCI. It had been set this way for years, but I guess the new version from the Proxmox upgrade is not as forgiving about the mismatch.
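One way to verify a change like this took effect is to check which driver claimed each SATA host: on Linux, `/sys/class/scsi_host/host*/proc_name` reads `ahci` for ports running in AHCI mode. A simulated sketch (the /sys layout is mocked in a temp directory; `pata_acpi` is just one example of a non-AHCI driver name):

```shell
# Mock two SATA hosts: one claimed by ahci, one by a legacy IDE-mode driver.
tmp=$(mktemp -d)
mkdir -p "$tmp/host0" "$tmp/host4"
echo ahci      > "$tmp/host0/proc_name"
echo pata_acpi > "$tmp/host4/proc_name"   # hypothetical: a port left in IDE mode
report=$(for h in "$tmp"/host*; do
  printf '%s: %s\n' "${h##*/}" "$(cat "$h/proc_name")"
done)
echo "$report"
rm -rf "$tmp"
```

On a real box, any host not reporting `ahci` would point back at a BIOS SATA-mode setting like the one found here.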

What really sucks is that if I had figured that out first, I could have saved my data. While troubleshooting I tried swapping SATA ports and cables, ended up putting another of the raid drives on ports 4/5, and that took out the pool.
 

guletz
Quote:
"What really sucks is if I would have figured that out first than I could have saved my data. But when troubleshooting I tried swapping sata ports and cables I ended up putting another of the raid drives on 4/5 and that took out the container"
Hi,

This is the path for anyone! We all make mistakes, but that is how we start learning...! Next time I am sure you will not make the same mistake.

Good luck / Bafta!
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get your own in 60 seconds.

Buy now!