Again ZFS mirror degraded

drnicolas

Renowned Member
Dec 8, 2010
For the second time I have a problem with my SSD ZFS mirror.
Both NVMe SSDs are mounted on one of those PCIe cards with four NVMe/M.2 slots.

At first it worked for a couple of months.
Then the zpool was degraded, complaining that one of the SSDs was missing.

The presumably faulty SSD was replaced with a new one and the zpool was online again.

Now the same thing has happened again; this is the message:
Code:
Processing triggers for man-db (2.11.2-2) ...
root@pve:~# lsscsi -g
[0:0:0:0]    disk    ATA      Vi550 S3 SSD     61.3  /dev/sda   /dev/sg0
[6:0:0:0]    disk    ATA      WDC WD40EFPX-68C 0A81  /dev/sdb   /dev/sg1
[7:0:0:0]    disk    ATA      WDC WD40EFRX-68W 0A82  /dev/sdc   /dev/sg2
[8:0:0:0]    disk    ATA      WDC WD30EFRX-68E 0A80  /dev/sdd   /dev/sg3
[9:0:0:0]    disk    ATA      WDC WD30EFRX-68E 0A80  /dev/sde   /dev/sg4
[13:0:0:0]   process Marvell  Console          1.01  -          /dev/sg5
[N:0:8224:1] disk    WD_BLACK SN850X 2000GB__1                  /dev/nvme0n1  -       
[N:1:4:1]    disk    Samsung SSD 970 EVO 1TB__1                 /dev/nvme1n1  -       
root@pve:~# zpool status SSD
  pool: SSD
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:06:41 with 0 errors on Sun Nov 10 00:30:56 2024
config:

        NAME                                              STATE     READ WRITE CKSUM
        SSD                                               DEGRADED     0     0     0
          mirror-0                                        DEGRADED     0     0     0
            10734493458139189249                          FAULTED      0     0     0  was /dev/nvme1n1p1
            nvme-Samsung_SSD_970_EVO_1TB_S467NX0M908205Y  ONLINE       0     0     0

errors: No known data errors
root@pve:~# zpool replace SSD nvme0n1
invalid vdev specification
use '-f' to override the following errors:
/dev/nvme0n1p1 is part of active pool 'SSD'
root@pve:~# ^C
root@pve:~#


The WD Black is the new one.
To me it looks as if the WD Black is missing from the zpool. Nevertheless, the WD Black still seems to work, but it shows up as nvme0n1 and not nvme1n1p1.

What can I do?

Replacing did not work.
 
Never use /dev/somedisk; use /dev/disk/by-id/somedisk instead.
The order of /dev/somedisk may change upon reboot, and that will confuse ZFS.
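
A quick way to see which stable name belongs to which NVMe device is to list the by-id symlinks, for example:
Code:
ls -l /dev/disk/by-id/ | grep nvme

Each nvme-<model>_<serial> entry is a symlink to the current /dev/nvmeXn1 node, so the name stays the same even if the kernel enumerates the drives in a different order after a reboot.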
 
I was able to OFFLINE the WD Black.

This is the smartctl data:
Code:
root@pve:~# smartctl -a /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-3-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WD_BLACK SN850X 2000GB
Serial Number:                      24023A801465
Firmware Version:                   620361WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      8224
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b474dfc2e
Local Time is:                      Wed Dec  4 10:38:22 2024 CET
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     94 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W    9.00W       -    0  0  0  0        0       0
 1 +     6.00W    6.00W       -    0  0  0  0        0       0
 2 +     4.50W    4.50W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     5000   10000
 4 -   0.0050W       -        -    4  4  4  4     3900   45700

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    2,881,629 [1.47 TB]
Data Units Written:                 12,084,593 [6.18 TB]
Host Read Commands:                 91,278,444
Host Write Commands:                148,413,290
Controller Busy Time:               246
Power Cycles:                       4
Power On Hours:                     948
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

root@pve:~#

The drive seems to be available and healthy from my perspective.

After offlining, the pool status looks like this:
Code:
root@pve:~# zpool status SSD
  pool: SSD
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 00:06:41 with 0 errors on Sun Nov 10 00:30:56 2024
config:

        NAME                                              STATE     READ WRITE CKSUM
        SSD                                               DEGRADED     0     0     0
          mirror-0                                        DEGRADED     0     0     0
            10734493458139189249                          OFFLINE      0     0     0  was /dev/nvme1n1p1
            nvme-Samsung_SSD_970_EVO_1TB_S467NX0M908205Y  ONLINE       0     0     0

errors: No known data errors
root@pve:~#

How can I re-attach the WD Black to the pool?
 
I think you have two options:
a) Replace the WD Black again with zpool replace SSD 10734493458139189249 /dev/disk/by-id/yourdisk. This can be done without shutting down the VMs.

b) Export and re-import the pool with the -d argument as described here [1]; the same commands adapted to your pool name are shown after the link. Since you export the storage, it goes offline, so you can't do this while VMs are running on it.
Code:
zpool export storage
zpool import storage -d /dev/disk/by-id

[1] https://serverfault.com/questions/8...-in-a-zfs-pool-from-dev-sdx-to-dev-disk-by-id
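
For the pool in this thread the same sequence would look roughly like this (pool name SSD instead of storage); everything stored on it is unavailable between the export and the import:
Code:
zpool export SSD
zpool import SSD -d /dev/disk/by-id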
 
Pretty complicated. I have all of these:

Code:
root@pve:/dev/disk/by-id# ls nvme-WD* -l
lrwxrwxrwx 1 root root 13 Nov 23 15:22 nvme-WD_BLACK_SN850X_2000GB_24023A801465 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Nov 23 15:22 nvme-WD_BLACK_SN850X_2000GB_24023A801465_1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Nov 23 15:22 nvme-WD_BLACK_SN850X_2000GB_24023A801465_1-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Nov 23 15:22 nvme-WD_BLACK_SN850X_2000GB_24023A801465_1-part9 -> ../../nvme0n1p9
lrwxrwxrwx 1 root root 15 Nov 23 15:22 nvme-WD_BLACK_SN850X_2000GB_24023A801465-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Nov 23 15:22 nvme-WD_BLACK_SN850X_2000GB_24023A801465-part9 -> ../../nvme0n1p9
root@pve:/dev/disk/by-id# zpool replace SSD 10734493458139189249 dev/disk/by-id/nvme-WD_BLACK_SN850X_2000GB_24023A801465
cannot open 'dev/disk/by-id/nvme-WD_BLACK_SN850X_2000GB_24023A801465': no such device in /dev
must be a full path or shorthand device name
root@pve:/dev/disk/by-id# zpool replace SSD 10734493458139189249 dev/disk/by-id/nvme-WD_BLACK_SN850X_2000GB_24023A801465_1
cannot open 'dev/disk/by-id/nvme-WD_BLACK_SN850X_2000GB_24023A801465_1': no such device in /dev
must be a full path or shorthand device name

The first attempts did not work.
 
root@pve:/dev/disk/by-id# zpool replace SSD 10734493458139189249 dev/disk/by-id/nvme-WD_BLACK_SN850X_2000GB_24023A801465
Try zpool replace SSD 10734493458139189249 /dev/disk/by-id/nvme-WD_BLACK_SN850X_2000GB_24023A801465-part1 instead. The leading / and the trailing -part1 were missing. I'm guessing the -part1 because you appear to have partitioned the NVMe drive. You can also remove the partitions and use the whole drive (and then drop the -part1), but you always need the leading /.
This is not Proxmox-specific; other ZFS guides on the internet may give more detailed information on how to do this.
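
As a rough sketch, keeping the existing partition layout, the sequence would be:
Code:
zpool replace SSD 10734493458139189249 /dev/disk/by-id/nvme-WD_BLACK_SN850X_2000GB_24023A801465-part1
zpool status SSD

If ZFS refuses because the partition still carries the old pool label (as in the earlier attempt with /dev/nvme0n1), it will suggest -f to override. Once the resilver finishes, zpool status should show both mirror members ONLINE under their by-id names.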