[SOLVED] ZFS pool DEGRADED - Disk not showing up? Is it dead?

lmm5247 · Jul 11, 2023

I'm running Proxmox on a DeskMini H470 with the following disk layout:

1x Samsung 970 PRO 512GB (for Proxmox installation)
2x Intel D3-S4510 960GB (for ZFS mirror - these are "enterprise" drives with less than 8000 power on hours)

The other day, Proxmox was completely hung and I was unable to SSH into it or get it to respond. All of my VMs were also unresponsive, all of their services were offline, and I was unable to ping any of them. So, I powered off my Proxmox box manually. Now, I'm noticing my zpool is degraded.

Code:

root@proxmox02:~# zpool status -P
  pool: intel_mirror
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:36:53 with 0 errors on Sun Jul  9 01:00:54 2023
config:

        NAME                                                                  STATE     READ WRITE CKSUM
        intel_mirror                                                          DEGRADED     0     0     0
          mirror-0                                                            DEGRADED     0     0     0
            3372824492195189717                                               UNAVAIL      0     0     0  was /dev/disk/by-id/ata-INTEL_SSDSC2KB960G8_BTYF2060050M960CGN-part1
            /dev/disk/by-id/ata-INTEL_SSDSC2KB960G8_BTYF206005VJ960CGN-part1  ONLINE       0     0     0

What worries me is that my missing drive isn't being seen by lsblk.

Code:

root@proxmox02:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                    8:0    0 894.3G  0 disk
├─sda1                 8:1    0 894.2G  0 part
└─sda9                 8:9    0     8M  0 part
zd0                  230:0    0     8G  0 disk
├─zd0p1              230:1    0     7G  0 part
├─zd0p2              230:2    0     1K  0 part
└─zd0p5              230:5    0   975M  0 part
zd16                 230:16   0     1M  0 disk
zd32                 230:32   0   120G  0 disk
└─zd32p1             230:33   0   120G  0 part
zd48                 230:48   0    75G  0 disk
├─zd48p1             230:49   0    74G  0 part
├─zd48p2             230:50   0     1K  0 part
└─zd48p5             230:53   0   975M  0 part
zd64                 230:64   0    16G  0 disk
├─zd64p1             230:65   0    32M  0 part
├─zd64p2             230:66   0    24M  0 part
├─zd64p3             230:67   0   256M  0 part
├─zd64p4             230:68   0    24M  0 part
├─zd64p5             230:69   0   256M  0 part
├─zd64p6             230:70   0     8M  0 part
├─zd64p7             230:71   0    96M  0 part
└─zd64p8             230:72   0  15.3G  0 part
zd80                 230:80   0     8G  0 disk
├─zd80p1             230:81   0     7G  0 part
├─zd80p2             230:82   0     1K  0 part
└─zd80p5             230:85   0   975M  0 part
zd96                 230:96   0    55G  0 disk
├─zd96p1             230:97   0   549M  0 part
└─zd96p2             230:98   0  54.5G  0 part
nvme0n1              259:0    0 476.9G  0 disk
├─nvme0n1p1          259:1    0  1007K  0 part
├─nvme0n1p2          259:2    0   512M  0 part /boot/efi
└─nvme0n1p3          259:3    0 476.4G  0 part
  ├─pve-swap         253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root         253:1    0    96G  0 lvm  /
  ├─pve-data_tmeta   253:2    0   3.6G  0 lvm
  │ └─pve-data-tpool 253:4    0 349.3G  0 lvm
  │   └─pve-data     253:5    0 349.3G  1 lvm
  └─pve-data_tdata   253:3    0 349.3G  0 lvm
    └─pve-data-tpool 253:4    0 349.3G  0 lvm
      └─pve-data     253:5    0 349.3G  1 lvm

I'm also not seeing sdb in dmesg. Am I in trouble? I'm guessing this drive is dead?

Code:

root@proxmox02:~# dmesg | grep -E "sda|sdb"
[    1.941969] sd 3:0:0:0: [sda] 1875385008 512-byte logical blocks: (960 GB/894 GiB)
[    1.941972] sd 3:0:0:0: [sda] 4096-byte physical blocks
[    1.941980] sd 3:0:0:0: [sda] Write Protect is off
[    1.941981] sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    1.942041] sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.942083] sd 3:0:0:0: [sda] Preferred minimum I/O size 4096 bytes
[    1.958554]  sda: sda1 sda9
[    1.958641] sd 3:0:0:0: [sda] Attached SCSI disk

lmm5247 · Jul 11, 2023

Crap, it's NOT being seen by the BIOS either...

lmm5247 · Jul 11, 2023

Ok, I re-seated the SATA cable and it's back online but with checksum errors...

Code:

root@proxmox02:~# zpool status -x
  pool: intel_mirror
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 11.6G in 00:00:44 with 0 errors on Tue Jul 11 17:05:29 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        intel_mirror                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_BTYF2060050M960CGN  ONLINE       0     0     3
            ata-INTEL_SSDSC2KB960G8_BTYF206005VJ960CGN  ONLINE       0     0     0

I ran zpool clear intel_mirror and just started /usr/lib/zfs-linux/scrub...

It ran for about 20 minutes and I guess I'm OK now? I'm contacting ASRock about a replacement cable...

Code:

root@proxmox02:~# zpool status -v
  pool: intel_mirror
 state: ONLINE
  scan: scrub repaired 0B in 00:16:37 with 0 errors on Tue Jul 11 17:29:01 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        intel_mirror                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_BTYF2060050M960CGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_BTYF206005VJ960CGN  ONLINE       0     0     0
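
For anyone repeating the clear-and-scrub steps above, here is a rough sketch of a check for leftover checksum errors afterwards. The helper name `cksum_errors` is made up for illustration, and the pool name is the one from this thread. It only parses `zpool status` text from stdin, so treat it as a convenience and not an official ZFS tool.

```shell
#!/bin/sh
# Print "<vdev> <count>" for every vdev whose CKSUM column is non-zero.
# Reads `zpool status` output on stdin (hypothetical helper, not part of ZFS).
cksum_errors() {
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ &&
         $5 ~ /^[0-9.]+[KMGT]?$/ && $5 != "0" { print $1, $5 }'
}

# Typical sequence on the host (pool name taken from this thread):
#   zpool clear intel_mirror
#   zpool scrub intel_mirror
#   zpool status intel_mirror | cksum_errors
```

No output after the scrub finishes means every device shows 0 in the CKSUM column.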

Maximiliano · Jul 12, 2023

Hello @lmm5247, I would advise running a SMART test on both devices to get an idea of whether there are further issues. Also, don't forget to mark your issue as resolved.
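
A minimal sketch of how the SMART tests could be started with `smartctl` from smartmontools. The wrapper name `start_smart_tests` and the `DRY_RUN` switch are illustrative assumptions, and the device paths in the comments are only examples, so check `lsblk` for the real ones.

```shell
#!/bin/sh
# Start a long (extended) SMART self-test on each device given as an argument.
# With DRY_RUN=1 the commands are only printed, not executed.
start_smart_tests() {
    for dev in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "smartctl -t long $dev"
        else
            smartctl -t long "$dev"
        fi
    done
}

# Example (device paths are assumptions):
#   start_smart_tests /dev/sda /dev/sdb
# Once the tests have finished, the full report including the self-test log:
#   smartctl -a /dev/sda
```

The self-test runs inside the drive in the background, so the results only show up in the report after it completes.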

lmm5247 · Jul 13, 2023

@Maximiliano, thanks! I'm seeing some SMART errors on the drive that was giving me problems (ASRock is shipping new cables). The drive has been powered on for 7500 hours. Are these errors (which occurred, for example, at 6 hours) measured from the last power-on, or from the initial power-on (7500 hours ago)?

Code:

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       7523
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       14
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2530 (17 65535)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       57
184 End-to-End_Error_Count  0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Drive_Temperature       0x0022   061   057   000    Old_age   Always       -       39 (Min/Max 20/46)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       39
197 Pending_Sector_Count    0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       12
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1827461
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       1648
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       15
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       451304
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       0
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
235 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2530 (17 65535)
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1827461
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       335561
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       6061325

SMART Error Log Version: 1
ATA Error Count: 23 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 23 occurred at disk power-on lifetime: 6 hours (0 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 46 00 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 00      00:00:33.373  SET FEATURES [Set transfer mode]
  ef c3 01 00 00 00 a0 00      00:00:33.373  SET FEATURES [Sense Data Reporting]
  ec 00 00 00 00 00 a0 00      00:00:33.373  IDENTIFY DEVICE

Error 22 occurred at disk power-on lifetime: 6 hours (0 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 00      00:00:31.867  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:00:31.867  SET FEATURES [Set transfer mode]
  ef c3 01 00 00 00 a0 00      00:00:31.867  SET FEATURES [Sense Data Reporting]
  ec 00 00 00 00 00 a0 00      00:00:31.867  IDENTIFY DEVICE

Error 21 occurred at disk power-on lifetime: 206 hours (8 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 a0 00  27d+08:02:30.615  FLUSH CACHE EXT
  b0 d5 01 01 4f c2 00 00  27d+07:58:47.491  SMART READ LOG
  b0 d5 01 06 4f c2 00 00  27d+07:58:47.491  SMART READ LOG
  b0 d0 01 00 4f c2 00 00  27d+07:58:47.491  SMART READ DATA
  b0 da 00 00 4f c2 00 00  27d+07:58:47.490  SMART RETURN STATUS

Error 20 occurred at disk power-on lifetime: 206 hours (8 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 a0 00  27d+08:02:30.615  FLUSH CACHE EXT
  b0 d5 01 01 4f c2 00 00  27d+07:58:47.491  SMART READ LOG
  b0 d5 01 06 4f c2 00 00  27d+07:58:47.491  SMART READ LOG
  b0 d0 01 00 4f c2 00 00  27d+07:58:47.491  SMART READ DATA
  b0 da 00 00 4f c2 00 00  27d+07:58:47.490  SMART RETURN STATUS

Error 19 occurred at disk power-on lifetime: 206 hours (8 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 a0 00  27d+08:02:30.615  FLUSH CACHE EXT
  b0 d5 01 01 4f c2 00 00  27d+07:58:47.491  SMART READ LOG
  b0 d5 01 06 4f c2 00 00  27d+07:58:47.491  SMART READ LOG
  b0 d0 01 00 4f c2 00 00  27d+07:58:47.491  SMART READ DATA
  b0 da 00 00 4f c2 00 00  27d+07:58:47.490  SMART RETURN STATUS

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Maximiliano · Jul 14, 2023

Code:

199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       12

might indicate issues with the drive. Note that, in general, SMART output is very vendor-specific.

Regarding the errors themselves: for example, "Error 19 occurred at disk power-on lifetime: 206 hours (8 days + 14 hours)" happened 206 hours after the last cold boot of the machine, not (7500 - 206) = 7294 hours ago.
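
One way to see whether the counters from the output above keep growing after the cable is swapped is to pull just those rows out of the attribute table and compare them over time. This is only a sketch: the helper name `link_counters` is hypothetical, and attribute IDs 183 and 199 are taken from the table posted in this thread, so they may differ on other drives.

```shell
#!/bin/sh
# Print "<attribute name>=<raw value>" for SMART attributes 183 and 199.
# Reads the attribute table (as printed by `smartctl -A`) on stdin.
# Column 10 is the first token of RAW_VALUE in that table layout.
link_counters() {
    awk '$1 == 183 || $1 == 199 { print $2 "=" $10 }'
}

# Example (device path is an assumption):
#   smartctl -A /dev/sda | link_counters
```

If the numbers stay the same over a few days of normal use, no new errors of that kind have been logged since the last check.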
