SSD state: DEGRADED

gusto · Nov 23, 2021

Today I found out that I have a problem with one SSD.
Is it possible to fix it or do I need to replace the SSD?

Code:

zpool status

  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 6.36G in 00:03:11 with 0 errors on Tue Nov 23 07:53:33 2021
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  DEGRADED     0     0     0
          mirror-0                                             DEGRADED     0     0     0
            ata-Patriot_P200_256GB_AA000000000000000978-part3  FAULTED      0    44     1  too many errors
            ata-Patriot_P200_256GB_AA000000000000000025-part3  ONLINE       0     0     0

leesteken · Nov 23, 2021

I've seen similar issues with a drive connected through a bad cable. Sometimes it would lose the connection and lots of write errors would occur. When the connection was restored, ZFS would automatically resilver the drive as if it were just a little behind instead of broken, which makes sense.
Your resilver did not encounter errors, so maybe running zpool clear rpool is enough. Run zpool scrub rpool afterwards to check the drive. If this happens again, replace the cable, or at least disconnect and reconnect both ends of it.
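The clear-then-scrub sequence described above can be wrapped in a small shell function. This is a minimal sketch, assuming OpenZFS 2.0 or later (for zpool scrub -w, which waits until the scrub has finished) and that it runs as root; the function name clear_and_scrub is hypothetical.

```shell
# Hypothetical helper: clear a pool's error state, scrub it, then show the result.
# Assumes OpenZFS 2.0+ (for "zpool scrub -w") and root privileges.
clear_and_scrub() {
    pool="${1:-rpool}"    # defaults to the pool name used in this thread
    # Reset the READ/WRITE/CKSUM counters and mark the faulted device repaired.
    zpool clear "$pool" || return 1
    # Re-read and verify every block; -w blocks until the scrub completes.
    zpool scrub -w "$pool" || return 1
    # The "scan:" line and the CKSUM column now reflect the fresh scrub.
    zpool status -v "$pool"
}
```

Usage: clear_and_scrub rpool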

Dunuin · Nov 23, 2021

You should also run a SMART test and check whether the SMART attributes look fine.
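A short SMART self-test followed by a look at the health verdict, the attributes and the self-test log could be scripted roughly as below. This sketch assumes smartmontools is installed; the function name smart_check and the SMART_WAIT variable (default 120 seconds) are hypothetical, and the actual duration of a short test is whatever smartctl reports for the drive when the test starts.

```shell
# Hypothetical helper: run a short SMART self-test, then print the results.
smart_check() {
    dev="${1:?usage: smart_check /dev/sdX}"
    # Start a short self-test; the drive runs it internally in the background.
    # smartctl's exit status is a bit mask that can be non-zero even when the
    # command itself worked, so it is deliberately not treated as fatal here.
    smartctl -t short "$dev"
    # Assumption: 120 s is enough; smartctl prints the drive's own estimate.
    sleep "${SMART_WAIT:-120}"
    # Overall health verdict, vendor attributes and the self-test log.
    smartctl -H -A -l selftest "$dev"
}
```

Usage: smart_check /dev/sda, then smart_check /dev/sdb, and compare the two outputs.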

gusto · Nov 24, 2021

I turned off the whole machine and swapped the SATA cables between /dev/sda and /dev/sdb.

Code:

  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 2.12G in 00:00:09 with 0 errors on Wed Nov 24 06:49:11 2021
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            ata-Patriot_P200_256GB_AA000000000000000978-part3  ONLINE       0     0     1
            ata-Patriot_P200_256GB_AA000000000000000025-part3  ONLINE       0     0     0

errors: No known data errors

smartctl -a /dev/sda
smartctl -a /dev/sdb
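zpool status lists the drives by their serial-based /dev/disk/by-id names, while smartctl above was run against /dev/sda and /dev/sdb, and those letters are not guaranteed to stay with the same drive after cables are swapped. A minimal sketch for checking which kernel device a by-id name currently points to, assuming GNU readlink; byid_to_dev and the BYID_DIR override are hypothetical names.

```shell
# Hypothetical helper: resolve a /dev/disk/by-id name (as shown by zpool status)
# to the kernel device it currently points to, e.g. /dev/sda3 for a "-part3" link.
byid_to_dev() {
    # BYID_DIR is a hypothetical override; the default is the real udev directory.
    # readlink -f follows the symlink chain to its final target.
    readlink -f "${BYID_DIR:-/dev/disk/by-id}/$1"
}
```

Usage: byid_to_dev ata-Patriot_P200_256GB_AA000000000000000978-part3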

leesteken · Nov 24, 2021

Please check whether there are recent errors in the Proxmox Syslog (or use journalctl on the command line).
ZFS resilvered one of your drives again, so it lost contact with, or had issues with, one of them. Did it do this and/or detect errors before or after you swapped the cables?
What were the results of zpool scrub rpool?
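To narrow the Syslog down to disk-related kernel messages, the journal can be filtered for typical libata and block-layer error strings. The pattern list below is an assumption (common messages, not an exhaustive list), and disk_errors is a hypothetical name.

```shell
# Hypothetical filter: keep only lines that look like SATA link resets,
# failed ATA commands or block-layer I/O errors.
disk_errors() {
    grep -Ei 'I/O error|hard resetting link|SATA link down|failed command|exception Emask'
}
```

Usage: journalctl -k | disk_errors for kernel messages of the current boot, or journalctl -k -b -1 | disk_errors for the previous boot. ZFS's own record of the errors is available from zpool events -v.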

gusto · Nov 24, 2021

Here is the journalctl result.

Code:

zpool scrub rpool

no result

leesteken · Nov 24, 2021

gusto said:
Code:

zpool scrub rpool

no result

zpool scrub never prints anything by itself; the scrub's progress and result show up in zpool status, both while it runs and after it finishes. What does zpool status rpool show after running zpool scrub rpool?
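Since the scrub only reports through zpool status, its state can be read from the "scan:" line. A minimal awk sketch, assuming the wording seen in this thread ("scrub in progress", "scrub repaired"); scrub_state is a hypothetical name, and anything else (such as a resilver) is reported as "other".

```shell
# Hypothetical parser: read "zpool status" output on stdin and print
# running, finished, other, or none (no "scan:" line at all).
scrub_state() {
    awk '
        /^[ \t]*scan:/ {
            if ($0 ~ /scrub in progress/)   state = "running"
            else if ($0 ~ /scrub repaired/) state = "finished"
            else                            state = "other"
        }
        END { print (state == "" ? "none" : state) }
    '
}
```

Usage: zpool status rpool | scrub_state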

gusto · Nov 24, 2021

After zpool scrub rpool

Code:

  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Nov 24 12:11:11 2021
        26.4G scanned at 13.2G/s, 424K issued at 212K/s, 28.2G total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            ata-Patriot_P200_256GB_AA000000000000000978-part3  ONLINE       0     0     5
            ata-Patriot_P200_256GB_AA000000000000000025-part3  ONLINE       0     0     0

errors: No known data errors

leesteken · Nov 24, 2021

gusto said:

Code:

  scan: scrub in progress since Wed Nov 24 12:11:11 2021
        26.4G scanned at 13.2G/s, 424K issued at 212K/s, 28.2G total
        0B repaired, 0.00% done, no estimated completion time

It is still in progress, so you'll have to wait for the result. As the number of problems keeps increasing with that particular drive, it looks more and more like that drive is failing.
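To follow whether one drive's counters keep growing between checks, the device lines of zpool status can be reduced to those with a non-zero READ, WRITE or CKSUM counter. A minimal awk sketch, assuming the column layout shown in this thread (NAME STATE READ WRITE CKSUM, with the device list ending at a blank line); nonzero_errors is a hypothetical name.

```shell
# Hypothetical filter: read "zpool status" output on stdin and print every
# vdev whose READ, WRITE or CKSUM counter is not "0".
nonzero_errors() {
    awk '
        /^[ \t]*NAME[ \t]+STATE/ { in_cfg = 1; next }   # header of the device table
        in_cfg && NF == 0        { in_cfg = 0 }         # blank line ends the table
        in_cfg && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "READ=" $3, "WRITE=" $4, "CKSUM=" $5
        }
    '
}
```

Usage: zpool status rpool | nonzero_errors. For the status shown above it would print only the ...978-part3 device with CKSUM=5.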

gusto · Nov 24, 2021

Code:

root@local-proxmox:~# zpool status rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:01:28 with 0 errors on Wed Nov 24 12:12:39 2021
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            ata-Patriot_P200_256GB_AA000000000000000978-part3  ONLINE       0     0     5
            ata-Patriot_P200_256GB_AA000000000000000025-part3  ONLINE       0     0     0

errors: No known data errors

Dunuin · Nov 24, 2021

Your SSDs have written about 64 TB and are rated for 160 TBW. If the SSD isn't older than 3 years (and I guess it isn't, because they have only been running for about 1.2 years), you could send it in to get a replacement.
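The endurance figures above can be turned into a rough estimate. This sketch assumes the average write rate so far stays constant, which is a simplification, and the TBW rating is an endurance/warranty figure rather than a guaranteed point of failure; tbw_estimate is a hypothetical name.

```shell
# Hypothetical helper: percent of rated endurance used, average write rate,
# and years until the rating is reached if that rate stays constant.
tbw_estimate() {
    awk -v w="$1" -v r="$2" -v y="$3" 'BEGIN {
        used = 100 * w / r        # percent of the TBW rating consumed
        rate = w / y              # TB written per year so far
        left = (r - w) / rate     # years left at that linear rate
        printf "used %.0f%%, %.1f TB/year, %.1f years left\n", used, rate, left
    }'
}
```

With the numbers from this thread, tbw_estimate 64 160 1.2 prints: used 40%, 53.3 TB/year, 1.8 years left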
