SSD state: DEGRADED

gusto

Today I found out that I have a problem with one SSD.
Is it possible to fix it, or do I need to replace the SSD?

Code:
zpool status

  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 6.36G in 00:03:11 with 0 errors on Tue Nov 23 07:53:33 2021
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  DEGRADED     0     0     0
          mirror-0                                             DEGRADED     0     0     0
            ata-Patriot_P200_256GB_AA000000000000000978-part3  FAULTED      0    44     1  too many errors
            ata-Patriot_P200_256GB_AA000000000000000025-part3  ONLINE       0     0     0
 
I've seen similar issues with a drive connected with a bad cable. Sometimes it would lose connection and lots of write errors would occur. When the connection got restored, ZFS would automatically resilver the drive as if it was just a little behind instead of broken, which makes sense.
Your resilvering did not encounter errors, so maybe doing a zpool clear rpool is enough. Run a zpool scrub rpool afterwards to check the drive. Maybe replace the cable, or at least disconnect and reconnect both ends of the cable if this happens again.
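
The clear-then-scrub sequence suggested above could be wrapped roughly like this. It is only a sketch: the function name clear_and_scrub is made up for illustration, and the pool name rpool is taken from the output in this thread.

```shell
# Sketch: reset the error counters, start a scrub, then show the pool status.
# POOL defaults to the pool name from this thread; change it for other pools.
POOL="${POOL:-rpool}"

clear_and_scrub() {
    pool="$1"
    zpool clear "$pool" || return 1    # reset the READ/WRITE/CKSUM counters
    zpool scrub "$pool" || return 1    # start the scrub (runs in the background)
    zpool status "$pool"               # show the current state and scrub progress
}

# Usage (as root): clear_and_scrub "$POOL"
```

The scrub keeps running after the function returns, so run zpool status again later to see the final result.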
 
I turned off the whole machine and swapped the SATA cables between /dev/sda and /dev/sdb.
Code:
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 2.12G in 00:00:09 with 0 errors on Wed Nov 24 06:49:11 2021
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            ata-Patriot_P200_256GB_AA000000000000000978-part3  ONLINE       0     0     1
            ata-Patriot_P200_256GB_AA000000000000000025-part3  ONLINE       0     0     0

errors: No known data errors

smartctl -a /dev/sda
smartctl -a /dev/sdb
 
Please check if there are recent errors in the Proxmox Syslog (or use journalctl on the command line).
ZFS resilvered one of your drives again, so it lost one of them or had issues with it. Did it do this and/or detect errors before or after you swapped the cables?
What were the results of zpool scrub rpool?
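
For the journalctl route, something like the following could narrow the kernel log down to disk-related messages. Treat it as a sketch: the helper name filter_disk_errors and the search pattern are assumptions, and the exact wording of the messages depends on your kernel and controller.

```shell
# Sketch: keep only log lines that look like ATA link or disk I/O problems.
# The pattern is an assumption; extend it to match what your kernel prints.
filter_disk_errors() {
    grep -E -i 'ata[0-9]+(\.[0-9]+)?:|I/O error|hard resetting link'
}

# Usage (as root), kernel messages since the first resilver in this thread:
#   journalctl -k --no-pager --since "2021-11-23" | filter_disk_errors
```

If errors keep following the same drive even after the cables were swapped, that points at the drive rather than the cable or port.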
 
After zpool scrub rpool
Code:
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Nov 24 12:11:11 2021
        26.4G scanned at 13.2G/s, 424K issued at 212K/s, 28.2G total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            ata-Patriot_P200_256GB_AA000000000000000978-part3  ONLINE       0     0     5
            ata-Patriot_P200_256GB_AA000000000000000025-part3  ONLINE       0     0     0

errors: No known data errors
 
Code:
  scan: scrub in progress since Wed Nov 24 12:11:11 2021
        26.4G scanned at 13.2G/s, 424K issued at 212K/s, 28.2G total
        0B repaired, 0.00% done, no estimated completion time
It is still in progress, so you'll have to wait for the result. As the number of problems keeps increasing with that particular drive, it looks more and more like that drive is failing.
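
To keep an eye on whether the counters keep growing, a small filter over the zpool status output might help. This is a sketch with a made-up function name (devices_with_errors); it assumes the counters are printed as plain integers, as in the output above.

```shell
# Sketch: list devices whose READ, WRITE or CKSUM counter is non-zero.
# Reads `zpool status` text on stdin; assumes plain integer counters.
devices_with_errors() {
    awk '$3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ &&
         ($3 + $4 + $5) > 0 { print $1, "read=" $3, "write=" $4, "cksum=" $5 }'
}

# Usage: zpool status rpool | devices_with_errors
```

Running it after each scrub and comparing the numbers shows whether the CKSUM count on the ...978 drive is still climbing.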
 
Code:
root@local-proxmox:~# zpool status rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:01:28 with 0 errors on Wed Nov 24 12:12:39 2021
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            ata-Patriot_P200_256GB_AA000000000000000978-part3  ONLINE       0     0     5
            ata-Patriot_P200_256GB_AA000000000000000025-part3  ONLINE       0     0     0

errors: No known data errors
 
Your SSDs have written about 64 TB, and the rated TBW is 160 TB. If the SSD isn't older than 3 years (and I guess it isn't, because they have only run for 1.2 years), you could send it in to get a replacement.
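
For reference, the share of the rated endurance already used works out as below. A sketch only: the function name tbw_used_percent is made up, and the two figures are the ones quoted in this post.

```shell
# Sketch: percentage of the rated endurance (TBW) already consumed.
# Arguments are whole terabytes; integer arithmetic rounds down.
tbw_used_percent() {
    written_tb="$1"
    rated_tb="$2"
    echo $(( written_tb * 100 / rated_tb ))
}

tbw_used_percent 64 160   # prints 40
```

So the drives are at roughly 40% of their rated writes, well short of the endurance limit.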
 