ZFS FAULTED, but SMART check is OK: what to do?

fireon

Hello all,

I'm running PVE 6.1 here. For a few days now, one pool on one of my servers has been showing a "faulted" state.
Code:
pool: v-machines
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 0 days 05:29:54 with 0 errors on Sun Dec  8 05:53:58 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        v-machines                                      DEGRADED     0     0     0
          raidz1-0                                      ONLINE       0     0     0
            ata-WDC_WD20EARX-00ZUDB0_WD-WCC1H0343538    ONLINE       0     0     0
            ata-WDC_WD20EARX-00ZUDB0_WD-WCC1H0369015    ONLINE       0     0     0
            ata-WDC_WD20EZRX-32D8PB0_WD-WCC4N7UHY6VC    ONLINE       0     0     0
            ata-WDC_WD20EZRX-32SPEB0_WD-WCC4E7NTND7L    ONLINE       0     0     0
          raidz1-1                                      DEGRADED     0     0     0
            ata-WDC_WD20EURS-63S48Y0_WD-WMAZA9381012    FAULTED     16   167     0  too many errors
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA2270223    ONLINE       0     0     0
            ata-WDC_WD2000F9YZ-09N20L1_WD-WCC1P1090743  ONLINE       0     0     0
            ata-WDC_WD2000F9YZ-09N20L1_WD-WMC1P0382195  ONLINE       0     0     0

errors: No known data errors
So I ran an extended SMART self-test, but that test came back OK.
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     51222         -
# 2  Short offline       Completed without error       00%     30142         -
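For anyone who wants to pick out the flagged rows from a status table like the one above without reading it by eye, here is a minimal Python sketch. It assumes the text of 'zpool status' has already been captured into a string (the excerpt below is copied from the output posted above), and the function name suspect_devices is only illustrative. It also assumes plain integer counters; some ZFS versions abbreviate large counts (for example with a K suffix), and this sketch would skip such rows.

```python
import re

# Excerpt of the 'zpool status' config table posted above.
STATUS = """\
        NAME                                            STATE     READ WRITE CKSUM
        v-machines                                      DEGRADED     0     0     0
          raidz1-0                                      ONLINE       0     0     0
            ata-WDC_WD20EARX-00ZUDB0_WD-WCC1H0343538    ONLINE       0     0     0
          raidz1-1                                      DEGRADED     0     0     0
            ata-WDC_WD20EURS-63S48Y0_WD-WMAZA9381012    FAULTED     16   167     0  too many errors
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA2270223    ONLINE       0     0     0
"""

# name, state, then the three integer counters READ / WRITE / CKSUM
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def suspect_devices(status_text):
    """Return rows that are not ONLINE or that have a non-zero error counter."""
    suspects = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header line or anything that is not a table row
        name, state = m.group(1), m.group(2)
        read, write, cksum = (int(m.group(i)) for i in (3, 4, 5))
        if state != "ONLINE" or read or write or cksum:
            suspects.append((name, state, read, write, cksum))
    return suspects


if __name__ == "__main__":
    for row in suspect_devices(STATUS):
        print(row)
```

Note that the pool and the raidz1-1 vdev are listed too, because they are DEGRADED; the leaf device with the non-zero READ/WRITE counters is the one to look at.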
 
I also had this, but with ZFS on a USB drive used as backup on 6.1. I've heard that a USB connection is not as reliable as SATA or SAS.
 
Hi,

A successful SMART self-test is no guarantee that your disk is OK. ZFS errors can also be caused by other bad hardware (SATA cables, PSU, RAM). Try swapping the SATA port of the faulty HDD with the port of a known-good HDD.
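Besides the self-test log, the raw SMART attribute counters can give a hint about whether the medium or the link is at fault. Below is a minimal Python sketch that scans text in the usual 'smartctl -A' table layout for a few attributes that are commonly watched. The sample values are invented for illustration and are NOT from the disk in this thread; the function name nonzero_watched and the choice of watched attributes are my own assumptions, and not every drive reports all of them.

```python
# Attributes that often hint at media trouble (5, 197, 198) or at
# cabling/link trouble (199). This selection is an assumption, not a rule.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# Invented sample in the layout printed by 'smartctl -A' (illustration only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12345
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12
"""


def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    hits = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with the numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                hits[fields[1]] = int(raw)
    return hits


if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))
```

A rising CRC error count with otherwise clean media counters would point more towards the cable, backplane or controller than towards the platters, but treat that as a rule of thumb only.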

Good luck / Bafta
 
I'd also monitor 'zpool iostat -lvy 10' while the system is under I/O load, to see whether that disk is noticeably slower than the others. And think about ordering a replacement disk ;)
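If you have noted down a per-disk latency figure from such an iostat run, a simple way to judge "noticeably slower" is to compare each disk against the median of all disks. A minimal Python sketch of that idea follows; the disk names and millisecond values are hypothetical placeholders, the function name slow_disks and the factor of 3 are arbitrary assumptions, and the sketch does not parse the iostat output itself.

```python
from statistics import median


def slow_disks(latency_ms, factor=3.0):
    """Return the disks whose latency exceeds factor x the median of all disks."""
    mid = median(latency_ms.values())
    if mid <= 0:
        return []  # nothing meaningful to compare against
    return sorted(d for d, ms in latency_ms.items() if ms > factor * mid)


if __name__ == "__main__":
    # Hypothetical per-disk wait times in milliseconds (not from this thread).
    observed = {"disk-a": 12.0, "disk-b": 14.0, "disk-c": 11.0, "disk-d": 95.0}
    print(slow_disks(observed))
```

Using the median rather than the mean keeps one very slow disk from dragging the reference value up with it.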
 
Hi,

With every WD disk I have owned, SMART never showed any problem, even in cases where the HDD was obviously almost dead. The only exception was a WD Raptor.

I have also seen cases with Seagate disks where a small number of blocks (as in your case) were unreadable, but after a week or so they had been replaced with good blocks from the HDD's reserve blocks (mostly with new HDDs, in my initial tests).

The last case was a one-year-old system that had run without any problems and then started to show a few scrub errors. There the cause was a faulty SATA cable (since that incident I only use SATA cables with metal clips). For the same reason, once a year I clean the contacts with alcohol (cable connectors, HDD, and the SATA ports on the motherboard)! My country/town is very humid, like your country :)

Good luck / Bafta!
 
No SATA cables are used here: the disks sit on a backplane connected to 2 LSI HBA controllers with SAS cables. The next scrub will be interesting.