Unclear error: ZFS/SMART/SATA 6.0 Gbps/UDMA/133

tony blue · Nov 14, 2021

Hello,

I'm at a loss as to what is wrong with my Proxmox server, which has been running reliably for quite some time, and how I can fix it.

Short version: ZFS and SMART are reporting errors, and all hard disks are now attached only as UDMA/133.

This morning at 5:58: ZFS device fault for pool

Code:

The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 154391
  class: statechange
  state: FAULTED
   host: virtualhost
   time: 2021-11-14 05:58:11+0100
  vpath: /dev/sdc2
  vguid: 0xC338A55969D3184F
   pool: 0x174BC0321B6273A6

Code:

:~# zpool status -x
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Sun Nov 14 00:24:02 2021
        9.88T scanned at 360M/s, 9.00T issued at 328M/s, 15.6T total
        912K repaired, 57.77% done, 05:50:27 to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
            sdc2    FAULTED     17     0     0  too many errors  (repairing)

errors: No known data errors
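A minimal Python sketch of how the config table in the `zpool status` output above could be screened for rows that are not healthy. The names `ROW` and `suspect_devices` are made up for this illustration, and the regular expression assumes plain integer counters as in this output; abbreviated counters such as `1.2K` would not match.

```python
import re

# Config table copied from the `zpool status -x` output in this post.
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
            sdc2    FAULTED     17     0     0  too many errors  (repairing)
"""

# name, state, then the READ / WRITE / CKSUM counters (hypothetical helper pattern)
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def suspect_devices(config_text):
    """Return rows whose state is not ONLINE or whose error counters are non-zero."""
    suspects = []
    for line in config_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header line or blank line
        name, state = m.group(1), m.group(2)
        read, write, cksum = (int(m.group(i)) for i in (3, 4, 5))
        if state != "ONLINE" or read or write or cksum:
            suspects.append((name, state, read, write, cksum))
    return suspects


if __name__ == "__main__":
    for row in suspect_devices(SAMPLE):
        print(row)
```

On this sample the pool and the raidz1 vdev show up only because they inherit the DEGRADED state; the single leaf device with read errors is sdc2.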

Then, at 6:18: SMART error (ErrorCount) detected on host:

Code:

This message was generated by the smartd daemon running on:

   host name:  virtualhost
   DNS domain: duck

The following warning/error was logged by the smartd daemon:

Device: /dev/sdc [SAT], ATA error count increased from 0 to 1

Device info:
ST8000NM0055-1RM112, S/N:ZA19V8QR, WWN:5-000c50-0af629d42, FW:SN05, 8.00 TB

The corresponding entries in /var/log/syslog are:

Code:

Nov 14 06:18:37 virtualhost smartd[2680]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 69
Nov 14 06:18:37 virtualhost smartd[2680]: Device: /dev/sdc [SAT], SMART Usage Attribute: 187 Reported_Uncorrect changed from 100 to 99
Nov 14 06:18:37 virtualhost smartd[2680]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 62
Nov 14 06:18:37 virtualhost smartd[2680]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 38
Nov 14 06:18:37 virtualhost smartd[2680]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 78
Nov 14 06:18:37 virtualhost smartd[2680]: Device: /dev/sdc [SAT], ATA error count increased from 0 to 1
Nov 14 06:18:37 virtualhost smartd[2680]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
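The smartd lines above report changes in normalized attribute values, not raw counts. As a rough sketch, such lines could be extracted and ranked by how far each value dropped. The helper names `attribute_changes` and `biggest_drops` are hypothetical, and the pattern is only fitted to the line format shown in this log.

```python
import re

# Two attribute lines and one non-attribute line copied from the syslog excerpt.
SAMPLE = """\
Nov 14 06:18:37 virtualhost smartd[2680]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 69
Nov 14 06:18:37 virtualhost smartd[2680]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 78
Nov 14 06:18:37 virtualhost smartd[2680]: Device: /dev/sdc [SAT], ATA error count increased from 0 to 1
"""

CHANGE = re.compile(
    r"Device: (\S+) \[\w+\], SMART (Prefailure|Usage) Attribute: "
    r"(\d+) (\S+) changed from (\d+) to (\d+)"
)


def attribute_changes(log_text):
    """Return (device, kind, id, name, old, new) for every attribute change line."""
    changes = []
    for line in log_text.splitlines():
        m = CHANGE.search(line)
        if m:
            dev, kind, attr_id, name, old, new = m.groups()
            changes.append((dev, kind, int(attr_id), name, int(old), int(new)))
    return changes


def biggest_drops(changes):
    """Sort changes by how much the normalized value fell (largest drop first)."""
    return sorted(changes, key=lambda c: c[4] - c[5], reverse=True)


if __name__ == "__main__":
    for change in biggest_drops(attribute_changes(SAMPLE)):
        print(change)
```

Temperature attributes move in both directions during normal operation, so a ranking like this is only a starting point for reading the log.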

I then started a new short SMART self-test: smartctl -t short /dev/sdc

Code:

smartctl -l selftest /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.22-5-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12392         -
# 2  Extended offline    Completed without error       00%        40         -

My guess was that the disk /dev/sdc is defective and needs to be replaced. I then plugged in a replacement disk while the system was running and intended to start the resilvering. When plugging it in, I noticed that the disk only runs in UDMA/133.

Code:

[1860674.617985] ata4: link is slow to respond, please be patient (ready=0)
[1860678.486019] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[1860678.488885] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.PRT3._GTF.DSSP], AE_NOT_FOUND (20201113/psargs-330)
[1860678.488921] No Local Variables are initialized for Method [_GTF]
[1860678.488923] No Arguments are initialized for method [_GTF]
[1860678.488924] ACPI Error: Aborting method \_SB.PCI0.SAT0.PRT3._GTF due to previous error (AE_NOT_FOUND) (20201113/psparse-529)
[1860678.489451] ata4.00: ATA-10: ST8000NM0055-1RM112, SN05, max UDMA/133
[1860678.489453] ata4.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 32), AA
[1860678.492204] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.PRT3._GTF.DSSP], AE_NOT_FOUND (20201113/psargs-330)
[1860678.492240] No Local Variables are initialized for Method [_GTF]
[1860678.492241] No Arguments are initialized for method [_GTF]
[1860678.492243] ACPI Error: Aborting method \_SB.PCI0.SAT0.PRT3._GTF due to previous error (AE_NOT_FOUND) (20201113/psparse-529)
[1860678.492639] ata4.00: configured for UDMA/133
[1860678.492696] scsi 3:0:0:0: Direct-Access     ATA      ST8000NM0055-1RM SN05 PQ: 0 ANSI: 5
[1860678.492887] sd 3:0:0:0: Attached scsi generic sg3 type 0
[1860678.492922] sd 3:0:0:0: [sdd] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
[1860678.492924] sd 3:0:0:0: [sdd] 4096-byte physical blocks
[1860678.492929] sd 3:0:0:0: [sdd] Write Protect is off
[1860678.492930] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[1860678.492939] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[1860678.570090] sd 3:0:0:0: [sdd] Attached SCSI disk

I therefore did not start the resilvering and rebooted the machine instead. Now I noticed that all disks run in UDMA/133 (even though both the disks and the controller support SATA 6.0 Gbps).

Code:

dmesg | grep ata1
[    1.439448] ata1: SATA max UDMA/133 abar m2048@0xf7a4b000 port 0xf7a4b100 irq 130
[    1.757152] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.774653] ata1.00: ATA-10: ST8000NM0055-1RM112, SN04, max UDMA/133
[    1.774655] ata1.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    1.778225] ata1.00: configured for UDMA/133
root@virtualhost:~# dmesg | grep ata2
[    1.439450] ata2: SATA max UDMA/133 abar m2048@0xf7a4b000 port 0xf7a4b180 irq 130
[    1.752945] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.756486] ata2.00: ATA-10: ST8000NM0055-1RM112, SN04, max UDMA/133
[    1.756489] ata2.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    1.759997] ata2.00: configured for UDMA/133
root@virtualhost:~# dmesg | grep ata3
[    1.439452] ata3: SATA max UDMA/133 abar m2048@0xf7a4b000 port 0xf7a4b200 irq 130
[    1.752901] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.765947] ata3.00: ATA-10: ST8000NM0055-1RM112, SN05, max UDMA/133
[    1.765950] ata3.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    1.769530] ata3.00: configured for UDMA/133

I'm at a loss. What could be the cause?

Many thanks

Tony

gmed · Nov 14, 2021

Your sdc is the culprit.
Is the replacement disk brand new?
Is it the same model as the old one?

If it isn't new, has it ever been used on a RAID controller or as part of an md device?

tony blue · Nov 14, 2021

Is the replacement disk brand new? Is it the same model as the old one?

Yes, it was still sealed and is identical to sdc and to the other two disks.

What surprises me most is that the other disks are suddenly running with UDMA/133.

gmed · Nov 16, 2021

What kind of HBA/controller are the disks attached to?

tony blue · Nov 17, 2021

The disks are attached to the mainboard's SATA controller (Gigabyte 270-HD3P).

What is the most sensible next step? Replace the disk, or replace the mainboard (because of the SATA controller)?

gmed · Nov 18, 2021

Morning,

Honestly, I wouldn't do anything about that.

"SATA link up 6.0 Gbps" says that the disks are being addressed at their best possible bandwidth, and that's that.
SATA is still ATA after all, and that is probably where this message, which is really unnecessary for current hardware, comes from?
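To make the point concrete: the `SStatus 133` in the dmesg lines is the hexadecimal SATA status register, and its middle digit carries the negotiated link generation, while "UDMA/133" is the legacy ATA transfer mode name that libata still prints. A minimal Python sketch of the decoding follows; the helper names `decode_sstatus` and `link_speed` are made up here, and the field layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11) follows the SATA register definition.

```python
import re

# SPD field values defined for SATA link generations.
SPEEDS = {1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}

LINK = re.compile(r"(ata\d+): SATA link up (\S+ Gbps) \(SStatus ([0-9A-Fa-f]+) SControl")


def decode_sstatus(hex_digits):
    """Split the SStatus value (printed in hex by the kernel) into its fields."""
    value = int(hex_digits, 16)
    det = value & 0xF          # device detection: 3 = device present, link established
    spd = (value >> 4) & 0xF   # negotiated link speed generation
    ipm = (value >> 8) & 0xF   # interface power management: 1 = active
    return {"det": det, "spd": spd, "ipm": ipm, "speed": SPEEDS.get(spd, "unknown")}


def link_speed(dmesg_line):
    """Return (port, speed reported by the kernel, decoded SStatus) or None."""
    m = LINK.search(dmesg_line)
    if not m:
        return None
    port, reported, sstatus = m.groups()
    return port, reported, decode_sstatus(sstatus)


if __name__ == "__main__":
    line = "[    1.757152] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)"
    print(link_speed(line))
```

For the lines posted in this thread, the decoded speed field is 3 (6.0 Gbps) on every port, which matches the "SATA link up 6.0 Gbps" text and supports the reading that the UDMA/133 message does not indicate a slowdown.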

tony blue · Nov 19, 2021

Thank you, gmed, for the reply. So the SATA controller on the mainboard should be fine.

However, ZFS still reports an error state:

zpool status
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 1.35G in 00:01:47 with 0 errors on Sun Nov 14 10:44:08 2021
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
            sdc2    ONLINE       1     0     2

errors: No known data errors

smartd reports:

The following warning/error was logged by the smartd daemon:

Device: /dev/sdc [SAT], ATA error count increased from 1 to 2

However, I can't assess this error. Should I

* tell ZFS that everything is OK (zpool clear), or
* replace the disk (zpool replace)?

Many thanks!

Tony

gmed · Nov 19, 2021

What does smartctl -a /dev/sdc report?
If it comes back with something like this:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

then I would say that a zpool clear will settle things.
If errors also show up with the new disk, then it won't.

tony blue · Nov 19, 2021

The disk has not been replaced yet.

Code:

smartctl -a /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.22-5-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Device Model:     ST8000NM0055-1RM112
Serial Number:    ZA19V8QR
LU WWN Device Id: 5 000c50 0af629d42
Firmware Version: SN05
User Capacity:    8.001.563.222.016 bytes [8,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Nov 19 14:52:33 2021 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 816) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   081   061   044    Pre-fail  Always       -       116949104
  3 Spin_Up_Time            0x0003   089   089   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2696
  7 Seek_Error_Rate         0x000f   093   060   045    Pre-fail  Always       -       2046447042
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12519
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       22
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 0 3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   053   040    Old_age   Always       -       41 (Min/Max 32/42)
191 G-Sense_Error_Rate      0x0032   090   090   000    Old_age   Always       -       21032
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       331
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       569
194 Temperature_Celsius     0x0022   041   047   000    Old_age   Always       -       41 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   081   064   000    Old_age   Always       -       116949104
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       12460h+10m+38.267s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       120337357432
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       199351569366

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 12476 hours (519 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 20 ff ff ff 4f 00   3d+09:19:11.102  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00   3d+09:19:10.810  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00   3d+09:19:10.797  READ FPDMA QUEUED
  60 00 38 ff ff ff 4f 00   3d+09:19:10.797  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00   3d+09:19:10.787  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 12390 hours (516 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 80 ff ff ff 4f 00  31d+18:37:15.339  WRITE FPDMA QUEUED
  60 00 88 ff ff ff 4f 00  31d+18:37:12.290  READ FPDMA QUEUED
  60 00 90 ff ff ff 4f 00  31d+18:37:12.267  READ FPDMA QUEUED
  60 00 38 ff ff ff 4f 00  31d+18:37:12.267  READ FPDMA QUEUED
  60 00 48 ff ff ff 4f 00  31d+18:37:12.227  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12392         -
# 2  Extended offline    Completed without error       00%        40         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
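As an aid for reading the attribute table in the output above, here is a rough Python sketch that parses the rows and reports two things: attributes whose normalized VALUE has reached THRESH, and a handful of error counters with a non-zero raw value. The watch list (IDs 5, 187, 197, 198, 199) and all function names are choices made for this illustration, not something smartctl defines, and vendor-specific raw values such as Raw_Read_Error_Rate are deliberately left uninterpreted.

```python
import re

# A subset of the attribute rows from the smartctl -a output in this post.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   081   061   044    Pre-fail  Always       -       116949104
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2696
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 0 3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
"""

ATTR_LINE = re.compile(r"^\s*\d+\s+\S+\s+0x[0-9a-fA-F]{4}\s")
WATCHED_IDS = (5, 187, 197, 198, 199)  # assumption: counters worth a closer look


def parse_attributes(text):
    """Map attribute ID to its name, normalized values and raw value string."""
    attrs = {}
    for line in text.splitlines():
        if not ATTR_LINE.match(line):
            continue  # skip header and non-table lines
        f = line.split(None, 9)  # the raw value may itself contain spaces
        attrs[int(f[0])] = {
            "name": f[1],
            "value": int(f[3]),
            "worst": int(f[4]),
            "thresh": int(f[5]),
            "raw": f[9].strip(),
        }
    return attrs


def at_or_below_threshold(attrs):
    """IDs whose normalized value has reached a non-zero threshold."""
    return [i for i, a in attrs.items() if a["thresh"] > 0 and a["value"] <= a["thresh"]]


def nonzero_counters(attrs, ids=WATCHED_IDS):
    """Watched IDs whose raw value starts with a non-zero integer."""
    found = {}
    for attr_id in ids:
        if attr_id not in attrs:
            continue
        m = re.match(r"\d+", attrs[attr_id]["raw"])
        if m and int(m.group()) > 0:
            found[attr_id] = int(m.group())
    return found


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    print("at or below threshold:", at_or_below_threshold(attrs))
    for attr_id, raw in nonzero_counters(attrs).items():
        print(attr_id, attrs[attr_id]["name"], raw)
```

On these rows no attribute has reached its threshold, which is consistent with the PASSED overall result, while the raw counters for reallocated sectors, reported uncorrectable errors and UDMA CRC errors are non-zero.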

I can't assess the errors myself. Should I replace the disk?

Many thanks!

Falk R. · Nov 19, 2021

Looks normal so far for a Seagate. There are only 2 unrecoverable errors so far.

RolandK · Nov 20, 2021

Try swapping the SATA cable.

gmed · Nov 20, 2021

Make regular backups and wait, or replace the disk and, if it is still under warranty, send the old one to Seagate; you will then get a replacement disk.
