Is my HDD dying?

RudyBzh

Member
Jul 9, 2020
Hi,

Can anyone advise whether my /dev/sdb is dying, judging from these logs?

Code:
sept. 02 05:22:43 pve zed[185520]: eid=28 class=checksum pool='tank' vdev=ata-ST14000VN0008-2JG101_ZHZ7ENR0-part1 size=57344 offset=4970742665216 priority=4 err=0 flags=0x1000b0 bookmark=1412:196742:0:124948
sept. 02 05:26:45 pve zed[186864]: eid=29 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=5389274861568 priority=4 err=0 flags=0x1000b0 bookmark=1412:199189:0:109375
sept. 02 05:52:25 pve zed[195386]: eid=30 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=6913804709888 priority=4 err=0 flags=0x1000b0 bookmark=1412:210864:0:2479
sept. 02 05:58:14 pve zed[197302]: eid=31 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=7247486844928 priority=4 err=0 flags=0x1000b0 bookmark=1412:279591:0:13
sept. 02 08:49:38 pve zed[254861]: eid=32 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=13678176972800 priority=4 err=0 flags=0x1000b0 bookmark=1412:394530:0:25384
sept. 02 09:13:26 pve zed[262688]: eid=34 class=scrub_finish pool='tank'
sept. 02 15:23:13 pve smartd[1662]: Device: /dev/sdb [SAT], CHECK POWER STATUS spins up disk (0x80 -> 0xff)
sept. 02 18:10:39 pve kernel: ata6.00: exception Emask 0x0 SAct 0x2000000 SErr 0x0 action 0x0
sept. 02 18:10:39 pve kernel: ata6.00: irq_stat 0x40000008
sept. 02 18:10:39 pve kernel: ata6.00: failed command: READ FPDMA QUEUED
sept. 02 18:10:39 pve kernel: ata6.00: cmd 60/80:c8:f8:ac:c3/00:00:0a:04:00/40 tag 25 ncq dma 65536 in
                                       res 43/04:80:f8:ac:c3/00:00:0a:04:00/40 Emask 0x400 (NCQ error) <F>
sept. 02 18:10:39 pve kernel: ata6.00: status: { DRDY SENSE ERR }
sept. 02 18:10:39 pve kernel: ata6.00: error: { ABRT }
sept. 02 18:10:39 pve kernel: ata6.00: n_sectors mismatch 27344764928 != 0
sept. 02 18:10:39 pve kernel: ata6.00: revalidation failed (errno=-19)
sept. 02 18:10:39 pve kernel: ata6: limiting SATA link speed to 3.0 Gbps
sept. 02 18:10:39 pve kernel: ata6.00: limiting speed to UDMA/100:PIO3
sept. 02 18:10:39 pve kernel: ata6: hard resetting link
sept. 02 18:10:39 pve kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
sept. 02 18:10:39 pve kernel: ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
sept. 02 18:10:39 pve kernel: ata6.00: revalidation failed (errno=-5)
sept. 02 18:10:39 pve kernel: ata6.00: disable device
sept. 02 18:10:39 pve kernel: sd 5:0:0:0: rejecting I/O to offline device
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 22123744136 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11327355949056 size=65536 flags=1573248
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 22123744264 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11327356014592 size=65536 flags=1573248
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11327356080128 size=57344 flags=1573248
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 22085397208 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11307722321920 size=65536 flags=1572992
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11307722387456 size=65536 flags=1572992
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11307722452992 size=65536 flags=1573248
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 22123744504 op 0x0:(READ) flags 0x0 phys_seg 23 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11327356137472 size=671744 flags=1074267264
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=270336 size=8192 flags=721089
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 27344745488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=14000508641280 size=8192 flags=721089
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 27344746000 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=14000508903424 size=8192 flags=721089
sept. 02 18:10:39 pve zed[422455]: eid=35 class=io pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=8192 offset=270336 priority=0 err=5 flags=0xb00c1
sept. 02 18:10:39 pve zed[422456]: eid=36 class=io pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=8192 offset=14000508641280 priority=0 err=5 flags=0xb00c1
sept. 02 18:10:39 pve zed[422458]: eid=37 class=io pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=8192 offset=14000508903424 priority=0 err=5 flags=0xb00c1
sept. 02 18:10:39 pve zed[422459]: eid=38 class=probe_failure pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1
sept. 02 18:10:44 pve kernel: ata6: hard resetting link
sept. 02 18:10:45 pve kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
sept. 02 18:10:45 pve kernel: ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
sept. 02 18:10:50 pve kernel: ata6: hard resetting link
sept. 02 18:10:50 pve kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
sept. 02 18:10:50 pve kernel: ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
sept. 02 18:10:50 pve kernel: ata6: limiting SATA link speed to 3.0 Gbps
sept. 02 18:10:56 pve kernel: ata6: hard resetting link
sept. 02 18:10:56 pve kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
sept. 02 18:10:56 pve kernel: ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
sept. 02 18:11:01 pve kernel: ata6: hard resetting link
sept. 02 18:11:02 pve kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=23s
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] tag#25 Sense Key : Hardware Error [current]
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] tag#25 ASC=0x44 <<vendor>>ASCQ=0xd2
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] tag#25 CDB: Read(16) 88 00 00 00 00 04 0a c3 ac f8 00 00 00 80 00 00
sept. 02 18:11:02 pve kernel: critical target error, dev sdb, sector 17360465144 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:11:02 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=121 type=1 offset=8888557105152 size=65536 flags=1572992
sept. 02 18:11:02 pve kernel: ata6: EH complete
sept. 02 18:11:02 pve kernel: ata6.00: detaching (SCSI 5:0:0:0)
sept. 02 18:11:02 pve zed[422572]: eid=39 class=statechange pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 vdev_state=FAULTED
sept. 02 18:11:02 pve zed[422581]: eid=40 class=probe_failure pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] Synchronizing SCSI cache
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sept. 02 18:11:02 pve zed[422671]: eid=41 class=statechange pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 vdev_state=REMOVED
sept. 02 18:11:02 pve zed[422673]: eid=42 class=removed pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 vdev_state=REMOVED
sept. 02 18:11:02 pve zed[422744]: eid=43 class=config_sync pool='tank'
sept. 02 18:23:02 pve smartd[1662]: Device: /dev/sdb [SAT], removed ATA device: No such device
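
The numbers in the log above can be cross-checked against each other. The following is only a sketch under stated assumptions: it assumes the `-part1` partition starts at sector 2048 (1 MiB), which the log does not state, and the helper names `sector_to_vdev_offset` and `decode_read16` are made up for illustration. Under that assumption, the sector numbers in the `I/O error, dev sdb` lines (512-byte sectors counted from the start of the disk) line up with the byte offsets in the `zio` lines, and the LBA in the `Read(16)` CDB matches the sector in the `critical target error` line. The kernel lines are interleaved, so the pairs below were matched by value rather than by position.

```python
# Cross-check block-layer sectors against ZFS vdev byte offsets from the log.
SECTOR_SIZE = 512          # logical sector size, as reported by smartctl
PART_START_SECTOR = 2048   # ASSUMPTION: -part1 begins at sector 2048 (1 MiB)


def sector_to_vdev_offset(sector: int) -> int:
    """Byte offset inside the -part1 vdev for a whole-disk sector number."""
    return (sector - PART_START_SECTOR) * SECTOR_SIZE


def decode_read16(cdb_hex: str) -> tuple[int, int]:
    """Return (LBA, sector count) from a SCSI READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")     # bytes 2..9: logical block address
    count = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return lba, count


# (sector from an "I/O error" line, offset from a "zio" line), copied from the log
pairs = [
    (22123744136, 11327355949056),
    (22085397208, 11307722321920),
    (2576, 270336),
    (27344745488, 14000508641280),
    (27344746000, 14000508903424),
    (17360465144, 8888557105152),
]

for sector, offset in pairs:
    status = "match" if sector_to_vdev_offset(sector) == offset else "MISMATCH"
    print(f"sector {sector:>12} -> offset {sector_to_vdev_offset(sector):>15} {status}")

lba, count = decode_read16("88 00 00 00 00 04 0a c3 ac f8 00 00 00 80 00 00")
print(f"READ(16): LBA {lba}, {count} sectors ({count * SECTOR_SIZE} bytes)")
```

If the pairs match, the block-layer errors and the ZFS errors describe the same failed reads rather than two separate problems.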

Yesterday, I had to do a cold restart of my server to make the drive visible as /dev/sdb again.
I managed to run a ZFS scrub, which corrected the errors after some hours.
Today, the drive is down again and the pool is degraded.

Thanks for your advice.

Regards.
 
What do the SMART values on the drive say?
Thanks for the reply.
Nothing dramatic, as far as I know/understand…
Note that I had to cold reboot the server again to see /dev/sdb again.



Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA14TE
Serial Number:    Z070A2V2F94G
LU WWN Device Id: 5 000039 a98c9710e
Firmware Version: 0103
User Capacity:    14 000 519 643 136 bytes [14,0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep  2 22:50:37 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  20) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1355) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       9018
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       67
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   030   030   000    Old_age   Always       -       28394
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       67
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       32
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1018
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       34 (Min/Max 12/61)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       168689668
222 Loaded_Hours            0x0032   032   032   000    Old_age   Always       -       27438
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       593
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               40%     28394         -
# 2  Short offline       Completed without error       00%     28394         -
# 3  Short offline       Completed without error       00%     28372         -
# 4  Extended offline    Completed without error       00%        32         -
# 5  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
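
To make "nothing dramatic" a bit more concrete, here is a minimal sketch that parses a few rows of the attribute table above and flags the usual suspects. The set `WATCHED_IDS` and the function names are my own choices for illustration, not anything defined by smartmontools. The only rule taken from smartctl itself is that an attribute counts as failed when its normalized VALUE is less than or equal to a non-zero THRESH.

```python
import re

# Attribute IDs often watched for media or cabling trouble.
# ASSUMPTION: this selection is a common habit, not a standard.
WATCHED_IDS = {5, 196, 197, 198, 199}

# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# A few rows copied from the smartctl output in this post.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   030   030   000    Old_age   Always       -       28394
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       34 (Min/Max 12/61)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Map attribute ID to its name, normalized value, threshold and raw value."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attr_id, name, value, _worst, thresh, raw = m.groups()
            rows[int(attr_id)] = {
                "name": name,
                "value": int(value),
                "thresh": int(thresh),
                "raw": int(raw),
            }
    return rows


def suspicious(rows):
    """Names of attributes at or below threshold, or watched with a non-zero raw value."""
    flagged = []
    for attr_id, attr in rows.items():
        failed = attr["thresh"] > 0 and attr["value"] <= attr["thresh"]
        watched = attr_id in WATCHED_IDS and attr["raw"] > 0
        if failed or watched:
            flagged.append(attr["name"])
    return flagged


rows = parse_attributes(SAMPLE)
print("flagged:", suspicious(rows))
```

For the values posted here nothing is flagged, which fits the picture of a drive that reports clean SMART attributes while still dropping off the bus.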