Is my HDD dying?

RudyBzh

Hi,

Could anyone advise whether my /dev/sdb is dying, judging by these logs?

Code:
sept. 02 05:22:43 pve zed[185520]: eid=28 class=checksum pool='tank' vdev=ata-ST14000VN0008-2JG101_ZHZ7ENR0-part1 size=57344 offset=4970742665216 priority=4 err=0 flags=0x1000b0 bookmark=1412:196742:0:124948
sept. 02 05:26:45 pve zed[186864]: eid=29 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=5389274861568 priority=4 err=0 flags=0x1000b0 bookmark=1412:199189:0:109375
sept. 02 05:52:25 pve zed[195386]: eid=30 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=6913804709888 priority=4 err=0 flags=0x1000b0 bookmark=1412:210864:0:2479
sept. 02 05:58:14 pve zed[197302]: eid=31 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=7247486844928 priority=4 err=0 flags=0x1000b0 bookmark=1412:279591:0:13
sept. 02 08:49:38 pve zed[254861]: eid=32 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=13678176972800 priority=4 err=0 flags=0x1000b0 bookmark=1412:394530:0:25384
sept. 02 09:13:26 pve zed[262688]: eid=34 class=scrub_finish pool='tank'
sept. 02 15:23:13 pve smartd[1662]: Device: /dev/sdb [SAT], CHECK POWER STATUS spins up disk (0x80 -> 0xff)
sept. 02 18:10:39 pve kernel: ata6.00: exception Emask 0x0 SAct 0x2000000 SErr 0x0 action 0x0
sept. 02 18:10:39 pve kernel: ata6.00: irq_stat 0x40000008
sept. 02 18:10:39 pve kernel: ata6.00: failed command: READ FPDMA QUEUED
sept. 02 18:10:39 pve kernel: ata6.00: cmd 60/80:c8:f8:ac:c3/00:00:0a:04:00/40 tag 25 ncq dma 65536 in
                                       res 43/04:80:f8:ac:c3/00:00:0a:04:00/40 Emask 0x400 (NCQ error) <F>
sept. 02 18:10:39 pve kernel: ata6.00: status: { DRDY SENSE ERR }
sept. 02 18:10:39 pve kernel: ata6.00: error: { ABRT }
sept. 02 18:10:39 pve kernel: ata6.00: n_sectors mismatch 27344764928 != 0
sept. 02 18:10:39 pve kernel: ata6.00: revalidation failed (errno=-19)
sept. 02 18:10:39 pve kernel: ata6: limiting SATA link speed to 3.0 Gbps
sept. 02 18:10:39 pve kernel: ata6.00: limiting speed to UDMA/100:PIO3
sept. 02 18:10:39 pve kernel: ata6: hard resetting link
sept. 02 18:10:39 pve kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
sept. 02 18:10:39 pve kernel: ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
sept. 02 18:10:39 pve kernel: ata6.00: revalidation failed (errno=-5)
sept. 02 18:10:39 pve kernel: ata6.00: disable device
sept. 02 18:10:39 pve kernel: sd 5:0:0:0: rejecting I/O to offline device
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 22123744136 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11327355949056 size=65536 flags=1573248
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 22123744264 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11327356014592 size=65536 flags=1573248
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11327356080128 size=57344 flags=1573248
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 22085397208 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11307722321920 size=65536 flags=1572992
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11307722387456 size=65536 flags=1572992
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11307722452992 size=65536 flags=1573248
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 22123744504 op 0x0:(READ) flags 0x0 phys_seg 23 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=11327356137472 size=671744 flags=1074267264
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=270336 size=8192 flags=721089
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 27344745488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=14000508641280 size=8192 flags=721089
sept. 02 18:10:39 pve kernel: I/O error, dev sdb, sector 27344746000 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:10:39 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=5 type=1 offset=14000508903424 size=8192 flags=721089
sept. 02 18:10:39 pve zed[422455]: eid=35 class=io pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=8192 offset=270336 priority=0 err=5 flags=0xb00c1
sept. 02 18:10:39 pve zed[422456]: eid=36 class=io pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=8192 offset=14000508641280 priority=0 err=5 flags=0xb00c1
sept. 02 18:10:39 pve zed[422458]: eid=37 class=io pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=8192 offset=14000508903424 priority=0 err=5 flags=0xb00c1
sept. 02 18:10:39 pve zed[422459]: eid=38 class=probe_failure pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1
sept. 02 18:10:44 pve kernel: ata6: hard resetting link
sept. 02 18:10:45 pve kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
sept. 02 18:10:45 pve kernel: ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
sept. 02 18:10:50 pve kernel: ata6: hard resetting link
sept. 02 18:10:50 pve kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
sept. 02 18:10:50 pve kernel: ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
sept. 02 18:10:50 pve kernel: ata6: limiting SATA link speed to 3.0 Gbps
sept. 02 18:10:56 pve kernel: ata6: hard resetting link
sept. 02 18:10:56 pve kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
sept. 02 18:10:56 pve kernel: ata6.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
sept. 02 18:11:01 pve kernel: ata6: hard resetting link
sept. 02 18:11:02 pve kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=23s
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] tag#25 Sense Key : Hardware Error [current]
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] tag#25 ASC=0x44 <<vendor>>ASCQ=0xd2
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] tag#25 CDB: Read(16) 88 00 00 00 00 04 0a c3 ac f8 00 00 00 80 00 00
sept. 02 18:11:02 pve kernel: critical target error, dev sdb, sector 17360465144 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
sept. 02 18:11:02 pve kernel: zio pool=tank vdev=/dev/disk/by-id/ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 error=121 type=1 offset=8888557105152 size=65536 flags=1572992
sept. 02 18:11:02 pve kernel: ata6: EH complete
sept. 02 18:11:02 pve kernel: ata6.00: detaching (SCSI 5:0:0:0)
sept. 02 18:11:02 pve zed[422572]: eid=39 class=statechange pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 vdev_state=FAULTED
sept. 02 18:11:02 pve zed[422581]: eid=40 class=probe_failure pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] Synchronizing SCSI cache
sept. 02 18:11:02 pve kernel: sd 5:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sept. 02 18:11:02 pve zed[422671]: eid=41 class=statechange pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 vdev_state=REMOVED
sept. 02 18:11:02 pve zed[422673]: eid=42 class=removed pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 vdev_state=REMOVED
sept. 02 18:11:02 pve zed[422744]: eid=43 class=config_sync pool='tank'
sept. 02 18:23:02 pve smartd[1662]: Device: /dev/sdb [SAT], removed ATA device: No such device
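
A log like the one above is easier to read once the zed events are grouped by device. Below is a minimal sketch, assuming the journal excerpt is available as plain text lines. The function name `tally_zed_events` and the regular expression are illustrative only and are not part of ZFS or zed; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches zed event lines such as:
#   ... zed[185520]: eid=28 class=checksum pool='tank' vdev=ata-...-part1 size=...
# The vdev field is optional because pool-wide events (e.g. scrub_finish) lack it.
ZED_EVENT = re.compile(
    r"zed\[\d+\]: eid=\d+ class=(?P<cls>\S+) pool='(?P<pool>[^']+)'"
    r"(?: vdev=(?P<vdev>\S+))?"
)

def tally_zed_events(lines):
    """Count zed events per (vdev, class); pool-wide events use vdev '-'."""
    counts = Counter()
    for line in lines:
        m = ZED_EVENT.search(line)
        if m:
            counts[(m.group("vdev") or "-", m.group("cls"))] += 1
    return counts

# A few lines copied from the journal excerpt above.
sample = [
    "sept. 02 05:22:43 pve zed[185520]: eid=28 class=checksum pool='tank' vdev=ata-ST14000VN0008-2JG101_ZHZ7ENR0-part1 size=57344 offset=4970742665216 priority=4 err=0 flags=0x1000b0 bookmark=1412:196742:0:124948",
    "sept. 02 05:26:45 pve zed[186864]: eid=29 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=5389274861568 priority=4 err=0 flags=0x1000b0 bookmark=1412:199189:0:109375",
    "sept. 02 05:52:25 pve zed[195386]: eid=30 class=checksum pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=65536 offset=6913804709888 priority=4 err=0 flags=0x1000b0 bookmark=1412:210864:0:2479",
    "sept. 02 09:13:26 pve zed[262688]: eid=34 class=scrub_finish pool='tank'",
    "sept. 02 18:10:39 pve zed[422455]: eid=35 class=io pool='tank' vdev=ata-TOSHIBA_MG07ACA14TE_Z070A2V2F94G-part1 size=8192 offset=270336 priority=0 err=5 flags=0xb00c1",
]

for (vdev, cls), n in sorted(tally_zed_events(sample).items()):
    print(f"{vdev:45} {cls:15} {n}")
```

Run against the full excerpt, the same grouping shows the checksum events during the scrub were not limited to the Toshiba drive: one of them (eid=28) is on the Seagate vdev.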

Yesterday, I had to cold-restart my server to make the drive visible as /dev/sdb again.
I managed to run a ZFS scrub, which corrected the errors after a few hours.
Today, it's down again with a degraded pool.

Thanks for your advice.

Regards.
 
What do the SMART values on the drive say?
Thanks for the reply.
Nothing dramatic, as far as I know/understand…
Note that I had to cold-reboot the server again to see /dev/sdb again.



Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA14TE
Serial Number:    Z070A2V2F94G
LU WWN Device Id: 5 000039 a98c9710e
Firmware Version: 0103
User Capacity:    14 000 519 643 136 bytes [14,0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep  2 22:50:37 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  20) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1355) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       9018
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       67
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   030   030   000    Old_age   Always       -       28394
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       67
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       32
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1018
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       34 (Min/Max 12/61)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       168689668
222 Loaded_Hours            0x0032   032   032   000    Old_age   Always       -       27438
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       593
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               40%     28394         -
# 2  Short offline       Completed without error       00%     28394         -
# 3  Short offline       Completed without error       00%     28372         -
# 4  Extended offline    Completed without error       00%        32         -
# 5  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
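
The "nothing dramatic" reading of the attribute table can be checked mechanically. Below is a minimal sketch, assuming the table is pasted as text in the layout shown above. The helper names and the `WATCHED` set of attribute IDs are choices made for this sketch, not a smartctl feature; the sample rows are copied from the output above.

```python
# Attribute IDs picked for this sketch: reallocated, pending, offline-uncorrectable
# sectors and interface CRC errors. The selection is an assumption, not a standard.
WATCHED = {5, 196, 197, 198, 199}

def parse_attributes(text):
    """Return {id: (name, raw_value_string)} from a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        # Ten columns: ID, name, flag, value, worst, thresh, type, updated,
        # when_failed, raw. The raw column may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9].strip())
    return attrs

def nonzero_watched(attrs):
    """Return watched attributes whose leading raw number is not zero."""
    result = {}
    for attr_id in sorted(WATCHED):
        if attr_id not in attrs:
            continue
        name, raw = attrs[attr_id]
        first = raw.split()[0]
        if first.isdigit() and int(first) != 0:
            result[attr_id] = (name, int(first))
    return result

# Rows copied from the smartctl output above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       32
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       34 (Min/Max 12/61)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(parse_attributes(sample)))
```

For the rows above, none of the watched counters is nonzero, which matches the PASSED self-assessment. The kernel log earlier in the thread nevertheless reports a Hardware Error sense key and the drive dropping off the bus, so the clean table alone does not settle the question.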
 
