SSD I/O error and S.M.A.R.T data

dec · Sep 2, 2024

Hi,

I have had backup job errors for the past two days (ERROR: job failed with err -125 - Operation canceled).
Looking at dmesg, I can see this:

Code:

[413188.587199] ata1.00: exception Emask 0x0 SAct 0x1000 SErr 0x0 action 0x0
[413188.587213] ata1.00: irq_stat 0x40000008
[413188.587217] ata1.00: failed command: READ FPDMA QUEUED
[413188.587219] ata1.00: cmd 60/08:60:38:54:42/00:00:02:00:00/40 tag 12 ncq dma 4096 in
                         res 41/40:00:38:54:42/00:00:02:00:00/00 Emask 0x409 (media error) <F>
[413188.587228] ata1.00: status: { DRDY ERR }
[413188.587231] ata1.00: error: { UNC }
[413188.594362] ata1.00: configured for UDMA/133
[413188.594379] sd 0:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[413188.594382] sd 0:0:0:0: [sda] tag#12 Sense Key : Medium Error [current]
[413188.594384] sd 0:0:0:0: [sda] tag#12 Add. Sense: Unrecovered read error - auto reallocate failed
[413188.594387] sd 0:0:0:0: [sda] tag#12 CDB: Read(10) 28 00 02 42 54 38 00 00 08 00
[413188.594388] I/O error, dev sda, sector 37901368 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[413188.594408] ata1: EH complete
[413188.651220] ata1.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x0
[413188.651233] ata1.00: irq_stat 0x40000008
[413188.651236] ata1.00: failed command: READ FPDMA QUEUED
[413188.651239] ata1.00: cmd 60/08:38:38:54:42/00:00:02:00:00/40 tag 7 ncq dma 4096 in
                         res 41/40:00:38:54:42/00:00:02:00:00/00 Emask 0x409 (media error) <F>
[413188.651247] ata1.00: status: { DRDY ERR }
[413188.651250] ata1.00: error: { UNC }
[413188.658395] ata1.00: configured for UDMA/133
[413188.658409] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[413188.658413] sd 0:0:0:0: [sda] tag#7 Sense Key : Medium Error [current]
[413188.658415] sd 0:0:0:0: [sda] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[413188.658417] sd 0:0:0:0: [sda] tag#7 CDB: Read(10) 28 00 02 42 54 38 00 00 08 00
[413188.658418] I/O error, dev sda, sector 37901368 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[413188.658433] ata1: EH complete
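
The sector number in the I/O error line can be cross-checked against the CDB bytes in the same log. Below is a minimal sketch, assuming the standard READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8) and the 512-byte logical sector size reported by smartctl. `decode_read10` is a hypothetical helper name, not part of any tool mentioned here.

```python
import struct

def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, block_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    # Layout: opcode, flags, LBA (4 bytes), group number,
    # transfer length (2 bytes), control. All fields are big-endian.
    _opcode, _flags, lba, _group, count, _control = struct.unpack(">BBIBHB", cdb)
    return lba, count

# CDB copied from the dmesg output above
lba, count = decode_read10("28 00 02 42 54 38 00 00 08 00")
print(lba, count, count * 512)  # prints: 37901368 8 4096
```

Both failed commands decode to the same LBA (37901368) and an 8-sector (4096-byte) read, which matches the "sector 37901368" and "ncq dma 4096 in" fields, so the two errors are retries against the same location.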

I know I need to replace this SSD, but I would like to know why S.M.A.R.T. continues to show PASSED in the Proxmox disks UI.

Code:

root@odin:~# smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Marvell based SanDisk SSDs
Device Model:     SanDisk SSD PLUS 480GB
Serial Number:    193448800138
LU WWN Device Id: 5 001b44 8b821c63d
Firmware Version: UG2204RL
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep  2 09:55:19 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[...]

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       43227
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       56
165 Total_Write/Erase_Count 0x0032   100   100   000    Old_age   Always       -       4105
166 Min_W/E_Cycle           0x0032   100   100   ---    Old_age   Always       -       17
167 Min_Bad_Block/Die       0x0032   100   100   ---    Old_age   Always       -       32
168 Maximum_Erase_Cycle     0x0032   100   100   ---    Old_age   Always       -       50
169 Total_Bad_Block         0x0032   100   100   ---    Old_age   Always       -       426
170 Unknown_Marvell_Attr    0x0032   100   100   ---    Old_age   Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Avg_Write/Erase_Count   0x0032   100   100   000    Old_age   Always       -       17
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       5510
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   054   000    Old_age   Always       -       35 (Min/Max 16/54)
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
230 Perc_Write/Erase_Count  0x0032   100   100   000    Old_age   Always       -       2888 808 2888
232 Perc_Avail_Resrvd_Space 0x0033   100   100   005    Pre-fail  Always       -       100
233 Total_NAND_Writes_GiB   0x0032   100   100   ---    Old_age   Always       -       9104
234 Perc_Write/Erase_Ct_BC  0x0032   100   100   000    Old_age   Always       -       50816
241 Total_Writes_GiB        0x0030   100   100   000    Old_age   Offline      -       16739
242 Total_Reads_GiB         0x0030   100   100   000    Old_age   Offline      -       17602
244 Thermal_Throttle        0x0032   000   100   ---    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

Shouldn't Proxmox send an e-mail to report a hardware error, like smartd does?
Maybe a basic parser could check for the most common errors (using a knowledge base)?
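
As a rough illustration of that idea, here is a minimal sketch of such a parser. It is not how Proxmox VE or smartd work. The `WATCHED` list is a hypothetical "knowledge base" of attributes whose raw value I would expect to stay at 0, and the regular expression assumes the column layout of the attribute table shown above.

```python
import re

# Hypothetical watch list: attributes whose raw value should normally be 0.
WATCHED = {
    "Reallocated_Sector_Ct", "Program_Fail_Count", "Erase_Fail_Count",
    "End-to-End_Error", "Reported_Uncorrect", "SATA_CRC_Error",
}

# Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_LINE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+|---)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

def suspicious_attributes(smartctl_output: str) -> dict[str, int]:
    """Return watched attributes whose raw value is non-zero."""
    found = {}
    for line in smartctl_output.splitlines():
        m = ATTR_LINE.match(line)
        if m and m["name"] in WATCHED and int(m["raw"]) > 0:
            found[m["name"]] = int(m["raw"])
    return found

# Three rows copied from the smartctl output above
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       5510
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
"""
print(suspicious_attributes(SAMPLE))  # prints: {'Reported_Uncorrect': 5510}
```

On this drive the only watched attribute that stands out is Reported_Uncorrect with a raw value of 5510, even though its normalized value is still 100.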

Maximiliano · Sep 2, 2024

Hello,

According to the SMART output, every attribute (except thermals) reports a normalized value of 100 (lower is worse) and there are "No Errors Logged", so naturally Proxmox VE would report that the SMART health check passed.

Note that how those values are reported and normalized is very much vendor-specific. Do you have a hardware controller in between the host and the disks? That might prevent smartd from talking to the disks.
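
To illustrate the point about normalized values, here is a small sketch. It assumes the common convention that an attribute only counts as failed when its normalized VALUE drops to or below THRESH, and that a THRESH of 0 means the attribute can never fail. `attribute_failed` is a hypothetical helper, and the rows are copied from the smartctl output in this thread.

```python
def attribute_failed(value: int, thresh: int) -> bool:
    """Assumed convention: failed when VALUE <= THRESH; THRESH 0 never fails."""
    return thresh > 0 and value <= thresh

# (name, VALUE, THRESH, RAW_VALUE) from the attribute table above
rows = [
    ("Reported_Uncorrect", 100, 0, 5510),
    ("Perc_Avail_Resrvd_Space", 100, 5, 100),
]
for name, value, thresh, raw in rows:
    print(name, attribute_failed(value, thresh), raw)
```

Under that assumption, neither row trips its threshold: Reported_Uncorrect stays at a normalized 100 with a threshold of 0, regardless of its raw count of 5510.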

dec · Sep 2, 2024

No, there is no hardware controller; all disks are plugged directly into the motherboard.
I assumed an error would be logged by S.M.A.R.T. because, in my mind, "reallocate" referred to reallocating sectors (managed by the disk controller). But maybe this reallocation is handled by the filesystem (or the underlying I/O layer)?

Maximiliano · Sep 2, 2024

I am not familiar with the errors you see, but there are other possible causes for I/O errors besides faulty disks. Have you had more errors with this disk in the past? Have you tried re-seating the disk?

alofgran · Oct 21, 2024

@dec, did you ever find answers to this? I'm fighting exactly the same error, on top of dealing with an HDD supplier who isn't (yet) willing to honor their warranty and replace the drive because they think I haven't definitively proven (with their methods) that the disk is indeed dying/dead.

Any information you learned would be helpful.
