Hi,
I have had backup job errors for the past two days (ERROR: job failed with err -125 - Operation canceled).
Looking at dmesg, I can see this:
Code:
[413188.587199] ata1.00: exception Emask 0x0 SAct 0x1000 SErr 0x0 action 0x0
[413188.587213] ata1.00: irq_stat 0x40000008
[413188.587217] ata1.00: failed command: READ FPDMA QUEUED
[413188.587219] ata1.00: cmd 60/08:60:38:54:42/00:00:02:00:00/40 tag 12 ncq dma 4096 in
         res 41/40:00:38:54:42/00:00:02:00:00/00 Emask 0x409 (media error) <F>
[413188.587228] ata1.00: status: { DRDY ERR }
[413188.587231] ata1.00: error: { UNC }
[413188.594362] ata1.00: configured for UDMA/133
[413188.594379] sd 0:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[413188.594382] sd 0:0:0:0: [sda] tag#12 Sense Key : Medium Error [current]
[413188.594384] sd 0:0:0:0: [sda] tag#12 Add. Sense: Unrecovered read error - auto reallocate failed
[413188.594387] sd 0:0:0:0: [sda] tag#12 CDB: Read(10) 28 00 02 42 54 38 00 00 08 00
[413188.594388] I/O error, dev sda, sector 37901368 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[413188.594408] ata1: EH complete
[413188.651220] ata1.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x0
[413188.651233] ata1.00: irq_stat 0x40000008
[413188.651236] ata1.00: failed command: READ FPDMA QUEUED
[413188.651239] ata1.00: cmd 60/08:38:38:54:42/00:00:02:00:00/40 tag 7 ncq dma 4096 in
         res 41/40:00:38:54:42/00:00:02:00:00/00 Emask 0x409 (media error) <F>
[413188.651247] ata1.00: status: { DRDY ERR }
[413188.651250] ata1.00: error: { UNC }
[413188.658395] ata1.00: configured for UDMA/133
[413188.658409] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[413188.658413] sd 0:0:0:0: [sda] tag#7 Sense Key : Medium Error [current]
[413188.658415] sd 0:0:0:0: [sda] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[413188.658417] sd 0:0:0:0: [sda] tag#7 CDB: Read(10) 28 00 02 42 54 38 00 00 08 00
[413188.658418] I/O error, dev sda, sector 37901368 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[413188.658433] ata1: EH complete
I know I need to replace this SSD, but I would like to know why S.M.A.R.T. continues to show PASSED in the Proxmox disks UI:
Code:
root@odin:~# smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Marvell based SanDisk SSDs
Device Model: SanDisk SSD PLUS 480GB
Serial Number: 193448800138
LU WWN Device Id: 5 001b44 8b821c63d
Firmware Version: UG2204RL
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic
Device is: In smartctl database 7.3/5319
ATA Version is: ACS-3, ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Sep 2 09:55:19 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[...]
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       43227
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       56
165 Total_Write/Erase_Count 0x0032   100   100   000    Old_age   Always       -       4105
166 Min_W/E_Cycle           0x0032   100   100   ---    Old_age   Always       -       17
167 Min_Bad_Block/Die       0x0032   100   100   ---    Old_age   Always       -       32
168 Maximum_Erase_Cycle     0x0032   100   100   ---    Old_age   Always       -       50
169 Total_Bad_Block         0x0032   100   100   ---    Old_age   Always       -       426
170 Unknown_Marvell_Attr    0x0032   100   100   ---    Old_age   Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Avg_Write/Erase_Count   0x0032   100   100   000    Old_age   Always       -       17
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       5510
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   054   000    Old_age   Always       -       35 (Min/Max 16/54)
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
230 Perc_Write/Erase_Count  0x0032   100   100   000    Old_age   Always       -       2888 808 2888
232 Perc_Avail_Resrvd_Space 0x0033   100   100   005    Pre-fail  Always       -       100
233 Total_NAND_Writes_GiB   0x0032   100   100   ---    Old_age   Always       -       9104
234 Perc_Write/Erase_Ct_BC  0x0032   100   100   000    Old_age   Always       -       50816
241 Total_Writes_GiB        0x0030   100   100   000    Old_age   Offline      -       16739
242 Total_Reads_GiB         0x0030   100   100   000    Old_age   Offline      -       17602
244 Thermal_Throttle        0x0032   000   100   ---    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Selective Self-tests/Logging not supported
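What stands out to me in this output is attribute 187 (Reported_Uncorrect): its raw value is 5510, yet its normalized VALUE is still 100 with a THRESH of 000. My understanding, which may be incomplete, is that the overall verdict only reflects normalized values crossing their thresholds, so a raw counter like this one never trips it. Below is a rough Python sketch of the kind of extra check I have in mind. The watch list of attribute IDs and the function name are my own assumptions, not anything Proxmox or smartmontools provides.

```python
import re

# Hypothetical watch list: attribute IDs whose raw counter should normally stay at 0.
WATCHED_IDS = {5, 187, 197, 198}

# Matches one row of the "smartctl -a" attribute table, for example:
# 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 5510
ATTR_LINE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+|---)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def flag_raw_counters(smartctl_output, watched_ids=WATCHED_IDS):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    flagged = []
    for line in smartctl_output.splitlines():
        match = ATTR_LINE.match(line)
        if not match:
            continue  # not an attribute row
        attr_id = int(match.group("id"))
        raw = int(match.group("raw"))  # first integer of the RAW_VALUE column
        if attr_id in watched_ids and raw > 0:
            flagged.append((attr_id, match.group("name"), raw))
    return flagged


# Three rows copied from the output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 5510
194 Temperature_Celsius 0x0022 065 054 000 Old_age Always - 35 (Min/Max 16/54)
"""

if __name__ == "__main__":
    for attr_id, name, raw in flag_raw_counters(SAMPLE):
        print(f"attribute {attr_id} ({name}) has raw value {raw}")
```

On the three sample rows this flags only attribute 187, which is the one that matches the read errors in dmesg.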
Shouldn't Proxmox send an e-mail to report a hardware error, like smartd does?
Maybe by writing a basic parser that checks for the most common errors (using a knowledge database)?
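To make the parser idea concrete, here is a minimal Python sketch of what I mean. The rule names and the `RULES` table are a hypothetical stand-in for the knowledge database, built only from the messages in my dmesg output above. It is not an existing Proxmox feature, and a real version would need many more patterns.

```python
import re
from collections import Counter

# Hypothetical "knowledge database": rule name -> regex for a kernel log line.
RULES = {
    "io_error": re.compile(r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"),
    "medium_error": re.compile(r"Sense Key : Medium Error"),
    "unrecovered_read": re.compile(r"Unrecovered read error"),
    "ata_uncorrectable": re.compile(r"ata[\d.]+: error: \{ UNC \}"),
}


def scan_kernel_log(lines):
    """Count how many log lines match each known error pattern."""
    hits = Counter()
    for line in lines:
        for name, pattern in RULES.items():
            if pattern.search(line):
                hits[name] += 1
    return hits


# Four lines copied from the dmesg output above.
SAMPLE_LOG = [
    "[413188.587231] ata1.00: error: { UNC }",
    "[413188.594382] sd 0:0:0:0: [sda] tag#12 Sense Key : Medium Error [current]",
    "[413188.594384] sd 0:0:0:0: [sda] tag#12 Add. Sense: Unrecovered read error - auto reallocate failed",
    "[413188.594388] I/O error, dev sda, sector 37901368 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2",
]

if __name__ == "__main__":
    for rule, count in scan_kernel_log(SAMPLE_LOG).items():
        print(f"{rule}: {count} matching line(s)")
```

The lines could come from dmesg or the kernel journal, and any non-empty result could then trigger the same kind of e-mail notification that smartd sends.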