Smart errors

wirdo02 · Jul 16, 2020

Hello,

I'm having an odd issue with one of our servers. I replaced one of the disks because it was bad, but the new one keeps reporting SMART errors as well.
When I run a short or extended manual test, I don't see these errors at all. Could they be cached from the failed drive (which was also sdb)?

Syslog:
Jul 16 08:20:32 intern smartd[18216]: Device: /dev/sdb [SAT], Failed SMART usage Attribute: 1 Raw_Read_Error_Rate.
Jul 16 08:20:32 intern smartd[18216]: Device: /dev/sdb [SAT], Failed SMART usage Attribute: 172 Erase_Fail_Count.
Jul 16 08:20:32 intern smartd[18216]: Device: /dev/sdb [SAT], Failed SMART usage Attribute: 173 Ave_Block-Erase_Count.
Jul 16 08:20:32 intern smartd[18216]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 68 to 69
Jul 16 08:20:32 intern smartd[18216]: Device: /dev/sdb [SAT], Failed SMART usage Attribute: 206 Write_Error_Rate.

After an extended SMART test:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   000   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       70
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   000   000   000    Old_age   Always       -       2
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Reserve_NAND_Blk 0x0033   100   100   000    Pre-fail  Always       -       228
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   039   000    Old_age   Always       -       36 (Min/Max 24/61)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   000   000   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       775601670
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       24237552
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       10450944

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        70         -
# 2  Short offline       Completed without error       00%        69         -

Does anyone have an idea?

wirdo02 · Aug 17, 2020

Does anyone know how to filter out the incorrect information? We keep getting e-mails every day about these SMART errors.

wirdo02 · Aug 20, 2020

So I noticed that the error being sent is an old one. The e-mail says:

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed Jun 24 15:14:57 2020 CEST
Another message will be sent in 24 hours if the problem persists.

The serial number mentioned in the e-mail is from the old drive, which has already been removed.
Is this cached somewhere? Shouldn't it notice that the drive is gone?
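
One way to confirm that a warning mail refers to a drive that is no longer installed is to compare the serial number in the mail with the one the current drive reports. The following is only a minimal sketch: it assumes the mail carries a "Device info" line with an `S/N:` field (as in the smartd message quoted later in this thread) and that the `smartctl -i /dev/sdb` output has been saved as text. The function names and the sample strings are hypothetical.

```python
import re


def serial_from_smartd_mail(mail_text):
    """Return the serial from an 'S/N:...' field in a warning mail, or None."""
    match = re.search(r"S/N:([^,\s]+)", mail_text)
    return match.group(1) if match else None


def serial_from_smartctl_info(info_text):
    """Return the value of the 'Serial Number:' line of `smartctl -i`, or None."""
    match = re.search(r"^Serial Number:\s*(\S+)", info_text, re.MULTILINE)
    return match.group(1) if match else None


def mail_refers_to_installed_drive(mail_text, info_text):
    """True only if both serials were found and they are identical."""
    mail_serial = serial_from_smartd_mail(mail_text)
    drive_serial = serial_from_smartctl_info(info_text)
    return mail_serial is not None and mail_serial == drive_serial


if __name__ == "__main__":
    # Hypothetical sample texts; paste the real mail and smartctl output instead.
    mail = "Device info:\nEXAMPLE-MODEL, S/N:OLD123, WWN:0-000000-000000000, FW:X, 240 GB\n"
    info = "Device Model:     EXAMPLE-MODEL\nSerial Number:    NEW456\n"
    print(mail_refers_to_installed_drive(mail, info))
```

If this reports a mismatch, the mail is describing the removed drive, which would fit the "original message ... sent at Wed Jun 24" line in the e-mail.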

Zoker · Jan 19, 2021

I have a similar issue:

Device: /dev/sda [SAT], Failed SMART usage Attribute: 194 Temperature_Celsius.

Device info:
CT240BX500SSD1, S/N:1933E1940072, WWN:0-000000-000000000, FW:M6CR013, 240 GB

Can someone from the Proxmox team please have a look at this and give some tips on how to solve this issue and what it actually means?

guletz · Jan 19, 2021

Hi @Zoker

It would be best to post your full smartctl output:

Code:

smartctl -a /dev/sda

It is possible that your SSD/HDD uses a wrong SMART temperature threshold.

Zoker · Jan 19, 2021

Hi @guletz

Thanks for the reply!

This is the output of the command:

Code:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     CT240BX500SSD1
Serial Number:    ----
LU WWN Device Id: 0 000000 000000000
Firmware Version: M6CR013
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan 19 15:20:36 2021 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       1942
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       66
171 Program_Fail_Count      0x0032   100   100   050    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   050    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   050    Old_age   Always       -       24
174 Unexpect_Power_Loss_Ct  0x0032   100   100   050    Old_age   Always       -       29
180 Unused_Reserve_NAND_Blk 0x0032   100   100   050    Old_age   Always       -       100
183 SATA_Interfac_Downshift 0x0032   100   100   050    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   050    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   050    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   055   029   050    Old_age   Always   In_the_past 45 (Min/Max 22/71)
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   050    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       99
206 Write_Error_Rate        0x002e   100   100   050    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   050    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   050    Old_age   Always       -       2499958988
247 Host_Program_Page_Count 0x0032   100   100   050    Old_age   Always       -       78123718
248 FTL_Program_Page_Count  0x0032   100   100   050    Old_age   Always       -       115347456

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         3         -

Selective Self-tests/Logging not supported

guletz · Jan 19, 2021

Hi again

194 Temperature_Celsius 0x0022 055 029 050 Old_age Always In_the_past 45 (Min/Max 22/71)

So in plain words, smartd is trying to say that at some point in the past (In_the_past) your SSD got hot enough to fail this attribute; the recorded maximum is 71 C. The temperature is now 45 C.
In my opinion even 45 C is not normal/optimal. Maybe your server room is too hot, or the airflow in your server case is not OK?

Another idea is to get the same info for any other HDD/SSD that you already have in that server (to get, let's say, a "second opinion").
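
To make the WHEN_FAILED column less mysterious, here is a small sketch of how such a row can be classified. It assumes the rule smartmontools is generally understood to apply: a threshold of 0 never fails, a VALUE at or below THRESH is failing now, and a WORST at or below THRESH failed in the past. The function names are made up for this example and are not part of smartmontools.

```python
def attribute_state(value, worst, thresh):
    """Classify a normalized SMART attribute (assumed rule, see text above)."""
    if thresh == 0:
        return "ok"            # assumption: threshold 0 means "always passing"
    if value <= thresh:
        return "failing_now"
    if worst <= thresh:
        return "in_the_past"
    return "ok"


def parse_attribute_row(row):
    """Split one row of the `smartctl -A` attribute table into named fields."""
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": " ".join(fields[9:]),
    }


row = "194 Temperature_Celsius 0x0022 055 029 050 Old_age Always In_the_past 45 (Min/Max 22/71)"
attr = parse_attribute_row(row)
print(attr["name"], attribute_state(attr["value"], attr["worst"], attr["thresh"]))
```

In this particular output the normalized value looks like 100 minus the raw temperature (055 with 45 C now, 029 with the 71 C maximum). If that holds, a THRESH of 050 corresponds to roughly 50 C. That is an observation from the numbers posted here, not something taken from vendor documentation.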

Good luck / Bafta !

diegargon · Dec 11, 2021

Same here.

I read some time ago about a related temperature bug in some Crucial disks. I did not do the update myself, but if your disk is affected you can update the disk firmware, if I remember right.

brydzysta · Apr 15, 2023

There is nothing to worry about.
It is the temperature on the Fahrenheit scale.
For example:
75F = 24C
85F = 29C
115F = 46C
Worry about the SSD when the temperature reaches:
149F = 65C: attention, check cooling
158F = 70C: critical, replace the drive
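
The conversion behind the figures above is C = (F - 32) * 5 / 9. A minimal sketch of the arithmetic follows. The function name is chosen for this example, and the snippet only shows the conversion; it does not establish which scale a given drive actually reports.

```python
def fahrenheit_to_celsius(fahrenheit):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9


# Print the pairs mentioned in the post, rounded to one decimal place.
for f in (75, 85, 115, 149, 158):
    print(f"{f}F = {fahrenheit_to_celsius(f):.1f}C")
```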
