SMART error e-mails on PAST temperature issue

IFN

New Member
Nov 8, 2022
9
0
1
Hi all:

I'm getting daily e-mails of this:
The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, Critical Warning (0x02): Temperature

When I run smartctl -a /dev/nvme0, I get:

Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-14-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 1TB
Serial Number:                      xxxx
Firmware Version:                   2B4QFXO7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      5
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            186,906,210,304 [186 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 d12142ee3d
Local Time is:                      Sat Mar  9 17:47:04 2024 PST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x10):        NP_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.24W       -        -    0  0  0  0        0       0
 1 +     4.49W       -        -    1  1  1  1        0       0
 2 +     2.19W       -        -    2  2  2  2        0     500
 3 -   0.0500W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     1000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        43 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    5,076,879 [2.59 TB]
Data Units Written:                 9,893,908 [5.06 TB]
Host Read Commands:                 156,410,825
Host Write Commands:                287,179,799
Controller Busy Time:               1,299
Power Cycles:                       8
Power On Hours:                     2,396
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    748
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               43 Celsius
Temperature Sensor 2:               47 Celsius
Thermal Temp. 2 Transition Count:   171808
Thermal Temp. 2 Total Time:         37434

Error Information (NVMe Log 0x01, 16 of 64 entries)

I've run this within 30 minutes of getting the e-mail. I know we had an hvac failure one night, and the room got HOT. That was promptly corrected, but every night since, i've continued to receive this e-mail. It appears to me that there is no current temperature warning, just a past one.

Is there some way to acknowledge or accept this past issue and reset it/proxmox so it doesn't continue to send these now-false warnings daily?

Thanks!
--Jim
 
Sure, looks like its all defaults:

Code:
root@ADM-HV1:~# cat /etc/smartd.conf
# Sample configuration file for smartd.  See man smartd.conf.

# Home page is: http://www.smartmontools.org

# smartd will re-read the configuration file if it receives a HUP
# signal

# The file gives a list of devices to monitor using smartd, with one
# device per line. Text after a hash (#) is ignored, and you may use
# spaces and tabs for white space. You may use '\' to continue lines.

# You can usually identify which hard disks are on your system by
# looking in /proc/ide and in /proc/scsi.

# The word DEVICESCAN will cause any remaining lines in this
# configuration file to be ignored: it tells smartd to scan for all
# ATA and SCSI devices.  DEVICESCAN may be followed by any of the
# Directives listed below, which will be applied to all devices that
# are found.  Most users should comment out DEVICESCAN and explicitly
# list the devices that they wish to monitor.
DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

# Alternative setting to ignore temperature and power-on hours reports
# in syslog.
#DEVICESCAN -I 194 -I 231 -I 9

# Alternative setting to report more useful raw temperature in syslog.
#DEVICESCAN -R 194 -R 231 -I 9

# Alternative setting to report raw temperature changes >= 5 Celsius
# and min/max temperatures.
#DEVICESCAN -I 194 -I 231 -I 9 -W 5

# First ATA/SATA or SCSI/SAS disk.  Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# start a short self-test every day between 2-3am, and a long self test
# Saturdays between 3-4am.
#/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03)

# Monitor SMART status, ATA Error Log, Self-test log, and track
# changes in all attributes except for attribute 194
#/dev/sdb -H -l error -l selftest -t -I 194

# Monitor all attributes except normalized Temperature (usually 194),
# but track Temperature changes >= 4 Celsius, report Temperatures
# >= 45 Celsius and changes in Raw value of Reallocated_Sector_Ct (5).
# Send mail on SMART failures or when Temperature is >= 55 Celsius.
#/dev/sdc -a -I 194 -W 4,45,55 -R 5 -m admin@example.com

# An ATA disk may appear as a SCSI device to the OS. If a SCSI to
# ATA Translation (SAT) layer is between the OS and the device then
# this can be flagged with the '-d sat' option. This situation may
# become common with SATA disks in SAS and FC environments.
# /dev/sda -a -d sat

# A very silent check.  Only report SMART health status if it fails
# But send an email in this case
#/dev/sdc -H -C 0 -U 0 -m admin@example.com

# First two SCSI disks.  This will monitor everything that smartd can
# monitor.  Start extended self-tests Wednesdays between 6-7pm and
# Sundays between 1-2 am
#/dev/sda -d scsi -s L/../../3/18
#/dev/sdb -d scsi -s L/../../7/01

# Monitor 4 ATA disks connected to a 3ware 6/7/8000 controller which uses
# the 3w-xxxx driver. Start long self-tests Sundays between 1-2, 2-3, 3-4,
# and 4-5 am.
# NOTE: starting with the Linux 2.6 kernel series, the /dev/sdX interface
# is DEPRECATED.  Use the /dev/tweN character device interface instead.
# For example /dev/twe0, /dev/twe1, and so on.
#/dev/sdc -d 3ware,0 -a -s L/../../7/01
#/dev/sdc -d 3ware,1 -a -s L/../../7/02
#/dev/sdc -d 3ware,2 -a -s L/../../7/03
#/dev/sdc -d 3ware,3 -a -s L/../../7/04

# Monitor 2 ATA disks connected to a 3ware 9000 controller which
# uses the 3w-9xxx driver (Linux, FreeBSD). Start long self-tests Tuesdays
# between 1-2 and 3-4 am.
#/dev/twa0 -d 3ware,0 -a -s L/../../2/01
#/dev/twa0 -d 3ware,1 -a -s L/../../2/03

# Monitor 2 SATA (not SAS) disks connected to a 3ware 9000 controller which
# uses the 3w-sas driver (Linux). Start long self-tests Tuesdays
# between 1-2 and 3-4 am.
# On FreeBSD /dev/tws0 should be used instead
#/dev/twl0 -d 3ware,0 -a -s L/../../2/01
#/dev/twl0 -d 3ware,1 -a -s L/../../2/03

# Same as above for Windows. Option '-d 3ware,N' is not necessary,
# disk (port) number is specified in device name.
# NOTE: On Windows, DEVICESCAN works also for 3ware controllers.
#/dev/hdc,0 -a -s L/../../2/01
#/dev/hdc,1 -a -s L/../../2/03
#
# Monitor 2 disks connected to the first HP SmartArray controller which
# uses the cciss driver. Start long tests on Sunday nights and short
# self-tests every night and send errors to root
#/dev/sda -d cciss,0 -a -s (L/../../7/02|S/../.././02) -m root
#/dev/sda -d cciss,1 -a -s (L/../../7/03|S/../.././03) -m root

# Monitor 3 ATA disks directly connected to a HighPoint RocketRAID. Start long
# self-tests Sundays between 1-2, 2-3, and 3-4 am.
#/dev/sdd -d hpt,1/1 -a -s L/../../7/01
#/dev/sdd -d hpt,1/2 -a -s L/../../7/02
#/dev/sdd -d hpt,1/3 -a -s L/../../7/03

# Monitor 2 ATA disks connected to the same PMPort which connected to the
# HighPoint RocketRAID. Start long self-tests Tuesdays between 1-2 and 3-4 am
#/dev/sdd -d hpt,1/4/1 -a -s L/../../2/01
#/dev/sdd -d hpt,1/4/2 -a -s L/../../2/03

# HERE IS A LIST OF DIRECTIVES FOR THIS CONFIGURATION FILE.
# PLEASE SEE THE smartd.conf MAN PAGE FOR DETAILS
#
#   -d TYPE Set the device type: ata, scsi, marvell, removable, 3ware,N, hpt,L/M/N
#   -T TYPE set the tolerance to one of: normal, permissive
#   -o VAL  Enable/disable automatic offline tests (on/off)
#   -S VAL  Enable/disable attribute autosave (on/off)
#   -n MODE No check. MODE is one of: never, sleep, standby, idle
#   -H      Monitor SMART Health Status, report if failed
#   -l TYPE Monitor SMART log.  Type is one of: error, selftest
#   -f      Monitor for failure of any 'Usage' Attributes
#   -m ADD  Send warning email to ADD for -H, -l error, -l selftest, and -f
#   -M TYPE Modify email warning behavior (see man page)
#   -s REGE Start self-test when type/date matches regular expression (see man page)
#   -p      Report changes in 'Prefailure' Normalized Attributes
#   -u      Report changes in 'Usage' Normalized Attributes
#   -t      Equivalent to -p and -u Directives
#   -r ID   Also report Raw values of Attribute ID with -p, -u or -t
#   -R ID   Track changes in Attribute ID Raw value with -p, -u or -t
#   -i ID   Ignore Attribute ID for -f Directive
#   -I ID   Ignore Attribute ID for -p, -u or -t Directive
#   -C ID   Report if Current Pending Sector count non-zero
#   -U ID   Report if Offline Uncorrectable count non-zero
#   -W D,I,C Monitor Temperature D)ifference, I)nformal limit, C)ritical limit
#   -v N,ST Modifies labeling of Attribute N (see man page)
#   -a      Default: equivalent to -H -f -t -l error -l selftest -C 197 -U 198
#   -F TYPE Use firmware bug workaround. Type is one of: none, samsung
#   -P TYPE Drive-specific presets: use, ignore, show, showall
#    #      Comment: text after a hash sign is ignored
#    \      Line continuation character
# Attribute ID is a decimal integer 1 <= ID <= 255
# except for -C and -U, where ID = 0 turns them off.
# All but -d, -m and -M Directives are only implemented for ATA devices
#
# If the test string DEVICESCAN is the first uncommented text
# then smartd will scan for devices.
# DEVICESCAN may be followed by any desired Directives.
 
I just modified the DEVICESCAN line to include the -I options of the line below it that should ignore temperature. I'll see how that goes.
 
Remember to restart the smartd daemon!
 
Same for me. I get a email about a temperature issue every day. I added the -I option to the main line, but that does not change anything (yes I restarted the daemon, at some point even the whole machine):

Code:
DEVICESCAN -d removable -n standby -m root -I 194 -M exec /usr/share/smartmontools/smartd-runner

This is the email I get:

Code:
The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], Failed SMART usage Attribute: 194 Temperature_Celsius.

Device info:
CT240BX500SSD1, S/N:1933E1940072, WWN:0-000000-000000000, FW:M6CR013, 240 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed Feb 21 04:24:03 2024 CET
Another message will be sent in 24 hours if the problem persists.
 
I also have not managed to find a solution to this problem.

It seems unnecessary / misconfigured to send an alert out every 24 hrs for a PAST fault, not a present one. The temp on the SSD exceeded set one time (data center AC went out; was repaired the next day), now for over a year I've received an e-mail every day for an SSD that is working fine. Unfortunately, this results in me just deleting the SMART error e-mails without opening them now, meaning the probability of missing a NEW, current fault is relatively high.