Overheating OS drive, no alert?

yusisushi

New Member
Feb 12, 2024
Hi

I suspect my NVMe OS drive is overheating. At random moments the Proxmox host becomes unreachable.

Today, for the first time, I saw an indication of what could be the cause. 83 degrees does not seem too high, but note the "Specified Maximum Operating Temperature" near the bottom.

smartctl /dev/sda |grep Temperature
[Attachment: 1710512605995.png]

Because of this, I suspect the OS drive is overheating. However, I haven't received any alerts about it. Email alerts do work, though: I get them when I send a test mail or restart smartd.
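
For context, a hedged sketch of why no mail may arrive: as far as I understand the smartd.conf man page, smartd only warns about temperature when limits are set with the `-W DIFF,INFO,CRIT` directive, and a mail is sent only when the CRIT limit is reached and `-m` is configured. The Python below only models that INFO/CRIT decision. The function name and the example limits 60 and 70 are illustrative assumptions, not values from this system.

```python
def classify_temperature(current_c, info_limit_c=0, crit_limit_c=0):
    """Model the INFO/CRIT part of smartd's '-W DIFF,INFO,CRIT' directive.

    A limit of 0 means that check is disabled.
    """
    if crit_limit_c and current_c >= crit_limit_c:
        return "crit"  # smartd logs at LOG_CRIT and mails if -m is set
    if info_limit_c and current_c >= info_limit_c:
        return "info"  # smartd only logs at LOG_INFO
    return "ok"


# 83 is the reading mentioned above; 60 and 70 are made-up example limits.
print(classify_temperature(83, info_limit_c=60, crit_limit_c=70))  # crit
print(classify_temperature(83))  # ok: no limits configured, so no mail
```

With no limits configured the reading is never classified as critical, which would match seeing test mails but no temperature alerts.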
 
Hi!
Could you please post the full output, including the drive model?
I don't know about the "Specified Maximum Operating Temperature", but looking at the "Min/Max Temperature Limit" and "Under/Over Temperature Limit Count", it says that you have never been above the limit, and thus you would never have gotten a warning.
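
The check described above can be sketched in Python. This is a minimal example, assuming the SCT lines have the layout that `smartctl -x` prints; the sample numbers are the ones from the `-x` output in this thread, and the function name is hypothetical.

```python
import re

# Lines in the layout printed by 'smartctl -x' (values from this thread).
SAMPLE = """\
Lifetime    Min/Max Temperature:     10/89 Celsius
Under/Over Temperature Limit Count:   0/0
Min/Max Temperature Limit:            0/100 Celsius
"""


def sct_limit_summary(text):
    """Extract the lifetime maximum, the upper limit and the over-limit count."""
    lifetime = re.search(r"Lifetime\s+Min/Max Temperature:\s+(-?\d+)/(-?\d+)", text)
    counts = re.search(r"Under/Over Temperature Limit Count:\s+(\d+)/(\d+)", text)
    limits = re.search(r"Min/Max Temperature Limit:\s+(-?\d+)/(-?\d+)", text)
    if not (lifetime and counts and limits):
        return None  # SCT status not present in this output
    return {
        "lifetime_max_c": int(lifetime.group(2)),
        "max_limit_c": int(limits.group(2)),
        "over_limit_count": int(counts.group(2)),
    }


summary = sct_limit_summary(SAMPLE)
print(summary)
```

An over-limit count of 0 together with a lifetime maximum below the limit means the drive itself never registered an over-temperature event.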
 
Sure, here is the -a output:

Code:
root@pve:~# smartctl /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

ATA device successfully opened

Use 'smartctl -a' (or '-x') to print SMART (and more) information

root@pve:~# smartctl -a /dev/sda |grep Temperature
190 Temperature_Case        0x0032   028   072   000    Old_age   Always       -       28 (Min/Max 6/72)
root@pve:~# smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 545s Series SSDs
Device Model:     INTEL SSDSCKKW128G8
Serial Number:    BTLA80410NQ8128I
LU WWN Device Id: 5 5cd2e4 14ef64d80
Firmware Version: LHF002C
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      M.2
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 21 18:49:39 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  15) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6278
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2206
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0033   093   093   005    Pre-fail  Always       -       476752183360
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       312
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Temperature_Case        0x0032   028   072   000    Old_age   Always       -       28 (Min/Max 6/72)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       312
199 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       303405
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       0
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       0
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   093   093   000    Old_age   Always       -       0
236 Unknown_Attribute       0x0032   094   094   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       303405
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       224500
249 NAND_Writes_1GiB        0x0032   100   100   000    Old_age   Always       -       14381
252 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       111

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6278         -
# 2  Short offline       Completed without error       00%      6277         -
# 3  Short offline       Completed without error       00%      6276         -
# 4  Short offline       Completed without error       00%      6275         -
# 5  Extended offline    Completed without error       00%       504         -
# 6  Short offline       Completed without error       00%       504         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
And here is the -x output as well:

Code:
root@pve:~# smartctl -x /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 545s Series SSDs
Device Model:     INTEL SSDSCKKW128G8
Serial Number:    BTLA80410NQ8128I
LU WWN Device Id: 5 5cd2e4 14ef64d80
Firmware Version: LHF002C
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      M.2
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 21 18:50:45 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Write SCT (Get) Feature Control Command failed: scsi error badly formed scsi parameters
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  15) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    6278
 12 Power_Cycle_Count       -O--CK   100   100   000    -    2206
170 Unknown_Attribute       PO--CK   100   100   010    -    0
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Unknown_Attribute       PO--CK   093   093   005    -    476752183360
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    312
183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Temperature_Case        -O--CK   029   072   000    -    29 (Min/Max 6/72)
192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    312
199 CRC_Error_Count         -O--CK   100   100   000    -    0
225 Host_Writes_32MiB       -O--CK   100   100   000    -    303405
226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    0
227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    0
228 Workload_Minutes        -O--CK   100   100   000    -    0
232 Available_Reservd_Space PO--CK   100   100   010    -    0
233 Media_Wearout_Indicator -O--CK   093   093   000    -    0
236 Unknown_Attribute       -O--CK   094   094   000    -    0
241 Host_Writes_32MiB       -O--CK   100   100   000    -    303405
242 Host_Reads_32MiB        -O--CK   100   100   000    -    224500
249 NAND_Writes_1GiB        -O--CK   100   100   000    -    14381
252 Unknown_Attribute       -O--CK   100   100   000    -    111
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01       GPL,SL  R/O      1  Summary SMART error log
0x02       GPL,SL  R/O      1  Comprehensive SMART error log
0x03       GPL,SL  R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06       GPL,SL  R/O      1  SMART self-test log
0x07       GPL,SL  R/O      1  Extended self-test log
0x09       GPL,SL  R/W      1  Selective self-test log
0x10       GPL,SL  R/O      1  NCQ Command Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xdf       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6278         -
# 2  Short offline       Completed without error       00%      6277         -
# 3  Short offline       Completed without error       00%      6276         -
# 4  Short offline       Completed without error       00%      6275         -
# 5  Extended offline    Completed without error       00%       504         -
# 6  Short offline       Completed without error       00%       504         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       0 (0x0000)
Device State:                        Active (0)
Current Temperature:                    37 Celsius
Power Cycle Min/Max Temperature:     22/37 Celsius
Lifetime    Min/Max Temperature:     10/89 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/100 Celsius
Min/Max Temperature Limit:            0/100 Celsius
Temperature History Size (Index):    128 (110)

Index    Estimated Time   Temperature Celsius
 111    2024-03-21 16:43    87  ***************************************+
 ...    ..( 25 skipped).    ..  ***************************************+
   9    2024-03-21 17:09    87  ***************************************+
  10    2024-03-21 17:10    88  ***************************************+
 ...    ..(  8 skipped).    ..  ***************************************+
  19    2024-03-21 17:19    88  ***************************************+
  20    2024-03-21 17:20    89  ***************************************+
  21    2024-03-21 17:21    88  ***************************************+
  22    2024-03-21 17:22    88  ***************************************+
  23    2024-03-21 17:23    88  ***************************************+
  24    2024-03-21 17:24    89  ***************************************+
 ...    ..( 16 skipped).    ..  ***************************************+
  41    2024-03-21 17:41    89  ***************************************+
  42    2024-03-21 17:42    88  ***************************************+
  43    2024-03-21 17:43    87  ***************************************+
  44    2024-03-21 17:44    86  ***************************************+
  45    2024-03-21 17:45    86  ***************************************+
  46    2024-03-21 17:46    41  **********************
  47    2024-03-21 17:47    37  ******************
  48    2024-03-21 17:48    37  ******************
  49    2024-03-21 17:49    38  *******************
  50    2024-03-21 17:50    40  *********************
  51    2024-03-21 17:51    40  *********************
  52    2024-03-21 17:52    28  *********
  53    2024-03-21 17:53    39  ********************
  54    2024-03-21 17:54    38  *******************
  55    2024-03-21 17:55    40  *********************
  56    2024-03-21 17:56    40  *********************
  57    2024-03-21 17:57    40  *********************
  58    2024-03-21 17:58    37  ******************
  59    2024-03-21 17:59    25  ******
  60    2024-03-21 18:00    36  *****************
  61    2024-03-21 18:01    35  ****************
  62    2024-03-21 18:02    37  ******************
  63    2024-03-21 18:03    36  *****************
  64    2024-03-21 18:04    36  *****************
  65    2024-03-21 18:05    37  ******************
  66    2024-03-21 18:06    37  ******************
  67    2024-03-21 18:07    36  *****************
 ...    ..(  2 skipped).    ..  *****************
  70    2024-03-21 18:10    36  *****************
  71    2024-03-21 18:11    40  *********************
  72    2024-03-21 18:12    41  **********************
  73    2024-03-21 18:13    43  ************************
 ...    ..( 22 skipped).    ..  ************************
  96    2024-03-21 18:36    43  ************************
  97    2024-03-21 18:37    22  ***
  98    2024-03-21 18:38    34  ***************
  99    2024-03-21 18:39    32  *************
 100    2024-03-21 18:40    34  ***************
 101    2024-03-21 18:41    35  ****************
 102    2024-03-21 18:42    35  ****************
 103    2024-03-21 18:43    36  *****************
 104    2024-03-21 18:44    36  *****************
 105    2024-03-21 18:45    36  *****************
 106    2024-03-21 18:46    37  ******************
 ...    ..(  2 skipped).    ..  ******************
 109    2024-03-21 18:49    37  ******************
 110    2024-03-21 18:50    87  ***************************************+

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4            2206  ---  Lifetime Power-On Resets
0x01  0x010  4            6278  ---  Power-on Hours
0x01  0x018  6     19883966818  ---  Logical Sectors Written
0x01  0x020  6       220460450  ---  Number of Write Commands
0x01  0x028  6     14712841542  ---  Logical Sectors Read
0x01  0x030  6       220653932  ---  Number of Read Commands
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4             312  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              37  ---  Current Temperature
0x05  0x010  1              42  ---  Average Short Term Temperature
0x05  0x018  1               -  ---  Average Long Term Temperature
0x05  0x020  1              57  ---  Highest Temperature
0x05  0x028  1              30  ---  Lowest Temperature
0x05  0x030  1              42  ---  Highest Average Short Term Temperature
0x05  0x038  1              42  ---  Lowest Average Short Term Temperature
0x05  0x040  1               -  ---  Highest Average Long Term Temperature
0x05  0x048  1               -  ---  Lowest Average Long Term Temperature
0x05  0x050  4              10  ---  Time in Over-Temperature
0x05  0x058  1              85  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4           25807  ---  Number of Hardware Resets
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1              11  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0005  2            0  R_ERR response for non-data FIS
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET

I should add that I'm no longer sure what to make of this. I did some additional testing last weekend and tried to overheat the system on purpose.
I got the CPU up to 95 and the drive up to 87 °C, and it never once became unresponsive...

So I'm still unsure what actually happens when the system stops responding.
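
To line the hot periods up with the times the host was unreachable, the SCT temperature history can be filtered for readings above the "Specified Maximum Operating Temperature" of 85 reported in the Device Statistics. A minimal Python sketch, assuming the row layout shown in the `-x` output above; the sample is an excerpt of those rows and the function name is hypothetical.

```python
import re

# Excerpt of the SCT temperature history from the -x output above.
HISTORY_SAMPLE = """\
  44    2024-03-21 17:44    86  ***************************************+
  45    2024-03-21 17:45    86  ***************************************+
  46    2024-03-21 17:46    41  **********************
  47    2024-03-21 17:47    37  ******************
 ...    ..(  2 skipped).    ..  ******************
 109    2024-03-21 18:49    37  ******************
 110    2024-03-21 18:50    87  ***************************************+
"""

# index, timestamp, temperature; "skipped" rows have no index and are ignored
ROW = re.compile(
    r"^\s*\d+\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s+(-?\d+)\s", re.MULTILINE
)


def readings_above(history_text, limit_c):
    """Return (timestamp, temperature) pairs strictly above limit_c."""
    return [
        (stamp, int(temp))
        for stamp, temp in ROW.findall(history_text)
        if int(temp) > limit_c
    ]


hot = readings_above(HISTORY_SAMPLE, limit_c=85)
for stamp, temp in hot:
    print(stamp, temp)
```

Rows collapsed into "skipped" lines are not expanded here, so the result lists only explicitly printed samples.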
 
Luckily, Intel was one of the few companies that actually documented its SMART attributes. See page 23: https://www.bhphotovideo.com/lit_files/397146.pdf
[Attachment: 1711044196509.png]
So this is indeed Celsius: your SSD is currently at 28 °C with a recorded maximum of 72 °C, while the maximum temperature according to the datasheet is 70 °C. So yes, it was too hot at least once and probably thermal-throttled.
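
The same comparison can be scripted. This is a rough Python sketch, assuming the attribute line has the layout shown in the `-a` output above; the function name is hypothetical and the limit of 70 is the datasheet value quoted in this post.

```python
import re

# Attribute 190 exactly as printed in the -a output above.
ATTR_LINE = (
    "190 Temperature_Case        0x0032   028   072   000    "
    "Old_age   Always       -       28 (Min/Max 6/72)"
)

DATASHEET_MAX_C = 70  # maximum temperature quoted from the datasheet


def parse_temperature_attribute(line):
    """Return (current, lowest, highest) from a 'NN (Min/Max A/B)' raw value."""
    match = re.search(r"\s(\d+)\s+\(Min/Max\s+(-?\d+)/(-?\d+)\)\s*$", line)
    if match is None:
        return None  # raw value is not in the expected format
    current, lowest, highest = (int(group) for group in match.groups())
    return current, lowest, highest


current, lowest, highest = parse_temperature_attribute(ATTR_LINE)
print(current, lowest, highest)
print(highest > DATASHEET_MAX_C)
```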
 