SMART error (OfflineUncorrectableSector) detected on host

alex_7628

New Member
Jan 20, 2023
Hi everyone,

I recently set up two new servers running Proxmox VE 7.1.

Server ProLiant DL360 Gen10
System ROM U32 v2.68 (07/14/2022)
System ROM Date 07/14/2022
Redundant System ROM U32 v2.66 (05/17/2022)
2 x Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
128 GB DDR4 RDIMM
RAID controller HPE MR416i-a Gen10+
2 x 3.84 TB SATA SSDs in RAID 1 for VM data

The system itself is installed on a separate RAID 1 array (120 GB).

Two days after installation I started receiving emails like this:

This message was generated by the smartd daemon running on:


The following warning/error was logged by the smartd daemon:

Device: /dev/bus/0 [megaraid_disk_01] [SAT], 56 Offline uncorrectable sectors

Device info:
HFS3T8G32FEH-7410C, S/N:FN07N6904I0407P1S, WWN:5-ace42e-0251a7455, FW:90030Q00, 3.84 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed Jan 11 22:12:29 2023 +06
Another message will be sent in 24 hours if the problem persists.
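
For reference, the device name in the mail above can be turned into a smartctl call against the physical disk behind the controller. This is only a minimal sketch: the mapping from `megaraid_disk_NN` to `-d megaraid,NN` is an assumption based on how smartd usually names disks behind MegaRAID controllers, and the helper name `smartctl_command` is made up for illustration.

```python
import re

# Pattern for the device line in the smartd warning mail quoted above.
# Assumption: "megaraid_disk_NN" corresponds to the smartctl option
# "-d megaraid,NN" on the same /dev/bus/N path.
DEVICE_LINE = re.compile(
    r"Device:\s+(?P<path>\S+)\s+\[megaraid_disk_(?P<disk>\d+)\].*?,\s+"
    r"(?P<count>\d+)\s+Offline uncorrectable sectors"
)


def smartctl_command(line):
    """Build the smartctl argument list for the disk named in a smartd warning.

    Returns None when the line does not look like a MegaRAID warning.
    """
    m = DEVICE_LINE.search(line)
    if m is None:
        return None
    disk = int(m.group("disk"))  # "01" -> 1
    return ["smartctl", "-a", "-d", f"megaraid,{disk}", m.group("path")]


line = "Device: /dev/bus/0 [megaraid_disk_01] [SAT], 56 Offline uncorrectable sectors"
print(" ".join(smartctl_command(line)))
```

The sketch only builds the command; running it (for example with `subprocess.run`) needs root on the host.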


Running smartctl against the 3.84 TB RAID 1 logical volume reports no errors, but it shows a temperature of 0 C and says SMART is unavailable.
Shell output:

root@HEALAHASRVPMX01:~# smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HPE
Product: MR416i-a Gen10+
Revision: 5.16
Compliance: SPC-3
User Capacity: 3,840,204,079,104 bytes [3.84 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
Rotation Rate: Solid State Device
Logical Unit id: 0x600062b209ad60c02b51970c50e94320
Serial number: 002043e9500c97512bc060ad09b26200
Device type: disk
Local Time is: Fri Jan 20 22:27:43 2023 +06
SMART support is: Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C

Error Counter logging not supported

Device does not support Self Test logging



I am going to deploy production VMs in two weeks.
Could you please advise whether this points to a RAID controller incompatibility or to a faulty SSD?


Much appreciated, Alex.
 
In addition:

I have also queried each of the two SSDs in the array individually:

root@HEALAHASRVPMX01:~# smartctl -a -T permissive /dev/sdb -d megaraid,00
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: HFS3T8G32FEH-7410C
Serial Number: FJ07N7686I0107T47
LU WWN Device Id: 5 ace42e 0251ab4ec
Firmware Version: 90030Q00
User Capacity: 3,840,755,982,336 bytes [3.84 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4, ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Jan 21 00:26:47 2023 +06
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x19) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 4) minutes.
SCT capabilities: (0x0025) SCT Status supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 006 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 25
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 348
11 Unknown_SSD_Attribute 0x0012 100 100 000 Old_age Always - 17
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 26
171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
173 Unknown_Attribute 0x0033 100 100 001 Pre-fail Always - 648832
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 17
175 Program_Fail_Count_Chip 0x0033 100 100 050 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x003b 100 100 006 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 000 Old_age Always - 0
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 000 Old_age Always - 18
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 078 065 000 Old_age Always - 22 (Min/Max 18/35)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0033 100 100 036 Pre-fail Always - 25
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 96
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
201 Unknown_SSD_Attribute 0x000e 100 100 000 Old_age Always - 0
204 Soft_ECC_Correction 0x000e 100 100 000 Old_age Always - 0
231 Unknown_SSD_Attribute 0x0033 100 100 001 Pre-fail Always - 100
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 9287
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 8972
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 9733
250 Read_Error_Retry_Rate 0x0032 100 100 000 Old_age Always - 558669

Read SMART Error Log failed: megasas_cmd result: 0.0 = 0/45

SMART Self-test Log not supported

Selective Self-tests/Logging not supported

root@HEALAHASRVPMX01:~# smartctl -a -T permissive /dev/sdb -d megaraid,01
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: HFS3T8G32FEH-7410C
Serial Number: FN07N6904I0407P1X
LU WWN Device Id: 5 ace42e 0251a745a
Firmware Version: 90030Q00
User Capacity: 3,840,755,982,336 bytes [3.84 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4, ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Jan 21 00:27:18 2023 +06
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x19) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 4) minutes.
SCT capabilities: (0x0025) SCT Status supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 006 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 098 098 036 Pre-fail Always - 254
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 348
11 Unknown_SSD_Attribute 0x0012 100 100 000 Old_age Always - 17
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 26
171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
173 Unknown_Attribute 0x0033 100 100 001 Pre-fail Always - 599936
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 17
175 Program_Fail_Count_Chip 0x0033 100 100 050 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x003b 100 001 006 Pre-fail Always In_the_past 0
181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 000 Old_age Always - 0
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 000 Old_age Always - 3894
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 077 066 000 Old_age Always - 23 (Min/Max 18/34)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0033 098 098 036 Pre-fail Always - 254
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 272
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
201 Unknown_SSD_Attribute 0x000e 100 100 000 Old_age Always - 0
204 Soft_ECC_Correction 0x000e 100 100 000 Old_age Always - 0
231 Unknown_SSD_Attribute 0x0033 100 100 001 Pre-fail Always - 100
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 9296
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 8972
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 9702
250 Read_Error_Retry_Rate 0x0032 100 100 000 Old_age Always - 578112

Read SMART Error Log failed: megasas_cmd result: 0.1 = 0/45

SMART Self-test Log not supported

Selective Self-tests/Logging not supported
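
To make the two attribute tables above easier to compare, the rows can be parsed and filtered with a short script. This is a rough sketch, not a diagnosis tool: the set of watched attribute IDs is an assumption, picked from the rows that show non-zero raw values or a WHEN_FAILED entry in the output above, and the function names are hypothetical.

```python
import re

# Assumption for illustration: IDs taken from the rows that stand out above.
WATCHED_IDS = {5, 180, 184, 196, 198}

# One row of the "Vendor Specific SMART Attributes with Thresholds" table.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return {id: row dict} for every attribute row found in smartctl output."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            row = m.groupdict()
            for key in ("id", "value", "worst", "thresh", "raw"):
                row[key] = int(row[key])
            rows[row["id"]] = row
    return rows


def suspicious(rows):
    """List attributes with a recorded failure, plus watched ones with raw > 0."""
    found = []
    for attr_id in sorted(rows):
        row = rows[attr_id]
        failed = row["when_failed"] != "-"
        if failed or (attr_id in WATCHED_IDS and row["raw"] > 0):
            found.append((attr_id, row["name"], row["raw"], row["when_failed"]))
    return found


# A few rows copied from the megaraid,01 output above.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 098 098 036 Pre-fail Always - 254
180 Unused_Rsvd_Blk_Cnt_Tot 0x003b 100 001 006 Pre-fail Always In_the_past 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 272
"""

for item in suspicious(parse_attributes(SAMPLE)):
    print(item)
```

Feeding the full output of both disks through `parse_attributes` gives two dictionaries that can be compared attribute by attribute.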