SMART error notification but no SMART error on test

StackIOI

Hello, on my PVE I have a USB HDD connected for Backups, ISOs, Templates and such... I just replaced an old drive with a *brand new* 4TB Seagate BarraCuda HDD in an external (powered) USB3 enclosure.

Today I received an email notification with the subject "SMART error (OfflineUncorrectableSector) detected on host: desert"

This message was generated by the smartd daemon running on:

host name: desert
DNS domain: REDACTED

The following warning/error was logged by the smartd daemon:

Device: /dev/sdb [SAT], 134861051 Total offline uncorrectable sectors (changed +37792)

Device info:
ST4000DM004-2CV104, S/N:REDACTED, WWN:REDACTED, FW:0001, 4.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Mon Jan 8 09:36:43 2024 EST
Another message will be sent in 24 hours if the problem persists.

Code:
>#pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.131-2-pve)
pve-manager: 7.4-17 (running version: 7.4-17/513c62be)
pve-kernel-5.15: 7.4-9
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.131-1-pve: 5.15.131-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.4-1
proxmox-backup-file-restore: 2.4.4-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-6
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.14-pve1

So, I went to my console and tested the new drive, but no errors were found anywhere...

Code:
>#smartctl -t short /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Tue Jan  9 10:58:43 2024 EST
Use smartctl -X to abort test.


>#smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM004-2CV104
Serial Number:    REDACTED
LU WWN Device Id: REDACTED
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  9 11:10:26 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        97         -

Code:
>#smartctl -t conveyance /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Conveyance self-test routine immediately in off-line mode".
Drive command "Execute SMART Conveyance self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Jan  9 11:06:24 2024 EST
Use smartctl -X to abort test.


>#smartctl -a /dev/sdb
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%        97         -
# 2  Short offline       Completed without error       00%        97         -

Got some hits here though...

Code:
user@desert:/var/log$ sudo cat syslog | grep SMART | grep /dev/sdb
Jan  8 11:00:57 desert smartd[681]: Device: /dev/sdb [SAT], not capable of SMART Health Status check
Jan  8 11:00:58 desert smartd[681]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Jan  8 11:00:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 72 to 73
Jan  8 11:00:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 72 to 73
Jan  8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 73 to 82
Jan  8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 55
Jan  8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 41 to 45
Jan  8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 73 to 82
Jan  8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 67
Jan  8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 55 to 56
Jan  8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 45 to 44
Jan  8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 82 to 67
Jan  8 12:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 56 to 58
Jan  8 12:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 44 to 42
Jan  8 13:00:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 60

Jan  9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 67 to 68
Jan  9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 59
Jan  9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 41
Jan  9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 67 to 68

Jan  9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 68 to 81
Jan  9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 57
Jan  9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 43
Jan  9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 68 to 81
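
A quick way to read the log lines above is to group the reported changes by attribute ID. The Python sketch below is only an illustration: the function name `changes_by_attribute` is made up for this example, and the log lines are copied from the excerpt above. It assumes that smartd is tracking normalized attribute values here (its default unless raw tracking is configured), in which case these entries show normalized values drifting up and down rather than a count of errors.

```python
import re

# Lines for attributes 1 and 195, copied from the syslog excerpt above.
SYSLOG = """\
Jan  8 11:00:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 72 to 73
Jan  8 11:00:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 72 to 73
Jan  8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 73 to 82
Jan  8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 73 to 82
Jan  8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 67
Jan  8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 82 to 67
Jan  9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 67 to 68
Jan  9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 67 to 68
Jan  9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 68 to 81
Jan  9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 68 to 81
"""

# Matches the "Attribute: <id> <name> changed from <old> to <new>" part of a line.
CHANGE = re.compile(r"Attribute: (\d+) (\S+) changed from (\d+) to (\d+)")


def changes_by_attribute(text):
    """Map attribute ID -> list of (old, new) value pairs, in log order."""
    result = {}
    for line in text.splitlines():
        match = CHANGE.search(line)
        if match:
            attr_id = int(match.group(1))
            pair = (int(match.group(3)), int(match.group(4)))
            result.setdefault(attr_id, []).append(pair)
    return result


if __name__ == "__main__":
    changes = changes_by_attribute(SYSLOG)
    for attr_id, pairs in sorted(changes.items()):
        print(attr_id, pairs)
    # Attributes 1 and 195 report identical transitions in this excerpt.
    print("1 and 195 move in lockstep:", changes[1] == changes[195])
```

In this excerpt the values for attributes 1 and 195 go both up and down and always change together, which looks more like ordinary fluctuation than like an accumulating error count.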

I will run a long test next, but why would a notification be sent when the tests detect no errors? Which one should I believe? What else should I be testing or checking?

Any help is greatly appreciated!
 
Today I received another email notification reporting OfflineUncorrectableSector:

This message was generated by the smartd daemon running on:

host name: desert
DNS domain: REDACTED

The following warning/error was logged by the smartd daemon:

Device: /dev/sdb [SAT], 26948760 Total offline uncorrectable sectors (changed +1715888)

Device info:
ST4000DM004-2CV104, S/N:REDACTED, WWN:REDACTED, FW:0001, 4.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Mon Jan 8 09:36:43 2024 EST
Another message will be sent in 24 hours if the problem persists.

Code:
>#smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM004-2CV104
Serial Number:    REDACTED
LU WWN Device Id: REDACTED
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  9 11:10:26 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       106         -
# 2  Conveyance offline  Completed without error       00%        97         -
# 3  Short offline       Completed without error       00%        97         -

Thanks
 
smartd/smartctl is not perfect, and SMART attributes are unfortunately not standardized. smartd warns when important ones such as OfflineUncorrectableSector go up, but given how much the value increased, smartd is probably misreading it: on this drive that SMART attribute most likely does not count offline uncorrectable sectors.
Maybe you can find out from the drive manufacturer what each attribute means and update the smartd database, or just your local smartd configuration.
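
As a rough sanity check on the numbers in the two emails quoted earlier, here is a minimal Python sketch. The helper names (`parse_report`, `is_non_decreasing`, `sectors_to_gib`) are made up for this example, and the size conversion assumes the count is in 512-byte logical sectors, which is the logical sector size smartctl reports for this drive. It only tests whether the reported totals behave like a cumulative count of bad sectors.

```python
import re

# The two warning lines from the smartd emails quoted in this thread.
REPORTS = [
    "Device: /dev/sdb [SAT], 134861051 Total offline uncorrectable sectors (changed +37792)",
    "Device: /dev/sdb [SAT], 26948760 Total offline uncorrectable sectors (changed +1715888)",
]

PATTERN = re.compile(
    r"(\d+) Total offline uncorrectable sectors \(changed ([+-]\d+)\)"
)

# Assumption: sectors are counted in 512-byte logical sectors.
SECTOR_BYTES = 512


def parse_report(line):
    """Return (total, delta) from one smartd warning line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized smartd line: {line!r}")
    return int(match.group(1)), int(match.group(2))


def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def sectors_to_gib(sectors):
    """Convert a sector count to GiB under the 512-byte assumption."""
    return sectors * SECTOR_BYTES / 2**30


if __name__ == "__main__":
    totals = [parse_report(line)[0] for line in REPORTS]
    print("totals:", totals)
    print("behaves like a cumulative counter:", is_non_decreasing(totals))
    print("first total as GiB:", round(sectors_to_gib(totals[0]), 1))
```

The second total is smaller than the first even though both emails report a positive change, and the first total would correspond to roughly 64 GiB of unreadable sectors on a drive whose short, conveyance and extended self-tests all completed without error. Neither fits a real count of uncorrectable sectors, which supports the idea that the raw value is being misinterpreted.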
 
I agree. However, when I run SMART checks on all disks of a multiboot system, with Debian bookworm on both instances (PBS and the other instance), I detect no errors. As stated in my linked entry, this has only occurred since a clean install of the latest PBS.
 
