Hello, on my PVE host I have a USB HDD connected for backups, ISOs, templates, and such. I just replaced an old drive with a *brand new* 4TB Seagate BarraCuda HDD in an external (powered) USB3 enclosure.
Today I received an email notification with the subject "SMART error (OfflineUncorrectableSector) detected on host: desert"
This message was generated by the smartd daemon running on:
host name: desert
DNS domain: REDACTED
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb [SAT], 134861051 Total offline uncorrectable sectors (changed +37792)
Device info:
ST4000DM004-2CV104, S/N:REDACTED, WWN:REDACTED, FW:0001, 4.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Mon Jan 8 09:36:43 2024 EST
Another message will be sent in 24 hours if the problem persists.
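As a rough sanity check on the number in the email, the reported count can be compared with the drive's total capacity. This is only arithmetic on the values quoted in this post. The helper name `fraction_of_drive` is made up for illustration, and it assumes the reported value really is a plain sector count, which may not hold for this drive or this USB bridge.

```python
# Values copied from the smartd email and the smartctl -a output in this post.
reported_sectors = 134_861_051       # "Total offline uncorrectable sectors"
capacity_bytes = 4_000_787_030_016   # "User Capacity"
logical_sector = 512                 # bytes, from "Sector Sizes"
physical_sector = 4096               # bytes, from "Sector Sizes"


def fraction_of_drive(sectors, sector_size, capacity):
    """Share of the drive that `sectors` sectors of `sector_size` bytes would cover."""
    return sectors * sector_size / capacity


if __name__ == "__main__":
    for label, size in (("logical", logical_sector), ("physical", physical_sector)):
        share = fraction_of_drive(reported_sectors, size, capacity_bytes)
        print(f"{label:8s} sectors: {share:.2%} of the drive")
```

If the resulting share looks far too large for a drive that passes its self-tests, that would suggest the raw value is not a literal count of bad sectors, but that is an inference and not something the tools state.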
Code:
>#pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.131-2-pve)
pve-manager: 7.4-17 (running version: 7.4-17/513c62be)
pve-kernel-5.15: 7.4-9
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.131-1-pve: 5.15.131-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.4-1
proxmox-backup-file-restore: 2.4.4-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-6
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.14-pve1
So, I went to my console and tested the new drive, but no errors were found anywhere...
Code:
>#smartctl -t short /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Tue Jan 9 10:58:43 2024 EST
Use smartctl -X to abort test.
>#smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate BarraCuda 3.5
Device Model: ST4000DM004-2CV104
Serial Number: REDACTED
LU WWN Device Id: REDACTED
Firmware Version: 0001
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5425 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 9 11:10:26 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 97 -
Code:
>#smartctl -t conveyance /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Conveyance self-test routine immediately in off-line mode".
Drive command "Execute SMART Conveyance self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Jan 9 11:06:24 2024 EST
Use smartctl -X to abort test.
>#smartctl -a /dev/sdb
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Conveyance offline Completed without error 00% 97 -
# 2 Short offline Completed without error 00% 97 -
I did get some hits in the syslog, though:
Code:
user@desert:/var/log$ sudo cat syslog | grep SMART | grep /dev/sdb
Jan 8 11:00:57 desert smartd[681]: Device: /dev/sdb [SAT], not capable of SMART Health Status check
Jan 8 11:00:58 desert smartd[681]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Jan 8 11:00:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 72 to 73
Jan 8 11:00:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 72 to 73
Jan 8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 73 to 82
Jan 8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 55
Jan 8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 41 to 45
Jan 8 11:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 73 to 82
Jan 8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 67
Jan 8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 55 to 56
Jan 8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 45 to 44
Jan 8 12:00:59 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 82 to 67
Jan 8 12:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 56 to 58
Jan 8 12:30:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 44 to 42
Jan 8 13:00:58 desert smartd[681]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 60
Jan 9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 67 to 68
Jan 9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 59
Jan 9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 41
Jan 9 00:17:05 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 67 to 68
Jan 9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 68 to 81
Jan 9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 57
Jan 9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 43
Jan 9 03:17:04 desert smartd[679]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 68 to 81
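To see the trend per attribute instead of reading individual lines, the smartd entries above could be condensed with a short script. This is a minimal sketch that assumes the exact line format shown in the syslog excerpt. The function name `summarize` and the regular expression are my own and not part of smartmontools.

```python
import re
from collections import defaultdict

# Matches smartd lines such as:
# "... Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 72 to 73"
PATTERN = re.compile(
    r"Device: (?P<dev>\S+) .*SMART (?:Prefailure|Usage) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def summarize(lines, device="/dev/sdb"):
    """Collect the sequence of logged values per attribute for one device."""
    history = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if not match or match.group("dev") != device:
            continue  # skip unrelated lines and other devices
        key = (int(match.group("id")), match.group("name"))
        if not history[key]:
            # First sighting: record the starting value as well.
            history[key].append(int(match.group("old")))
        history[key].append(int(match.group("new")))
    return dict(history)


if __name__ == "__main__":
    # Assumes the script is run from /var/log with read access to syslog.
    with open("syslog", errors="replace") as handle:
        for (attr_id, name), values in sorted(summarize(handle).items()):
            print(attr_id, name, "min", min(values), "max", max(values), "last", values[-1])
```

The output only restates what smartd already logged. It does not say whether a given value is healthy, so it is a reading aid, not a diagnosis.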
I will run a long test next, but why would a notification be sent when the self-tests find no errors? Which one should I believe, and what else should I be testing or checking?
Any help is greatly appreciated!