[SOLVED] Smart Error

ffuentes

Nov 15, 2016

Team,

I keep getting the following errors via email:

Code:
This message was generated by the smartd daemon running on:

   host name:  hera
   DNS domain: domain.net

The following warning/error was logged by the smartd daemon:

Device: /dev/bus/0 [megaraid_disk_10], Read SMART Self-Test Log Failed

Device info:
[IBM-ESXS MK3001GRRB       SC29], lu id: 0x50000394e819f7b0, S/N: SNHERE, 300 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Nov 11 09:50:19 2016 CST
Another message will be sent in 24 hours if the problem persists.

The problem is that the system's local storage is not reporting any issues: the RAID console shows the drive and the pool as healthy.

Any ideas?
 
SMART can only check disks that are attached directly. For hardware RAID, use the RAID tools provided by the controller's manufacturer.
 
fireon,

Thanks for your reply.
The problem is not checking the disk. As my previous post states, the problem is that I am getting that email even though, according to the RAID console, there are no issues with the disk itself.

Any ideas how to make the email stop?

TIA.
 
systemctl stop smartd.service
systemctl disable smartd.service

...should help
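
If you would rather keep smartd running for the other disks, a narrower option might be to monitor the RAID member disks for health status only, so that smartd never tries to read the self-test log. The sketch below just prints candidate /etc/smartd.conf lines. The function name is made up for illustration, and /dev/bus/0 with disk ID 10 is taken from the email above. I have not verified how explicit entries interact with a DEVICESCAN line on a MegaRAID controller, so treat this as a starting point rather than a tested fix.

```shell
#!/bin/sh
# Hypothetical helper: print one smartd.conf entry per MegaRAID disk ID.
# Each entry uses -H (health status check only) instead of -a, so smartd
# does not try to read the self-test log (-l selftest) for that disk.
# -m root mails warnings to root; adjust as needed.
smartd_health_only_lines() {
    dev=$1
    shift
    for id in "$@"; do
        printf '%s -d megaraid,%s -H -m root\n' "$dev" "$id"
    done
}

# Example: entries for disks 4 and 10 behind /dev/bus/0
smartd_health_only_lines /dev/bus/0 4 10
```

Review the output before adding it to /etc/smartd.conf (you may need to comment out the DEVICESCAN line), then restart the daemon with systemctl restart smartd.service.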
 
Thank you, this helped me investigate the problem a bit further. I have 9 identical Seagate ST300MM0006 HDDs on an LSI MegaRAID SAS RAID controller, and I have been getting the following warning by email every day for a few weeks now, which is making me nervous:
This message was generated by the smartd daemon running on:

host name: proxmox
DNS domain: my.domain

The following warning/error was logged by the smartd daemon:

Device: /dev/bus/0 [megaraid_disk_04], Read SMART Self-Test Log Failed

Device info:
[SEAGATE ST300MM0006 6102], lu id: 0x5000c50070ac939f, S/N: {serial_number_deleted}, 300 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sat Dec 31 16:33:52 2016 CET
Another message will be sent in 24 hours if the problem persists.
So apparently there is something wrong with HDD #4, which is actually the first in the list:
root@proxmox:~# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
/dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
/dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
Then I found the command "smartctl -a -d megaraid,N /dev/sdX", which should return the SMART values of an individual HDD inside the RAID volume. If I run it for HDD #4, it reports that the device lacks SMART capability:
root@proxmox:~# smartctl -a -d megaraid,4 /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.35-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST300MM0006
Revision: 6102
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: 10500 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c50070ac939f
Serial number: {serial_number_deleted}
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sun Jan 1 18:45:06 2017 CET
SMART support is: Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C

Error Counter logging not supported

Device does not support Self Test logging
If I run the same command on HDDs #5 - #12 I get "normal" SMART results:
root@proxmox:~# smartctl -a -d megaraid,5 /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.35-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST300MM0006
Revision: 6102
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: 10500 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c50070ab10bf
Serial number: {serial_number_deleted}
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sun Jan 1 18:47:52 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 20 C
Drive Trip Temperature: 68 C

Manufactured in week 46 of year 2013
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 41
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 1048
Elements in grown defect list: 38

Vendor (Seagate) cache information
Blocks sent to initiator = 3020535767
Blocks received from initiator = 2824080073
Blocks read from cache and sent to initiator = 2104971126
Number of read and write commands whose size <= segment size = 93629214
Number of read and write commands whose size > segment size = 5897

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 24296.05
number of minutes until next internal SMART test = 5

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1229634459       36         0  1229634495         36      69107.906           0
write:           0        0         0           0          0       7758.889           0
verify: 3515682185        0         0  3515682185          0      33099.757           0

Non-medium error count: 302

No self-tests have been logged
So how can a single HDD suddenly lack SMART capability?
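
To compare all disks behind the controller in one pass instead of running the command by hand for each ID, something like the following might help. It is only a sketch: the function names are invented for illustration, /dev/bus/0 is assumed from the scan output above, and the parsing relies on the output format shown in this post (smartctl 6.6).

```shell
#!/bin/sh
# Extract the MegaRAID disk IDs from `smartctl --scan` output on stdin.
# Matching lines look like:
#   /dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
megaraid_ids() {
    sed -n 's/^[^#]* -d megaraid,\([0-9][0-9]*\) .*/\1/p'
}

# Print the value of the first "SMART support is:" line from smartctl
# output on stdin (for example "Available - device has SMART capability.").
smart_support() {
    sed -n 's/^SMART support is: *//p' | head -n 1
}

# Hypothetical wrapper: report the SMART support line for every MegaRAID
# disk found by the scan. Needs root and a real controller, so it is
# defined here but not called.
report_megaraid_support() {
    dev=${1:-/dev/bus/0}
    smartctl --scan | megaraid_ids | while read -r id; do
        status=$(smartctl -i -d "megaraid,$id" "$dev" </dev/null | smart_support)
        printf 'megaraid,%s: %s\n' "$id" "${status:-no SMART support line found}"
    done
}
```

Running report_megaraid_support /dev/bus/0 as root should print one line per disk, which makes an odd one out such as disk 4 easy to spot.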
 
I have a Dell R515 server that I upgraded so that all 8 drives are solid state. Specifically, I have 8 of these SSDs in a RAID 10:

Code:
Samsung V-NAND SSD 860 EVO 1TB SATA 6Gbps
    Model MZ-76E1T0
    Model Code: MZ-76E1T0E

After a few months, Proxmox kept emailing me this error regarding disk 5:
Code:
Device: /dev/bus/1 [megaraid_disk_05] [SAT], Read SMART Self-Test Log Failed

While the server was on, I looked at the drive lights and noticed that disk 5 had no light at all. Whenever I rebooted the server, the light on disk 5's drive tray would come back on, but it wouldn't be long before Proxmox emailed me the same error and the light went off again while the server was running.

Two days ago, Proxmox emailed the same error again regarding that same drive. Yesterday, I removed that hot-swappable drive (actually, I'm not even sure that this SSD is designed to be hot-swappable) from the server while the server was running. I unscrewed it from the drive tray and hooked it up to my Kubuntu 20.04 laptop via a USB drive dock.

I used GParted to view the drive, and it appeared that the drive had nothing on it, not even a partition table. So I created a GPT partition table with GParted and then created an ext4 partition at maximum size. These details probably don't matter; I just wanted to remove anything that might still be on the disk. After this, I removed the ext4 partition, so that the drive had no partitions on it. I really didn't know what the RAID 10 needed me to do to get that SSD into an optimal state for a rebuild, but I'm just telling you what I did.

After this, I screwed the drive back into its drive tray and reinserted it into the server's drive bay. When I plugged it back in, its drive light came on, and I could see that drive 4 was rebuilding drive 5, because both of their green lights were blinking rapidly while all of the other drive lights were at their normal activity levels.

Later that day, I checked the lights again, and all lights were normal green with equal activity. So I assume the drive was rebuilt by the RAID 10 successfully.

So far, everything is normal, and I've received no further notifications from Proxmox regarding drive 5.

I'm not sure my procedure was proper, so I'm not advising you to follow it; I'm simply sharing what I did to make the error notifications go away. One dangerous thing I did was hot-swapping that SSD without knowing whether it was designed for hot-swapping. It would have been safer (I suspect) to perform these steps while the server was completely off.

UPDATE: I was wrong. This didn't fix my issue; I just got the same error again, and I don't know how to fix it. When I replaced all the hard drives in this server with SSDs, the guy I bought this from warned me that there could be consequences like this. I can't remember exactly what he said, but I do recall him saying that it would work but might generate some type of errors.
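
For anyone who wants to dig further before swapping hardware, the "Device:" line in the email already contains what smartctl needs to query that one disk through the controller. The sketch below turns such a line into a smartctl command. The function name is invented for illustration, and it assumes the warning format quoted in this thread; for disks tagged [SAT] (SATA drives behind the controller) it uses the sat+megaraid,N device type. It only helps with diagnosis and is not a fix.

```shell
#!/bin/sh
# Hypothetical helper: read smartd warning text on stdin and print a
# matching smartctl command for each "Device:" line that names a
# MegaRAID disk. Leading zeros in the disk number are dropped.
smartctl_cmd_from_warning() {
    sed -n \
        -e 's|^Device: \([^ ]*\) \[megaraid_disk_0*\([0-9][0-9]*\)] \[SAT],.*|smartctl -a -d sat+megaraid,\2 \1|p' \
        -e 's|^Device: \([^ ]*\) \[megaraid_disk_0*\([0-9][0-9]*\)],.*|smartctl -a -d megaraid,\2 \1|p'
}

# Example with the warning line from this post
printf '%s\n' 'Device: /dev/bus/1 [megaraid_disk_05] [SAT], Read SMART Self-Test Log Failed' \
    | smartctl_cmd_from_warning
```

The printed command can then be run as root to see whether the SSD still answers SMART queries at all once its light has gone off.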
 