Hello everyone.
I have a home Proxmox setup on an oldish Supermicro Opteron server:
Supermicro SC846 24 Bay
2 x AMD Opteron Hex Core 2431 @ 2.4 GHz, for a total of 12 cores
49GB RAM
3 x Supermicro SAT2-MV8 SATA controllers (based on the Marvell Hercules-2 Rev. C0 SATA host controller)
I am running the latest VE 5.2-1. Overall the server seems to be OK; it is up and running well. I use ZFS mirror pools for everything.
The boot volume (rpool) is a 2x120GB mirror:
Code:
NAME                           USED  AVAIL  REFER  MOUNTPOINT
rpool                         57.9G  49.6G   104K  /rpool
rpool/ROOT                    46.1G  49.6G    96K  /rpool/ROOT
rpool/ROOT/pve-1              46.1G  49.6G  46.1G  /
rpool/data                    3.32G  49.6G   128K  /rpool/data
rpool/data/subvol-101-disk-2  3.32G  46.7G  3.32G  /rpool/data/subvol-101-disk-2
rpool/swap                    8.50G  57.9G   196M  -
Additional local storage (pvstore) is a 2x1TB HDD mirror:
Code:
NAME                        USED  AVAIL  REFER  MOUNTPOINT
pvstore                    19.6G   879G  13.1G  /pvstore
pvstore/iso                 104K   879G   104K  /pvstore/iso
pvstore/subvol-103-disk-1  3.32G  6.68G  3.32G  /pvstore/subvol-103-disk-1
pvstore/subvol-104-disk-1  1.75G  8.25G  1.75G  /pvstore/subvol-104-disk-1
pvstore/template           1.37G   879G  1.37G  /pvstore/template
There are also 2 ZFS pools, tank0 and tank1, for the main shared storage. These are bind-mounted into an LXC container and shared from there to all users as a NAS.
tank0 is 6x2TB HDD:
Code:
NAME           USED  AVAIL  REFER  MOUNTPOINT
tank0         1.02T  4.24T   977G  /tank0
tank0/share0  71.4G  4.24T  71.4G  /tank0/share0
tank1 is 2x3TB HDD:
Code:
NAME           USED  AVAIL  REFER  MOUNTPOINT
tank1         1.56T  1.07T   728G  /tank1
tank1/share1   871G  1.07T   871G  /tank1/share1
I had several 2TB disks at home, and I also bought 4 drives at my office.
The 4 drives are DELL (Seagate) ata-ST2000NM0055-1V4104 models.
They are all connected to the same controller card.
In the last 2 weeks I lost 2 of them, one after the other.
When I run zpool status I see that tank0 is degraded, with drive ATAXXXXXXXXXXX not available.
Since tank0 is 3 vdevs of 2 disks each in mirror mode, no data loss happens, and as I have several 2TB spares I just replace the disk and resilver. But the dropped disks are still online and do not show any issues.
I can redo the partition table and they work again.
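For reference, the replace step I run looks roughly like the sketch below. The pool name tank0 is from my setup, but the two device IDs are placeholders and not the real ones, and the helper names (run, replace_disk) are only for illustration. By default it only prints the commands; nothing is changed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: replace a dropped mirror member in tank0 and check the resilver.
# OLD_DISK and NEW_DISK are placeholders (assumptions), not real device IDs.
POOL="tank0"
OLD_DISK="/dev/disk/by-id/ata-OLD_DISK_PLACEHOLDER"
NEW_DISK="/dev/disk/by-id/ata-NEW_DISK_PLACEHOLDER"
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

replace_disk() {
    # Show which device the pool reports as unavailable.
    run zpool status -v "$POOL"
    # Swap the dropped disk for the spare; ZFS starts the resilver itself.
    run zpool replace "$POOL" "$OLD_DISK" "$NEW_DISK"
    # Check resilver progress.
    run zpool status -v "$POOL"
}

replace_disk
```

Using the /dev/disk/by-id names rather than /dev/sdX is deliberate here, since the sdX letters can change between boots on a box with this many disks.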
How can I trace the issue?
What is the best way to burn-in test the disks before use?
Thanks.
Here is the SMART report from the last failed disk:
Code:
# smartctl -a /dev/sdn
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.17-2-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST2000NM0055-1V4104
Serial Number: ZXXXXXX8
LU WWN Device Id: 5 000c50 0a381b3d2
Add. Product Id: DELL(tm)
Firmware Version: DA05
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Jun 13 11:42:52 2018 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 90) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 191) minutes.
Conveyance self-test routine
recommended polling time: ( 3) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x010f 082 065 044 Pre-fail Always - 168192984
3 Spin_Up_Time 0x0103 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 8
5 Reallocated_Sector_Ct 0x0133 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 089 060 045 Pre-fail Always - 713238413
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4794 (52 162 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 8
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 067 063 040 Old_age Always - 33 (Min/Max 24/37)
191 G-Sense_Error_Rate 0x0032 097 097 000 Old_age Always - 6781
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 7
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 846
194 Temperature_Celsius 0x0022 033 040 000 Old_age Always - 33 (0 19 0 0 0)
195 Hardware_ECC_Recovered 0x001a 011 001 000 Old_age Always - 168192984
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 1776 (179 221 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 595494286
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 305256200
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
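The report above shows that no self-test has ever been logged on this disk, so the burn-in I have in mind so far is roughly the sketch below. This is only my rough plan under some assumptions, not a known-good procedure: /dev/sdX is a placeholder for a disk that holds no data (the badblocks -w pass is destructive), the 191 minute wait is taken from the extended self-test polling time in the report, and the helper names (run, burn_in) are made up for illustration. It prints the commands by default and only runs them with DRY_RUN=0.

```shell
#!/bin/sh
# Sketch of a burn-in pass for a disk that holds NO data.
# DISK is a placeholder (assumption); badblocks -w overwrites the whole disk.
DISK="${DISK:-/dev/sdX}"
DRY_RUN="${DRY_RUN:-1}"
# Extended self-test polling time from the SMART report: 191 minutes.
WAIT_SECONDS=$((191 * 60))

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

burn_in() {
    # 1. Snapshot of the SMART data before testing, for later comparison.
    run smartctl -a "$DISK"
    # 2. Destructive write and verify pass over the whole surface.
    run badblocks -b 4096 -wsv "$DISK"
    # 3. Start the drive's extended self-test; it runs inside the drive.
    run smartctl -t long "$DISK"
    # 4. Wait for the recommended polling time before reading the result.
    run sleep "$WAIT_SECONDS"
    # 5. Read the self-test log and the attribute table afterwards.
    run smartctl -l selftest "$DISK"
    run smartctl -A "$DISK"
}

burn_in
```

After a run I would compare Reallocated_Sector_Ct, Current_Pending_Sector and UDMA_CRC_Error_Count against the first snapshot. I would still like to hear whether this is a reasonable approach.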