rapid ssd wear out!

zenny

Hi,

I have two machines (in different locations) with exactly the same hardware, both running the same Proxmox version (4). The first machine was deployed two years before the second, and it handles more network traffic than the second.

Strangely, smartctl reports that the SSD used for ZIL and L2ARC in the second, later-deployed server is completely worn out, while the one in the first server is intact. However, 'zpool status' shows no errors, and the log and cache devices are ONLINE, so I suspect smartctl may be reporting a false positive. 'systemctl status' shows datapool as 'degraded', but 'zpool status' shows no issues (ONLINE)!

Details below:

FIRSTOLDSERVER = 2 HDDs, single rpool + SSD for ZIL and L2ARC
SSDRAPIDWEAROUT = 4 HDDs, striped rpool and striped datapool + SSD for ZIL and L2ARC

The outputs are below:

Code:
2018-06-19 09:11:46 root@FIRSTOLDSERVER:[~]:$ lsblk -dt /dev/sd?
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE  RA WSAME
sda          0    512      0     512     512    0 deadline     128 128    0B
sdb          0    512      0     512     512    1 deadline     128 128    0B
sdc          0    512      0     512     512    1 deadline     128 128    0B


2018-06-19 09:11:55 root@SSDRAPIDWEAROUT:[~]:$ lsblk -dt /dev/sd?
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE  RA WSAME
sda          0    512      0     512     512    0 deadline     128 128    0B
sdb          0   4096      0    4096     512    1 noop         128 128    0B
sdc          0    512      0     512     512    1 deadline     128 128    0B
sdd          0   4096      0    4096     512    1 noop         128 128    0B
sde          0    512      0     512     512    1 deadline     128 128    0B

Code:
2018-06-19 09:13:23 root@FIRSTOLDSERVER:[~]:$ zdb | grep ashift
            ashift: 12
            ashift: 9


2018-06-19 09:13:37 root@SSDRAPIDWEAROUT:[~]:$ zdb | grep ashift
            ashift: 12
            ashift: 12
            ashift: 12
            ashift: 9

Code:
2018-06-19 09:23:04 root@FIRSTOLDSERVER:[~]:$ zpool iostat
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        231G  1.59T     21     22   574K   140K

2018-06-19 10:25:04 root@SSDRAPIDWEAROUT:[~]:$ zpool iostat
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
datapool   116G  3,51T      1      7  37,9K  25,2K
rpool       77,5G  1,74T      2     44  25,3K   120K
----------  -----  -----  -----  -----  -----  -----
 
What type of SSDs is this? Could you please post a smartctl -a on the disks?

The reportedly failing drive:

Code:
2018-06-19 14:26:55 root@SSDRAPIDWEAROUT:[~]:$ smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.83-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     KINGSTON SV300S37A120G
Serial Number:    50026B785201DF22
LU WWN Device Id: 5 0026b7 85201df22
Firmware Version: 583ABBF0
User Capacity:    120 034 123 776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 19 14:27:09 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0025) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   095   095   050    Old_age   Always       -       1/65734524
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   091   091   000    Old_age   Always       -       8501h+17m+08.070s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       73
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       0
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       100
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   099   099   000    Old_age   Always       -       1
189 Airflow_Temperature_Cel 0x0000   032   036   000    Old_age   Offline      -       32 (Min/Max 21/36)
194 Temperature_Celsius     0x0022   032   036   000    Old_age   Always       -       32 (Min/Max 21/36)
195 ECC_Uncorr_Error_Count  0x001c   112   112   000    Old_age   Offline      -       1/65734524
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   112   112   000    Old_age   Offline      -       1/65734524
204 Soft_ECC_Correct_Rate   0x001c   112   112   000    Old_age   Offline      -       1/65734524
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   001   001   010    Pre-fail  Always   FAILING_NOW 4294967297
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       6173
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       0
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       0
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       0

SMART Error Log not supported

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The older SSD, which has been deployed longer and is still intact:

Code:
2018-06-19 10:34:41 root@FIRSTOLDSERVER:[~]:$ smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.117-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     KINGSTON SV300S37A120G
Serial Number:    50026B785201E0E9
LU WWN Device Id: 5 0026b7 85201e0e9
Firmware Version: 583ABBF0
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 19 14:31:21 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0025) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   095   095   050    Old_age   Always       -       0/85515543
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   071   071   000    Old_age   Always       -       26124h+03m+37.820s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       12
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       99
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
189 Airflow_Temperature_Cel 0x0000   029   042   000    Old_age   Offline      -       29 (Min/Max 19/42)
194 Temperature_Celsius     0x0022   029   042   000    Old_age   Always       -       29 (Min/Max 19/42)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/85515543
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/85515543
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/85515543
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   097   097   010    Pre-fail  Always       -       1
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       14966
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       3726
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       3726
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       131

SMART Error Log not supported

SMART Self-test log structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
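
To see which attribute is behind the FAILED verdict, it can help to compare each attribute's normalized VALUE with its THRESH. The Python sketch below is only an illustration: the function name `failed_attributes` is made up for this example, the sample rows are copied from the two outputs above, and it assumes the usual SMART rule that an attribute has failed when VALUE is at or below a non-zero THRESH.

```python
# Rows copied from the "smartctl -a" attribute tables posted in this thread.
FAILING_DRIVE = """\
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   001   001   010    Pre-fail  Always   FAILING_NOW 4294967297
"""

HEALTHY_DRIVE = """\
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   097   097   010    Pre-fail  Always       -       1
"""


def failed_attributes(table):
    """Return the names of attributes whose VALUE is at or below THRESH.

    Hypothetical helper for illustration; expects rows in the column
    layout printed by "smartctl -A" (ID, NAME, FLAG, VALUE, WORST, THRESH, ...).
    """
    failed = []
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        # Assumption: a threshold of 0 means the attribute never trips.
        if thresh > 0 and value <= thresh:
            failed.append(name)
    return failed


print(failed_attributes(FAILING_DRIVE))
print(failed_attributes(HEALTHY_DRIVE))
```

On the failing drive only SSD_Life_Left (VALUE 001 against THRESH 010) trips the check; the retired-block and reallocation counters are still at 100. The verdict therefore comes from the drive's own wear estimate, not from errors ZFS would notice, which may be why 'zpool status' still shows everything ONLINE.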
 
The Kingston SSDNow V300 120GB, SATA (SV300S37A/120G) is rated for 64 TBW.
IMHO these types of SSDs are not suitable as a caching device.

Edit: You use twice as many OSDs in the rapid-wearout server. Basically, you have created 4 times as many writes to the rapid-wearout server's OSD compared to the old server, assuming they produce the same amount of writes.
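
To put the 64 TBW rating in perspective, here is a rough back-of-the-envelope calculation in Python using the numbers the older SSD reports (Lifetime_Writes_GiB 3726, 26124 power-on hours). It is only a sketch: the function names are invented for this example, it assumes the rating means decimal terabytes, and it counts host writes only, so write amplification inside the drive is ignored.

```python
GIB = 2**30
TB = 10**12  # assumption: the TBW rating is given in decimal terabytes

RATED_TBW = 64            # rating quoted in the reply above
HOST_WRITES_GIB = 3726    # Lifetime_Writes_GiB of the older SSD
POWER_ON_HOURS = 26124    # Power_On_Hours_and_Msec of the older SSD


def endurance_used(host_writes_gib, rated_tbw):
    """Fraction of the rated host writes consumed so far."""
    return host_writes_gib * GIB / (rated_tbw * TB)


def hours_to_rated_limit(host_writes_gib, power_on_hours, rated_tbw):
    """Total power-on hours needed to reach the rating,
    assuming the average write rate so far stays constant."""
    return power_on_hours / endurance_used(host_writes_gib, rated_tbw)


used = endurance_used(HOST_WRITES_GIB, RATED_TBW)
total_hours = hours_to_rated_limit(HOST_WRITES_GIB, POWER_ON_HOURS, RATED_TBW)

print(f"{used:.1%} of the rated host writes consumed")
print(f"about {total_hours / 8766:.0f} years of operation to reach the rating")
```

By this measure the older SSD has used roughly 6% of its rated writes after about three years. The same calculation cannot be done for the failing drive, because its Lifetime_Writes_GiB reads 0 in the output above.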
 