[SOLVED] Disk Wearout jumped alarmingly

keeka

I have LVM-thin provisioned on a 1 TB WD Blue SSD. It has been in use for about 10 months and, until recently, has always reported 0% Wearout in the PVE web UI.
Recently, this has jumped to 98%! The disk is purportedly good for 400 TBW and I don't believe I have overtaxed it. It's at about 25% capacity; I have done a few dozen VM/CT restores from snapshots and backups, but nothing beyond a normal home-lab workload IMO. All the VMs have modest disk size/write requirements.

AIUI, the disk wearout reported in the PVE web UI is `100 - Media_Wearout_Indicator`. However, since monitoring started (shortly after installation) that attribute's normalized value has mostly been 1, rising recently to 2, and the PVE UI reported 0% Wearout until very recently. What has increased steadily, albeit with a few jumps, is the raw value of the attribute. I have no idea what the raw value actually represents or how the normalized value (currently 2) relates to it.

So I'm wondering: is the disk really near end of life already, and how do we interpret that raw value (0x020b0032020b)?
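
For what it's worth, vendors often pack several counters into that 48-bit raw field. Purely as an illustration (I don't know WD's actual encoding for attribute 230), it can at least be split into 16-bit words for inspection:

Code:
# Illustrative only: split a 48-bit SMART raw value into three 16-bit words.
# The field layout WD actually uses for attribute 230 is unknown to me.
raw = 0x020B0032020B

words = [(raw >> shift) & 0xFFFF for shift in (32, 16, 0)]
print([hex(w) for w in words])  # ['0x20b', '0x32', '0x20b']
print(words)                    # [523, 50, 523]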

I'd be grateful for any insight and advice on the state of the SSD!

Many thanks.

smartctl currently reports:

Code:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.73-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     WD Blue and Green SSDs
Device Model:     WDC  WDS100T2B0A-00SM50
Serial Number:    194526801184
LU WWN Device Id: 5 001b44 8b1149bcb
Firmware Version: 411030WD
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov 29 14:17:08 2020 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x11) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  10) minutes.

SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       7526
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       124
165 Block_Erase_Count       0x0032   100   100   ---    Old_age   Always       -       102106146
166 Minimum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       1
167 Max_Bad_Blocks_per_Die  0x0032   100   100   ---    Old_age   Always       -       72
168 Maximum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       14
169 Total_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       609
170 Grown_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       0
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
173 Average_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       5
174 Unexpected_Power_Loss   0x0032   100   100   ---    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   071   040   ---    Old_age   Always       -       29 (Min/Max 14/40)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
230 Media_Wearout_Indicator 0x0032   002   002   ---    Old_age   Always       -       0x020b0032020b
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 NAND_GB_Written_TLC     0x0032   100   100   ---    Old_age   Always       -       5542
234 NAND_GB_Written_SLC     0x0032   100   100   ---    Old_age   Always       -       16535
241 Host_Writes_GiB         0x0030   253   253   ---    Old_age   Offline      -       13958
242 Host_Reads_GiB          0x0030   253   253   ---    Old_age   Offline      -       8508
244 Temp_Throttle_Status    0x0032   000   100   ---    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      7515         -
# 2  Short offline       Completed without error       00%      7491         -
# 3  Short offline       Completed without error       00%      7467         -
# 4  Extended offline    Completed without error       00%      7445         -
# 5  Short offline       Completed without error       00%      7421         -
# 6  Short offline       Completed without error       00%      7397         -
# 7  Short offline       Completed without error       00%      7373         -
# 8  Short offline       Completed without error       00%      7348         -
# 9  Short offline       Completed without error       00%      7324         -
#10  Short offline       Completed without error       00%      7300         -
#11  Extended offline    Completed without error       00%      7276         -
#12  Short offline       Completed without error       00%      7252         -
#13  Short offline       Completed without error       00%      7228         -
#14  Short offline       Completed without error       00%      7204         -
#15  Short offline       Completed without error       00%      7180         -
#16  Short offline       Completed without error       00%      7156         -
#17  Short offline       Completed without error       00%      7132         -
#18  Extended offline    Completed without error       00%      7108         -
#19  Short offline       Completed without error       00%      7084         -
#20  Short offline       Completed without error       00%      7059         -
#21  Short offline       Completed without error       00%      7035         -

Selective Self-tests/Logging not supported
 
This is just a side effect of a recent change in how we collect that information: https://lists.proxmox.com/pipermail/pve-devel/2020-October/045643.html

Previously we fell back to attribute ID 233 (which is wrong in your case), but now we select based on the attribute label.

Sadly, Western Digital seems to have reversed the logic of that field (normally the value starts at 100 and goes down as the drive wears out, but in your case it counts up).

I am afraid there is no good solution here, as I am really opposed to having a vendor/drive-specific mapping that we would have to update constantly ...
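
Roughly, the logic described here is: find a wear-related attribute by its label in the smartctl output and display 100 minus its normalized value. A minimal sketch of that idea in Python (not the actual PVE code, which is Perl; the label list and the smartctl --json field names here are assumptions):

Code:
import json
import subprocess

# Minimal sketch of the selection logic described above, not the actual
# PVE implementation. The label list and JSON paths are assumptions.
WEAR_LABELS = ("Media_Wearout_Indicator", "Wear_Leveling_Count",
               "Percent_Lifetime_Remain", "SSD_Life_Left")

def wearout_percent(device):
    proc = subprocess.run(["smartctl", "-A", "--json", device],
                          capture_output=True, text=True)
    data = json.loads(proc.stdout)
    table = data.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        if attr.get("name") in WEAR_LABELS:   # select by label, not a fixed ID
            return 100 - attr["value"]        # normalized value, usually 100 counting down
    return None                               # no wear-related attribute found

print(wearout_percent("/dev/sda"))  # e.g. prints 98 for the WD Blue above

For the WD Blue above this picks attribute 230 (Media_Wearout_Indicator), whose normalized value counts up from 1 rather than down from 100, hence the alarming-looking percentage.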
 
Thank you Dominik. That is good to know. A shame WD do not conform to the standard. Many thanks for explaining it. I will hold fire on buying a new SSD for a while, though I doubt it will be WD next time.
 
conform to the standard
The problem is that SMART values are not standardized, so it is not really Western Digital's fault, and I am sure other vendors do it the same way...
 
Given that the attribute refers to wearout, it should intuitively increase over time. Perhaps attribute 233 and Media_Wearout_Indicator could both be treated that way. I wonder, if it is standard across Western Digital drives, whether it could be catered for?

EDIT: OK, there's no standard, but AIUI the convention is for attribute values to decrement from 100, so WD are departing from convention in that sense.
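
To put numbers on that (illustrative only, not from any datasheet): a drive following the usual convention might report a normalized value of 090 once it has used roughly 10% of its rated endurance, which PVE shows as 100 - 90 = 10% wearout. This WD instead reports a value of 002 while still nearly new, and the same formula turns that into 98%.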
 
Thank you for all the information
I had the same issue. Since I had two WD Blue SSDs of the same model in a ZFS RAID 1 in the same machine, with one displaying 0% and the other 98%, I was under the impression I had a defective drive and went through the whole process of replacing one ZFS member. This was a new host without any VMs on it, so good practice for the future. This was before I found this thread.
But what I now realize is that the Proxmox GUI will show 0% until there is some wearout and will then show 99% and lower, so it would make sense to have two devices with one showing 0% and the other 98% (meaning only 2% wearout). Just a bit confusing when you are not used to it ;-)
 
Hello, we need help also. We have two LVM storages on Samsung SSD 870 QVO drives, and when we check their wearout, one shows 55% and the other shows 1%. That said, we now experience a lot of backup failures for our CTs and VMs, and we use LVM on the disks the backups run on. How can we get the backups to work reliably, and should we use LVM or ZFS to avoid such errors? See the attached screenshots of the disks and the backup errors.
 

Attachments

  • Screenshot from 2023-04-18 13-15-38.png
  • Screenshot from 2023-04-18 13-16-07.png
  • Screenshot from 2023-04-18 13-24-02.png
  • Screenshot from 2023-04-18 13-24-34.png
Don't use QLC drives; better yet, use enterprise SSDs (with PLP), or use spinning HDDs instead (as they perform more consistently than QLC drives). Please search the forum for all the problems with QLC.
 
