NVMe disk wearout

Ihor

Member
Mar 31, 2020
Hi.

I'm seeing a gradual increase in the "wearout" indicator over the last week (please see the attached screenshot), but I can't find anything about it in the smartctl report.
What does the wearout indicator mean in my case? Is the wearout value the same as the "Percentage Used" value?
pve version 6.1-7

Smartctl output:

root@o1-ger:~# smartctl -a /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: WDC CL SN720 SDAQNTW-1T00-2000
Serial Number: 1851AF802923
Firmware Version: 10109122
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity: 0
Controller ID: 8215
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 8b441db785
Local Time is: Tue Mar 31 08:30:53 2020 UTC
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     3.50W       -        -    1  1  1  1        0       0
 2 +     3.00W       -        -    2  2  2  2        0       0
 3 -   0.1000W       -        -    3  3  3  3     4000   10000
 4 -   0.0025W       -        -    4  4  4  4     4000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 13,544,919 [6.93 TB]
Data Units Written: 36,727,753 [18.8 TB]
Host Read Commands: 115,301,440
Host Write Commands: 356,914,859
Controller Busy Time: 784
Power Cycles: 13
Power On Hours: 1,299
Unsafe Shutdowns: 10
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

nvme output:

root@o1-ger:~# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 37 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 6%
data_units_read : 13544922
data_units_written : 36731192
host_read_commands : 115301452
host_write_commands : 356938772
controller_busy_time : 784
power_cycles : 13
power_on_hours : 1299
unsafe_shutdowns : 10
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0

-----
Best regards.
 

Attachments

  • Screenshot_o1.png (88.4 KB)
What does the wearout indicator mean in my case? Is the wearout value the same as the "Percentage Used" value?
Yes, that should be the case. It indicates how much of the estimated lifetime of the SSD has been used. The lifetime of an SSD is usually limited by the number of writes the memory cells can handle.
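If you want to read it directly instead of via the GUI, here is a minimal sketch, assuming smartctl 7.0+ (which added JSON output) and jq are installed, and the /dev/nvme0 path from your post; the JSON key name is what current smartctl versions emit for NVMe devices and may differ between releases:

smartctl -a /dev/nvme0 | grep 'Percentage Used'
# or, machine-readable:
smartctl -j -a /dev/nvme0 | jq '.nvme_smart_health_information_log.percentage_used'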
 
Yes, that should be the case. It indicates how much of the estimated lifetime of the SSD has been used. The lifetime of an SSD is usually limited by the number of writes the memory cells can handle.

Could you explain the "Percentage Used" value? Maybe it's the percentage of disk space used?
 
The memory cells in an SSD can only endure a limited number of write operations before they fail.
The "percentage_used" indicator shows exactly that. Additionally, the "available_spare" parameter indicates how many of the spare memory cells are still available.

As long as "available_spare" has not dropped to the "available_spare_threshold" (10% for your disk) and "percentage_used" has not climbed to 100%, the SSD is still fully functional.

Once you reach those limits, though, you should think about replacing the SSD. Another indicator is "critical_warning", the first line in the smart-log output: it should stay at 0. A simple check could look like the sketch below.
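
A minimal shell sketch of such a check, assuming the nvme-cli output format shown above (the /dev/nvme0 path comes from your post; the 10%/100% thresholds are the rule of thumb from this reply, not a vendor specification):

LOG=$(nvme smart-log /dev/nvme0)
# pull a single value out of a "name : value" line, dropping spaces and '%';
# the trailing space in 'available_spare ' skips the _threshold line
WARN=$(echo "$LOG"  | awk -F: '/^critical_warning/ {gsub(/[ %]/,"",$2); print $2}')
SPARE=$(echo "$LOG" | awk -F: '/^available_spare / {gsub(/[ %]/,"",$2); print $2}')
USED=$(echo "$LOG"  | awk -F: '/^percentage_used/  {gsub(/[ %]/,"",$2); print $2}')
[ "$WARN" != "0" ]  && echo "critical_warning is $WARN - check the disk now"
[ "$SPARE" -le 10 ] && echo "available_spare at $SPARE% - plan a replacement"
[ "$USED" -ge 100 ] && echo "percentage_used at $USED% - plan a replacement"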

It is hard to find a good, legitimate source on how to interpret these values that I could link to :/
 
It is hard to find a good, legitimate source on how to interpret these values that I could link to :/
the spec would be a good start ;)
https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf

chapter 5.14.1.2 SMART / Health Information (Log Identifier 02h)
page 122

Percentage Used:
Contains a vendor specific estimate of the percentage of NVM subsystem life used based on the actual usage and the manufacturer’s prediction of NVM life. A value of 100 indicates that the estimated endurance of the NVM in the NVM subsystem has been consumed, but may not indicate an NVM subsystem failure. The value is allowed to exceed 100. Percentages greater than 254 shall be represented as 255. This value shall be updated once per power-on hour (when the controller is not in a sleep state). Refer to the JEDEC JESD218A standard for SSD device life and endurance measurement techniques.
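
Given that definition, a rough linear extrapolation from the numbers in the first post is possible. This is only a back-of-the-envelope sketch; the vendor's estimate is not guaranteed to scale linearly with writes or hours:

# 6% of the rated life consumed after ~18.8 TB written and 1,299 power-on hours:
awk 'BEGIN { printf "projected endurance: %.0f TB written\n", 18.8 / 0.06 }'
awk 'BEGIN { printf "projected lifetime:  %.0f power-on hours\n", 1299 / 0.06 }'
# -> roughly 313 TB written, or about 21,650 power-on hours at the current write rate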
 
