NVMe disk wearout

Ihor

Member
Mar 31, 2020
Hi.

I'm seeing a gradual increase in the "wearout" indicator over the last week (please see the attached screenshot), but I can't find anything about it in the smartctl report.
What does the wearout indicator mean in my case? Is the wearout value the same as the "Percentage Used" value?
pve version 6.1-7

Smartctl output:

root@o1-ger:~# smartctl -a /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: WDC CL SN720 SDAQNTW-1T00-2000
Serial Number: 1851AF802923
Firmware Version: 10109122
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity: 0
Controller ID: 8215
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 8b441db785
Local Time is: Tue Mar 31 08:30:53 2020 UTC
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     3.50W       -        -    1  1  1  1        0       0
 2 +     3.00W       -        -    2  2  2  2        0       0
 3 -   0.1000W       -        -    3  3  3  3     4000   10000
 4 -   0.0025W       -        -    4  4  4  4     4000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 13,544,919 [6.93 TB]
Data Units Written: 36,727,753 [18.8 TB]
Host Read Commands: 115,301,440
Host Write Commands: 356,914,859
Controller Busy Time: 784
Power Cycles: 13
Power On Hours: 1,299
Unsafe Shutdowns: 10
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

nvme output:

root@o1-ger:~# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 37 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 6%
data_units_read : 13544922
data_units_written : 36731192
host_read_commands : 115301452
host_write_commands : 356938772
controller_busy_time : 784
power_cycles : 13
power_on_hours : 1299
unsafe_shutdowns : 10
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0

-----
Best regards.
 

Attachments

  • Screenshot_o1.png (88.4 KB)
What does the wearout indicator mean in my case? Is the wearout value the same as the "Percentage Used" value?
Yes, that should be the case. It indicates how much of the estimated lifetime of the SSD has been used. The lifetime of an SSD is usually limited by the number of writes the memory cells can handle.
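If you want to read it directly instead of via the GUI, here is a minimal sketch, assuming smartctl 7.0+ (which added JSON output) and jq are installed, and the /dev/nvme0 path from your post; the JSON key name is what current smartctl versions emit for NVMe devices and may differ between releases:

smartctl -a /dev/nvme0 | grep 'Percentage Used'
# or, machine-readable:
smartctl -j -a /dev/nvme0 | jq '.nvme_smart_health_information_log.percentage_used'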
 
Yes, that should be the case. It indicates how much of the estimated lifetime of the SSD has been used. The lifetime of an SSD is usually limited by the number of writes the memory cells can handle.

Could you explain the "Percentage Used" value? Maybe it's the percentage of disk space used?
 
The memory cells in an SSD can only endure a limited number of write operations before they fail.
The "percentage_used" indicator shows exactly that. Additionally, the "available_spare" parameter indicates how many of the spare memory cells are still available.

As long as "available_spare" has not dropped to the "available_spare_threshold" (10% for your disk) and "percentage_used" has not climbed to 100%, the SSD is still fully functional.

Once you reach those limits, though, you should think about replacing the SSD. Another indicator is "critical_warning", the first line in the smart-log output: it should stay at 0. A simple check could look like the sketch below.
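
A minimal shell sketch of such a check, assuming the nvme-cli output format shown above (the /dev/nvme0 path comes from your post; the 10%/100% thresholds are the rule of thumb from this reply, not a vendor specification):

LOG=$(nvme smart-log /dev/nvme0)
# pull a single value out of a "name : value" line, dropping spaces and '%';
# the trailing space in 'available_spare ' skips the _threshold line
WARN=$(echo "$LOG"  | awk -F: '/^critical_warning/ {gsub(/[ %]/,"",$2); print $2}')
SPARE=$(echo "$LOG" | awk -F: '/^available_spare / {gsub(/[ %]/,"",$2); print $2}')
USED=$(echo "$LOG"  | awk -F: '/^percentage_used/  {gsub(/[ %]/,"",$2); print $2}')
[ "$WARN" != "0" ]  && echo "critical_warning is $WARN - check the disk now"
[ "$SPARE" -le 10 ] && echo "available_spare at $SPARE% - plan a replacement"
[ "$USED" -ge 100 ] && echo "percentage_used at $USED% - plan a replacement"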

It is hard to find a good, legitimate source on how to interpret these values that I could link to :/
 
It is hard to find a good, legitimate source on how to interpret these values that I could link to :/
the spec would be a good start ;)
https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf

chapter 5.14.1.2 SMART / Health Information (Log Identifier 02h)
page 122

Percentage Used:
Contains a vendor specific estimate of the percentage of NVM subsystem life used based on the actual usage and the manufacturer’s prediction of NVM life. A value of 100 indicates that the estimated endurance of the NVM in the NVM subsystem has been consumed, but may not indicate an NVM subsystem failure. The value is allowed to exceed 100. Percentages greater than 254 shall be represented as 255. This value shall be updated once per power-on hour (when the controller is not in a sleep state). Refer to the JEDEC JESD218A standard for SSD device life and endurance measurement techniques.
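
Given that definition, a rough linear extrapolation from the numbers in the first post is possible. This is only a back-of-the-envelope sketch; the vendor's estimate is not guaranteed to scale linearly with writes or hours:

# 6% of the rated life consumed after ~18.8 TB written and 1,299 power-on hours:
awk 'BEGIN { printf "projected endurance: %.0f TB written\n", 18.8 / 0.06 }'
awk 'BEGIN { printf "projected lifetime:  %.0f power-on hours\n", 1299 / 0.06 }'
# -> roughly 313 TB written, or about 21,650 power-on hours at the current write rate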
 
