SMART/Health failure on Ceph install

jvdh

Feb 24, 2024
I have been running a small (3-node) homelab Proxmox cluster with Ceph for almost 5 months.
It has been running great and I have learned a lot on many levels.

So far so good, but yesterday I noticed that two of the three 1TB Samsung SSD Pro NVMe drives were reported as degraded because their Percentage Used value exceeds 100%.
By now I know that running consumer-grade SSDs is not advised for Ceph and that lifespan can be affected negatively.
However, I would expect wear to be spread fairly evenly over the three nodes. Data Units Written/Read are roughly equal on all three SSDs, yet the wearout differs hugely (6%, 150%, 255%). See all figures below.
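To put numbers on that comparison, here is a minimal Python sketch using the figures from the SMART output in this post. It assumes the usual NVMe convention that one data unit is 1000 blocks of 512 bytes, and the drive labels are made up because the output does not name the nodes. As far as I understand the NVMe spec, Percentage Used is capped at 255, so the third drive's real value could be even higher.

```python
# Figures copied from the SMART output in this post.
# The drive labels are hypothetical; the post does not name the nodes.
DRIVES = {
    "drive-1": {"pct_used": 150, "units_written": 40_540_558},
    "drive-2": {"pct_used": 6, "units_written": 39_984_582},
    "drive-3": {"pct_used": 255, "units_written": 40_227_753},
}

# Assumption: one NVMe "data unit" is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def terabytes_written(units):
    """Convert a Data Units Written counter to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def write_spread(drives):
    """Ratio between the most and least written drive."""
    written = [d["units_written"] for d in drives.values()]
    return max(written) / min(written)


def wear_spread(drives):
    """Ratio between the highest and lowest Percentage Used."""
    used = [d["pct_used"] for d in drives.values()]
    return max(used) / min(used)


if __name__ == "__main__":
    for name, d in DRIVES.items():
        tb = terabytes_written(d["units_written"])
        print(f"{name}: {tb:.2f} TB written, {d['pct_used']}% used")
    print(f"write spread: {write_spread(DRIVES):.3f}x")
    print(f"wear spread:  {wear_spread(DRIVES):.1f}x")
```

The host writes differ by under 2% between the drives, while the reported wear differs by a factor of more than 40, so the write volume alone does not explain the gap.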

I did some searching and found complaints about degradation of Samsung NVMe SSDs, especially the 990 series.
This leaves me with the following questions:
1. Is this my own fault because of the components I used?
2. Shouldn't wearout be pretty much identical on all 3 SSDs in the cluster, especially considering that the number of writes and reads is nearly identical?
3. Does anyone have suggestions for affordable NVMe SSDs that can handle the load from Ceph?

Thanks in advance to anyone willing to read and hopefully respond to my questions!

- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 51 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 150%
Data Units Read: 36,270,220 [18.5 TB]
Data Units Written: 40,540,558 [20.7 TB]
Host Read Commands: 587,321,451
Host Write Commands: 2,212,358,200
Controller Busy Time: 32,193
Power Cycles: 17
Power On Hours: 2,873
Unsafe Shutdowns: 7
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 51 Celsius
Temperature Sensor 2: 68 Celsius
---------------------------------------------------
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 52 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 34,270,768 [17.5 TB]
Data Units Written: 39,984,582 [20.4 TB]
Host Read Commands: 544,967,927
Host Write Commands: 2,168,962,757
Controller Busy Time: 25,628
Power Cycles: 18
Power On Hours: 2,645
Unsafe Shutdowns: 9
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 52 Celsius
Temperature Sensor 2: 62 Celsius
----------------------------------------------

- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 53 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 255%
Data Units Read: 33,014,813 [16.9 TB]
Data Units Written: 40,227,753 [20.5 TB]
Host Read Commands: 614,663,963
Host Write Commands: 2,234,041,872
Controller Busy Time: 28,858
Power Cycles: 26
Power On Hours: 3,375
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 53 Celsius
Temperature Sensor 2: 65 Celsius
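
For anyone reading the output above: the Critical Warning value is a bit field, and 0x04 is what produces the "NVM subsystem reliability has been degraded" line. Below is a rough Python sketch of how the pasted text could be parsed and the field decoded. The bit meanings are my reading of the NVMe SMART/Health log definition, the function names are my own, and the sample is just a few lines copied from the first drive.

```python
# Bit meanings for the Critical Warning field, as I understand the NVMe
# SMART/Health log page; treat the wording as a paraphrase, not a quote.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return the descriptions of all bits set in the warning byte."""
    return [
        text
        for bit, text in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]


def parse_smart_block(text):
    """Split 'Key: Value' lines of smartctl text output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, val = line.partition(":")
        if sep:
            fields[key.strip()] = val.strip()
    return fields


# A few lines copied from the first drive's output in this post.
SAMPLE = """Critical Warning: 0x04
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 150%"""

if __name__ == "__main__":
    fields = parse_smart_block(SAMPLE)
    warning = int(fields["Critical Warning"], 16)
    print(decode_critical_warning(warning))
    print("Percentage Used:", fields["Percentage Used"])
```

Note that Available Spare is still 100% on all three drives, so only the reliability bit is set, not the spare-capacity bit.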
 
