SMART/Health failure on Ceph install

jvdh · 2024-11-05T17:04:03+0100

I have been running a small (3-node) homelab Proxmox cluster with Ceph for almost 5 months.
It has been running great and I have learned a lot on many levels.

So far so good, but yesterday I noticed that two of the three 1TB Samsung SSD Pro NVMe drives were in degraded mode because their Percentage Used value exceeds 100%.
By now I know that running consumer-grade SSDs is not advised for Ceph and that lifespan can be affected negatively.
However, I would expect wearout to be fairly evenly balanced over the three nodes. Data Units Written/Read are roughly even on all three SSDs, yet wearout differs hugely (6%, 150%, 255%). See all figures below.

I did some searching and found complaints about degradation of Samsung NVMe SSDs, especially the 990 series.
This leaves me with the following questions:
1. Is this my own fault because of the components I used?
2. Shouldn't wearout be pretty much identical on all 3 SSDs in the cluster, especially considering that the numbers of writes and reads are fairly similar?
3. Does anyone have suggestions for affordable NVMe SSDs that can handle a Ceph load?

Thanks in advance to anyone willing to read and hopefully respond to my questions!

- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 51 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 150%
Data Units Read: 36,270,220 [18.5 TB]
Data Units Written: 40,540,558 [20.7 TB]
Host Read Commands: 587,321,451
Host Write Commands: 2,212,358,200
Controller Busy Time: 32,193
Power Cycles: 17
Power On Hours: 2,873
Unsafe Shutdowns: 7
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 51 Celsius
Temperature Sensor 2: 68 Celsius
---------------------------------------------------
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 52 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 34,270,768 [17.5 TB]
Data Units Written: 39,984,582 [20.4 TB]
Host Read Commands: 544,967,927
Host Write Commands: 2,168,962,757
Controller Busy Time: 25,628
Power Cycles: 18
Power On Hours: 2,645
Unsafe Shutdowns: 9
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 52 Celsius
Temperature Sensor 2: 62 Celsius
----------------------------------------------

- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 53 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 255%
Data Units Read: 33,014,813 [16.9 TB]
Data Units Written: 40,227,753 [20.5 TB]
Host Read Commands: 614,663,963
Host Write Commands: 2,234,041,872
Controller Busy Time: 28,858
Power Cycles: 26
Power On Hours: 3,375
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 53 Celsius
Temperature Sensor 2: 65 Celsius
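
As a quick sanity check on the figures above, here is a minimal Python sketch that converts Data Units Written into terabytes and compares the three drives. It assumes the usual NVMe convention that one data unit is 1000 blocks of 512 bytes (512,000 bytes), which matches the TB values smartctl prints to within 0.1 TB. The drive labels and helper names are made up for illustration; only the numbers come from the dumps. As far as I understand the NVMe spec, Percentage Used is capped at 255, so the third drive may be reporting the maximum value rather than an exact figure.

```python
# Figures copied from the three SMART dumps above.
# The labels "drive A/B/C" are hypothetical; they are not in the smartctl output.
UNIT_BYTES = 512_000  # assumption: one NVMe data unit = 1000 * 512 bytes

DRIVES = {
    "drive A": {"pct_used": 150, "units_written": 40_540_558, "write_cmds": 2_212_358_200},
    "drive B": {"pct_used": 6, "units_written": 39_984_582, "write_cmds": 2_168_962_757},
    "drive C": {"pct_used": 255, "units_written": 40_227_753, "write_cmds": 2_234_041_872},
}


def written_tb(units: int) -> float:
    """Convert NVMe data units to decimal terabytes."""
    return units * UNIT_BYTES / 1e12


def avg_write_kb(units: int, cmds: int) -> float:
    """Average payload per host write command, in decimal kilobytes."""
    return units * UNIT_BYTES / cmds / 1000


def write_spread(drives: dict) -> float:
    """Relative gap between the most and least written drive."""
    totals = [written_tb(d["units_written"]) for d in drives.values()]
    return (max(totals) - min(totals)) / min(totals)


if __name__ == "__main__":
    for name, d in DRIVES.items():
        tb = written_tb(d["units_written"])
        kb = avg_write_kb(d["units_written"], d["write_cmds"])
        print(f"{name}: {tb:.2f} TB written, ~{kb:.1f} kB per write, "
              f"Percentage Used {d['pct_used']}%")
    print(f"spread in host writes: {write_spread(DRIVES):.1%}")
```

Host writes differ by under 2% across the three drives while Percentage Used ranges from 6% to 255%, so host write volume alone does not explain the gap.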
