SMART/Health failure on Ceph install

jvdh

New Member
Feb 24, 2024
I have been running a small (3-node) homelab Proxmox cluster with Ceph for almost 5 months.
It has been running great and I have learned a lot on many levels.

So far so good, but yesterday I noticed that two of the three 1 TB Samsung Pro NVMe drives were in degraded mode because Percentage Used had exceeded 100%.
By now I know that running consumer-grade SSDs is not advised for Ceph and that lifespan can be affected negatively.
However, I would expect wearout to be fairly evenly balanced over the three nodes. Data Units Written/Read are roughly even on all three SSDs, yet wearout differs hugely (6%, 150%, 255%). See all figures below.

I did some searching and found complaints about degradation of Samsung NVMe SSDs, especially the 990 series.
This leaves me with the following questions:
1. Is this my own fault because of the components I used?
2. Shouldn't wearout be pretty much identical on all three SSDs in the cluster, especially considering that the numbers of writes and reads are fairly identical?
3. Does anyone have affordable suggestions for NVMe SSDs that are able to handle the load from Ceph?

Thanks in advance to anyone willing to read and hopefully respond to my questions!

- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 51 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 150%
Data Units Read: 36,270,220 [18.5 TB]
Data Units Written: 40,540,558 [20.7 TB]
Host Read Commands: 587,321,451
Host Write Commands: 2,212,358,200
Controller Busy Time: 32,193
Power Cycles: 17
Power On Hours: 2,873
Unsafe Shutdowns: 7
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 51 Celsius
Temperature Sensor 2: 68 Celsius
---------------------------------------------------
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 52 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 34,270,768 [17.5 TB]
Data Units Written: 39,984,582 [20.4 TB]
Host Read Commands: 544,967,927
Host Write Commands: 2,168,962,757
Controller Busy Time: 25,628
Power Cycles: 18
Power On Hours: 2,645
Unsafe Shutdowns: 9
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 52 Celsius
Temperature Sensor 2: 62 Celsius
----------------------------------------------

- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 53 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 255%
Data Units Read: 33,014,813 [16.9 TB]
Data Units Written: 40,227,753 [20.5 TB]
Host Read Commands: 614,663,963
Host Write Commands: 2,234,041,872
Controller Busy Time: 28,858
Power Cycles: 26
Power On Hours: 3,375
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 53 Celsius
Temperature Sensor 2: 65 Celsius
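
(For reference: NVMe health logs like the above can be read per drive with smartctl or the nvme CLI. A minimal sketch, assuming the drives show up as /dev/nvme0 on each node; device names may differ on your systems:

smartctl -a /dev/nvme0       # full SMART/identify output for the drive
nvme smart-log /dev/nvme0    # only the NVMe health log shown above
)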
 
I get that this is a complicated question, so let me narrow it down: is it normal to see Percentage Used values of 6%, 150%, and 255% for the same Ceph storage?

Your input is greatly appreciated by this novice.

cheers,
John
 
yesterday I noticed that two of the three 1 TB Samsung Pro NVMe drives were in degraded mode
What does this mean? Post the relevant command and output so there is no confusion about the context.

1. Is this my own fault because of the components I used?
Always.

2. Shouldn't wearout be pretty much identical on all three SSDs in the cluster, especially considering that the numbers of writes and reads are fairly identical?
Not necessarily. The larger your PGs are (i.e. the lower the PG count), the less even the distribution. Also, not all PGs get equal use.
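
A quick way to see how evenly data and PGs actually land on each OSD (a sketch; the pool name "vm-pool" is just a placeholder, use your own):

ceph osd df tree                    # per-OSD size, %used and PG count
ceph osd pool get vm-pool pg_num    # current pg_num for the pool
ceph osd pool autoscale-status      # what the autoscaler would set pg_num to

If PG counts per OSD differ a lot, write load (and therefore wear) will differ too.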

3. Does anyone have affordable suggestions for NVMe SSDs that are able to handle the load from Ceph?
"Affordable" is a fuzzy term. Depending on what you're doing, maybe just move your heavy-IO VMs to a different storage pool and use your NVMe storage for "premium" use cases, e.g. boot devices, databases, etc.
 
I have included the full SMART output for each of the devices.
What other output can I post?

I chose a cluster setup with Ceph mostly to avoid having a SPOF.
VMs: Home Assistant, OPNsense, and some other low-load VMs, LXCs, and Docker.
 