SMART/Health failure on Ceph install

jvdh

New Member
Feb 24, 2024
I have been running a small (3-node) homelab Proxmox cluster with Ceph for almost 5 months.
It has been running great and I have learned a lot on many levels.

So far so good, but yesterday I noticed that two of the three 1 TB Samsung Pro NVMe drives were in degraded mode because Percentage Used had exceeded 100%.
By now I know that running consumer-grade SSDs is not advised for Ceph and that lifespan can be affected negatively.
However, I would expect wearout to be fairly evenly balanced over the three nodes. Data Units Written/Read are roughly even on all three SSDs, yet wearout differs hugely (6%, 150%, 255%). See all figures below.

I did some searching and found complaints about degradation of Samsung NVMe SSDs, especially the 990 series.
This leaves me with the following questions:
1. Is this my own fault because of the components I used?
2. Shouldn't wearout be pretty much identical on all three SSDs in the cluster, especially considering that the numbers of writes and reads are fairly identical?
3. Does anyone have affordable suggestions for NVMe SSDs that are able to handle the load from Ceph?

Thanks in advance to anyone willing to read and hopefully respond to my questions!

- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 51 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 150%
Data Units Read: 36,270,220 [18.5 TB]
Data Units Written: 40,540,558 [20.7 TB]
Host Read Commands: 587,321,451
Host Write Commands: 2,212,358,200
Controller Busy Time: 32,193
Power Cycles: 17
Power On Hours: 2,873
Unsafe Shutdowns: 7
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 51 Celsius
Temperature Sensor 2: 68 Celsius
---------------------------------------------------
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 52 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 6%
Data Units Read: 34,270,768 [17.5 TB]
Data Units Written: 39,984,582 [20.4 TB]
Host Read Commands: 544,967,927
Host Write Commands: 2,168,962,757
Controller Busy Time: 25,628
Power Cycles: 18
Power On Hours: 2,645
Unsafe Shutdowns: 9
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 52 Celsius
Temperature Sensor 2: 62 Celsius
----------------------------------------------

- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 53 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 255%
Data Units Read: 33,014,813 [16.9 TB]
Data Units Written: 40,227,753 [20.5 TB]
Host Read Commands: 614,663,963
Host Write Commands: 2,234,041,872
Controller Busy Time: 28,858
Power Cycles: 26
Power On Hours: 3,375
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 53 Celsius
Temperature Sensor 2: 65 Celsius
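
(For reference: NVMe health logs like the above can be read per drive with smartctl or the nvme CLI. A minimal sketch, assuming the drives show up as /dev/nvme0 on each node; device names may differ on your systems:

smartctl -a /dev/nvme0       # full SMART/identify output for the drive
nvme smart-log /dev/nvme0    # only the NVMe health log shown above
)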
 
I get that this is a complicated question, so let me narrow it down: is it normal to see Percentage Used values of 6%, 150%, and 255% for the same Ceph storage?

Your input is greatly appreciated by this novice.

cheers,
John
 
yesterday I noticed that two of the three 1 TB Samsung Pro NVMe drives were in degraded mode
What does this mean? Post the relevant command and output so there is no confusion about the context.

1. Is this my own fault because of the components I used?
Always.

2. Shouldn't wearout be pretty much identical on all three SSDs in the cluster, especially considering that the numbers of writes and reads are fairly identical?
Not necessarily. The larger your PGs are (i.e. the lower the PG count), the less even the distribution. Also, not all PGs get equal use.
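
A quick way to see how evenly data and PGs actually land on each OSD (a sketch; the pool name "vm-pool" is just a placeholder, use your own):

ceph osd df tree                    # per-OSD size, %used and PG count
ceph osd pool get vm-pool pg_num    # current pg_num for the pool
ceph osd pool autoscale-status      # what the autoscaler would set pg_num to

If PG counts per OSD differ a lot, write load (and therefore wear) will differ too.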

3. Does anyone have affordable suggestions for NVMe SSDs that are able to handle the load from Ceph?
"Affordable" is a fuzzy term. Depending on what you're doing, maybe just move your heavy-IO VMs to a different storage pool and use your NVMe storage for "premium" use cases, e.g. boot devices, databases, etc.
 
I have included the full SMART output for each of the devices.
What other output can I post?

I chose a cluster setup with Ceph mostly to avoid having a SPOF.
VMs: Home Assistant, OPNsense, and some other low-load VMs, LXCs, and Docker.
 