Hi,
You have to look at the endurance specs of your SSDs, especially DWPD (Drive Writes Per Day, i.e. how many times the full capacity can be written each day) or TBW (Terabytes Written).
On a low budget, I preferred to use a single SSD for everything, but I opted for a fairly high-end consumer model rated at 2400 TBW:
https://shop.sandisk.com/products/ssd/internal-ssd/wd-black-sn850x-nvme-ssd?sku=WDS400T2X0E-00BCA0
For an SSD running 24x7, you can convert between the two with a formula like:
Code:
TBW = DWPD × Drive Capacity (TB) × 365 × Warranty Years
Or use "Service Years" instead of Warranty Years if you plan to keep the drives in service for a different period.
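To make the conversion concrete, here is a minimal Python sketch of the formula above, in both directions. The function names are mine, and the 5-year period is an assumption for illustration; check the warranty period on your drive's datasheet.

```python
def dwpd_to_tbw(dwpd, capacity_tb, years):
    """TBW = DWPD x capacity (TB) x 365 x years."""
    return dwpd * capacity_tb * 365 * years


def tbw_to_dwpd(tbw, capacity_tb, years):
    """Same formula, solved for DWPD."""
    return tbw / (capacity_tb * 365 * years)


# 4 TB drive rated at 2400 TBW, assuming a 5-year period (assumption)
print(round(tbw_to_dwpd(2400, 4, 5), 2))   # 0.33

# A hypothetical 4 TB drive rated at 3 DWPD over the same 5 years
print(dwpd_to_tbw(3, 4, 5))                # 21900
```

So a 2400 TBW rating on 4 TB works out to roughly a third of a drive write per day under that assumption, which gives an idea of the gap with a 3 DWPD drive.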
Once you have selected your drives, you must then watch the SMART data regularly. I previously had an unsuitable SSD (for the same price) that I had to replace after a few months. So don't try to save 50€ on your SSDs, or you might have to replace them.
Make sure the SSDs have a cache (RAM, for reads) and some SLC cache. That's why I opted for the SN850X. My previous SSD had neither, so it was terribly slow under load (concurrency) and aged very quickly (about 2% per month).
Real use case: after 10 months of intensive use in a home game server & lab setup, running Ceph + CephFS on a 3-node cluster with a full-mesh network for the Ceph storage daemons, on a 4TB WD_BLACK:
Code:
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 5%
Data Units Read: 260,749,322 [133 TB]
Data Units Written: 523,270,552 [267 TB]
Host Read Commands: 2,919,761,751
Host Write Commands: 13,178,553,837
Controller Busy Time: 32,990
Power Cycles: 92
Power On Hours: 6,705
Unsafe Shutdowns: 85
Media and Data Integrity Errors: 0
Error Information Log Entries: 2
Warning Comp. Temperature Time: 2
Critical Comp. Temperature Time: 0
I wrote 267 TB, which is a bit more than 10% of the 2400 TBW rating (about 11%), while Percentage Used is only 5%, which is VERY good.
The Unsafe Shutdowns count might look absurd, but there is an explanation: I had a poor Tuya remote power plug from AliExpress that was power-cycling this server every 5 or 10 minutes, and it took me some time to realize the problem and replace the faulty plug.
6,705 hours is about 280 days.
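If you want to extrapolate from those numbers, here is a rough Python sketch. It assumes the write rate stays constant, which is a simplification, and the function names are mine. Percentage Used is reported as a whole number, so the second estimate is especially coarse.

```python
def years_until_tbw(tb_written, power_on_hours, rated_tbw):
    """Years of powered-on time until the rated TBW is reached, at the observed write rate."""
    tb_per_day = tb_written / (power_on_hours / 24)
    return tb_per_day, rated_tbw / tb_per_day / 365


def years_until_worn(percentage_used, power_on_hours):
    """Years of powered-on time until Percentage Used reaches 100%, at the observed pace."""
    total_hours = power_on_hours * 100 / percentage_used
    return total_hours / 24 / 365


tb_per_day, years_tbw = years_until_tbw(267, 6705, 2400)
print(f"{tb_per_day:.2f} TB/day, about {years_tbw:.1f} years to reach 2400 TBW")
print(f"about {years_until_worn(5, 6705):.1f} years to reach 100% used")
```

With the values from the SMART log above, this gives a bit under 1 TB written per day, about 7 years to reach the TBW rating, and about 15 years going by Percentage Used.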
So, whatever SSD you choose, and even before deciding how many SSDs you'll put in your servers, make sure to:
* Use SSDs with a high TBW or DWPD
* Use SSDs with a real cache (RAM), and not a pseudo cache provided by the driver.
* Use SSDs with a power-safe write cache on SLC (Single-Level Cell). This is a small zone of the SSD reserved for those operations.
* Run away from QLC (Quad-Level Cell) if you do not want to lose your money (and your data).
* Monitor your disks.
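For the monitoring part, here is a small Python sketch that parses a text report in the same format as the SMART log above and flags the most obvious problems. The 80% wear threshold and the function names are my own choices, not a standard; adjust them to your needs. The data unit conversion relies on NVMe data units being 1000 x 512 bytes.

```python
import re

# Excerpt of the report shown above; in practice, feed in the full text output.
SAMPLE = """\
Critical Warning: 0x00
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 5%
Data Units Written: 523,270,552 [267 TB]
Media and Data Integrity Errors: 0
"""


def parse_smart(text):
    """Turn 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def first_int(value):
    """Leading integer of a value such as '100%' or '523,270,552 [267 TB]'."""
    return int(re.match(r"[\d,]+", value).group().replace(",", ""))


def tb_written(fields):
    """One NVMe data unit is 1000 x 512 bytes."""
    return first_int(fields["Data Units Written"]) * 512_000 / 1e12


def check_health(fields, max_percent_used=80):
    """Return a list of human-readable problems (empty if all looks fine)."""
    problems = []
    if int(fields["Critical Warning"], 16) != 0:
        problems.append("critical warning flag set")
    if first_int(fields["Available Spare"]) <= first_int(fields["Available Spare Threshold"]):
        problems.append("available spare at or below threshold")
    if first_int(fields["Percentage Used"]) >= max_percent_used:
        problems.append("percentage used above limit")
    if first_int(fields["Media and Data Integrity Errors"]) != 0:
        problems.append("media/data integrity errors reported")
    return problems


fields = parse_smart(SAMPLE)
print(check_health(fields))            # []
print(f"{tb_written(fields):.1f} TB")  # 267.9 TB
```

You could run something like this from cron against the output of your SMART tool and send yourself a mail when the list is not empty.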
Professional SSDs will be 3-5x more expensive, but they offer more constant throughput under high load, with better reliability and predictability. They:
* Have a much faster / bigger SLC write cache
* Have a faster controller to operate under higher loads
* Offer 2-3x more TBW or DWPD (at least 3 DWPD).
In today's market, and since many people are not familiar with these characteristics, you can find crappy SSDs at the same price as excellent ones. Watch out.