Hi,
I self-host Proxmox on a dedicated server, running on 2 SSDs in a ZFS mirror plus 2 hard drives in an independent pool.
My SMART results on the SSDs are starting to worry me a bit, and I'm thinking about ditching ZFS.
SSD 1:
Code:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 6783
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 49
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 253 253 000 Old_age Always - 1549
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 20
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 48
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 038 000 Old_age Always - 35 (Min/Max 0/62)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 253 253 001 Old_age Offline - 103
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 93247333137
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 2210576455
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 26861405294
SSD 2:
Code:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 6187
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 52
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 254 254 000 Old_age Always - 1544
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 23
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 47
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 064 037 000 Old_age Always - 36 (Min/Max 0/63)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 254 254 001 Old_age Offline - 102
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 93247421976
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 2218162594
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 2670722248
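Both dumps come from smartctl -A. For reference, a minimal sketch of reading the same attributes programmatically (assuming smartmontools >= 7 for the JSON output; /dev/sda is a placeholder, not my actual device):
Code:
#!/usr/bin/env python3
# Sketch: read selected SMART attributes through smartctl's JSON output.
# Assumes smartmontools >= 7 (for -j); /dev/sda is a placeholder device path.
import json
import subprocess

DEVICE = "/dev/sda"

# smartctl uses its exit status as an error bitmask, so don't raise on non-zero.
proc = subprocess.run(["smartctl", "-j", "-A", DEVICE],
                      capture_output=True, text=True)
table = json.loads(proc.stdout)["ata_smart_attributes"]["table"]

# 9 = Power_On_Hours, 202 = Percent_Lifetime_Remain, 246 = Total_LBAs_Written
for attr in table:
    if attr["id"] in (9, 202, 246):
        print(attr["id"], attr["name"], attr["value"], attr["raw"]["value"])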
Some recap:
- It seems to me the "Power_On_Hours" value is incorrect: this server has been running non-stop since March 2020.
- 6783/24 --> 282.6 days for the first SSD
- 14471/24 --> 602.9 days for the first hard drive
- The SSDs were installed at the same time.
- "Total_LBAs_Written" value is huge.
- According to https://www.virten.net/2016/12/ssd-total-bytes-written-calculator/
- 93247421976 --> 43.42 TB
- If I take the "Power_On_Hours" of my hard drive, it gives 73,75Go written per day.
- Wearout Value in Datacenter > Node > Disk
- SSD1: -153%
- SSD2: -154%
- Absolutely zero errors on ZFS health checks / status.
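For reference, the quick math behind the recap above (a sketch; the 512-byte logical sector size is an assumption, matching what the calculator appears to use):
Code:
# Back-of-the-envelope math for the recap, assuming 512-byte logical sectors.
LBAS_WRITTEN = 93247421976    # attribute 246 on the second SSD
HDD_POWER_ON_HOURS = 14471    # hard drive value, used as the real uptime

bytes_written = LBAS_WRITTEN * 512
print(bytes_written / 1024**4)         # ~43.42 TiB (the calculator's "TB")
print(bytes_written / 1000**4)         # ~47.74 TB decimal

days = HDD_POWER_ON_HOURS / 24         # ~602.9 days
print(bytes_written / 1024**3 / days)  # ~73.7 GiB written per day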
My questions:
- Are the SMART values relevant? Since "Power_On_Hours" is likely incorrect, I'm wondering whether "Total_LBAs_Written" is true or not.
- Do the wearout values mean anything? (My guess at the math is sketched below.)
- If I have to redo the server configuration, should I go for ZFS or not?
- If I have to redo my server, would it be okay to use the same SSDs?
- What are the real advantages of ZFS in a homelab setup?
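My guess for the negative wearout figures, which I haven't confirmed against the Proxmox code: the GUI seems to show 100 minus the normalized VALUE of the life-remaining attribute, and attribute 202 reports out-of-range values of 253/254 here:
Code:
# Guess: wearout = 100 - normalized VALUE of attribute 202 (Percent_Lifetime_Remain).
# With the out-of-range values 253/254 reported above, that reproduces the GUI exactly.
for ssd, normalized in [("SSD1", 253), ("SSD2", 254)]:
    print(ssd, 100 - normalized)   # -153% and -154%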
Regards,