I have two Intel 530s in two separate nodes, each used as a journal device for three Ceph OSDs (three journal partitions on the SSD). The SSDs have been in use for 18 months; for the past two months I've also been using them as a slog/cache device for a ZFS pool.
Partition layout is as follows:
ceph journal 1: 10GB
ceph journal 2: 10GB
ceph journal 3: 10GB
zfs slog: 1GB
zfs cache: 10GB
Free: 79GB
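For reference, this is roughly how the slog and cache partitions were attached to the pool (the pool name "tank" and the partition numbers here are illustrative, not the real ones):

zpool add tank log /dev/sdb4      # 1GB slog partition
zpool add tank cache /dev/sdb5    # 10GB L2ARC (cache) partition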
Last night I got a SMART warning on one node: "Device: /dev/sdb [SAT], Failed SMART usage Attribute: 170 Available_Reservd_Space."
smartctl -a shows:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 2
9 Power_On_Hours_and_Msec 0x0032 100 100 000 Old_age Always - 3177h+17m+14.350s
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 11
170 Available_Reservd_Space 0x0033 010 010 010 Pre-fail Always FAILING_NOW 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 1
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 5
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 8
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 033 042 000 Old_age Always - 33 (Min/Max 25/42)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 5
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1498365
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 65535
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 1
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 65535
232 Available_Reservd_Space 0x0033 010 010 010 Pre-fail Always FAILING_NOW 0
233 Media_Wearout_Indicator 0x0032 064 064 000 Old_age Always - 0
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1498365
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 16144
249 NAND_Writes_1GiB 0x0032 100 100 000 Old_age Always - 195427
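Converting the raw write counters above into more familiar units (assuming the units implied by the attribute names, 32 MiB and 1 GiB per count respectively):

echo "host writes (TiB): $(echo '1498365 * 32 / 1024 / 1024' | bc -l)"        # ~45.7 TiB
echo "NAND writes (TiB): $(echo '195427 / 1024' | bc -l)"                     # ~190.8 TiB
echo "write amplification: $(echo '195427 * 1024 / (1498365 * 32)' | bc -l)"  # ~4.2x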
On the other node, Available_Reservd_Space is at 14%.
Given the SSD has 79GB of its 120GB free I find this weird in the extreme. All the partitions are used directly as raw devices - no filesystem - so I can't run fstrim. What I did do was (rough commands are sketched after the list):
- stop ceph
- flush the journals
- blkdiscard on the raw device (/dev/sdb), which erased the partition table as well.
- recreate the partitions
- recreate the ceph journals
- run a short and long smartctl test
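The rough commands, from memory (OSD ids, device name and partition sizes are illustrative):

service ceph stop osd.0           # repeated for each of the three OSDs on this node
ceph-osd -i 0 --flush-journal     # flush each OSD's journal to its data disk
blkdiscard /dev/sdb               # TRIM the whole SSD (this is what wiped the partition table)
sgdisk -n 1:0:+10G -n 2:0:+10G -n 3:0:+10G -n 4:0:+1G -n 5:0:+10G /dev/sdb   # recreate the partitions
ceph-osd -i 0 --mkjournal         # recreate each OSD's journal
smartctl -t short /dev/sdb        # then -t long once the short test completed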
Unfortunately Available_Reservd_Space is unchanged. This doesn't make sense to me - with 65% of the drive free I expected to get a lot longer than an 18-month lifespan out of the two SSDs.
Am I missing something? Short of a "Hail Mary", I'll be replacing both SSDs on Monday.