I have two Intel 530s in two separate nodes, each used as a journal device for three Ceph OSDs (three journal partitions on the SSD). The SSDs have been in use for 18 months; for the past two months I've also been using them as a slog/cache device for a ZFS pool.
Partition layout is as follows:
ceph journal 1: 10GB
ceph journal 2: 10GB
ceph journal 3: 10GB
zfs slog: 1GB
zfs cache: 10GB
Free: 79GB
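For reference, this is roughly how the slog and cache partitions were attached to the pool (the pool name "tank" and the partition numbers here are illustrative, not the real ones):

zpool add tank log /dev/sdb4      # 1GB slog partition
zpool add tank cache /dev/sdb5    # 10GB L2ARC (cache) partition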
Last night I got a SMART warning on one node: "Device: /dev/sdb [SAT], Failed SMART usage Attribute: 170 Available_Reservd_Space."
smartctl -a shows:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 2
9 Power_On_Hours_and_Msec 0x0032 100 100 000 Old_age Always - 3177h+17m+14.350s
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 11
170 Available_Reservd_Space 0x0033 010 010 010 Pre-fail Always FAILING_NOW 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 1
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 5
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 8
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 033 042 000 Old_age Always - 33 (Min/Max 25/42)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 5
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1498365
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 65535
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 1
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 65535
232 Available_Reservd_Space 0x0033 010 010 010 Pre-fail Always FAILING_NOW 0
233 Media_Wearout_Indicator 0x0032 064 064 000 Old_age Always - 0
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1498365
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 16144
249 NAND_Writes_1GiB 0x0032 100 100 000 Old_age Always - 195427
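Converting the raw write counters above into more familiar units (assuming the units implied by the attribute names, 32 MiB and 1 GiB per count respectively):

echo "host writes (TiB): $(echo '1498365 * 32 / 1024 / 1024' | bc -l)"        # ~45.7 TiB
echo "NAND writes (TiB): $(echo '195427 / 1024' | bc -l)"                     # ~190.8 TiB
echo "write amplification: $(echo '195427 * 1024 / (1498365 * 32)' | bc -l)"  # ~4.2x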
On the other node, Available_Reservd_Space is at 14%.
Given the SSD has 79GB of its 120GB free I find this weird in the extreme. All the partitions are used directly as raw devices - no filesystem - so I can't run fstrim. What I did do was (rough commands are sketched after the list):
- stop ceph
- flush the journals
- blkdiscard on the raw device (/dev/sdb), which erased the partition table as well.
- recreate the partitions
- recreate the ceph journals
- run a short and long smartctl test
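The rough commands, from memory (OSD ids, device name and partition sizes are illustrative):

service ceph stop osd.0           # repeated for each of the three OSDs on this node
ceph-osd -i 0 --flush-journal     # flush each OSD's journal to its data disk
blkdiscard /dev/sdb               # TRIM the whole SSD (this is what wiped the partition table)
sgdisk -n 1:0:+10G -n 2:0:+10G -n 3:0:+10G -n 4:0:+1G -n 5:0:+10G /dev/sdb   # recreate the partitions
ceph-osd -i 0 --mkjournal         # recreate each OSD's journal
smartctl -t short /dev/sdb        # then -t long once the short test completed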
Unfortunately Available_Reservd_Space is unchanged. This doesn't make sense to me - with 65% of the drive free I expected to get a lot longer than an 18-month lifespan out of the two SSDs.
Am I missing something? Short of a "Hail Mary", I'll be replacing both SSDs on Monday.