smartd false positive SSD CurrentPendingSector?

Discussion in 'Proxmox VE: Installation and configuration' started by jermudgeon, Jun 15, 2018.

  1. jermudgeon

    jermudgeon New Member
    Proxmox Subscriber

    Joined:
    Apr 7, 2016
    Messages:
    8
    Likes Received:
    0
    Since April 30, I've received 7 warning emails from one host, 5 from a second, and 2 from a third.

    Each email claims that 1 CurrentPendingSector has failed, that is, 1 sector is currently unreadable (pending).

    On each host it's the same type of drive, a CT1000MX500SSD1 (Crucial 1TB).

    Running smartctl manually shows *no* sectors failed.

    systemctl status shows info like this:
    Jun 13 21:29:19 pm2 smartd[1196]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
    Jun 14 00:29:18 pm2 smartd[1196]: Device: /dev/sda [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 email

    Does this imply that the drives are actually kicking up errors but fixing them? (I thought SSDs did that silently, until wearout, with a different SMART attribute indicating percentage.)

    It doesn't appear to be service-impacting, but I haven't found good info via Google.

    Has anybody else seen this, or is this an obvious question for Crucial?
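
To see how often the warning fires and how quickly it clears, the smartd lines quoted above can be tallied from a saved journal excerpt. This is only a rough sketch: the function name `pending_events` and the regular expression are illustrative assumptions that match the two message shapes shown in this post, not anything provided by smartmontools.

```python
import re

# Sample lines copied from the journal output quoted above.
LOG = """\
Jun 13 21:29:19 pm2 smartd[1196]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Jun 14 00:29:18 pm2 smartd[1196]: Device: /dev/sda [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 email
"""

# Hypothetical pattern: matches only the two message shapes shown above.
PENDING = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ [\d:]{8}) \S+ smartd\[\d+\]: "
    r"Device: (?P<dev>\S+) .*?, (?P<count>\d+|No more) Currently unreadable"
)


def pending_events(text):
    """Return (timestamp, device, count) tuples; a reset is reported as 0."""
    events = []
    for line in text.splitlines():
        m = PENDING.search(line.strip())
        if m:
            count = 0 if m["count"] == "No more" else int(m["count"])
            events.append((m["stamp"], m["dev"], count))
    return events


if __name__ == "__main__":
    for stamp, dev, count in pending_events(LOG):
        print(stamp, dev, count)
```

Each warning followed by a reset on the same device is one episode; the gap between the two timestamps shows how long the sector stayed pending.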
     
  2. Andrew Hart

    Andrew Hart Member

    Joined:
    Dec 1, 2017
    Messages:
    67
    Likes Received:
    9
    At least on the newest Intel SSDs, Pending Sector no longer means the same thing. As far as I can tell, it now indicates that a block will be remapped soon.
    On an HDD it always meant that a sector could not be read and the drive was waiting for you to write to it so that it could be remapped (also as far as I know).
     
  3. jermudgeon

    Thanks, Andrew. That makes sense.
     
  4. 123paul

    123paul New Member

    Joined:
    Aug 31, 2018
    Messages:
    4
    Likes Received:
    0
    Did you manage to find a solution for this? I started getting these messages on my homelab today.

    By the way, I found this specification from Micron covering all their SMART attributes. It might be useful for anyone who has the same problem and is starting to worry about wearout on their disks.
     
  5. jermudgeon

    No change, I still get these occasionally, and then the count resets to zero.
     
  6. Andrew Hart

    If it is the same problem, you'll find that the pending sectors increase, maybe up to 17, then reset to 0, and the remapped count increases by just 1.

    17 is the highest I've seen, I think. So keep an eye on it and check that your SSD has up-to-date firmware.

    If you think there really are pending sectors, you can read the whole disk with "dd if=/dev/sda of=/dev/null bs=1M". If there are real pending sectors, the read will probably hit I/O errors and may crash your system.
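
The pattern described above (pending count climbs, drops back to 0, remapped count goes up by 1) can be checked against periodically recorded values. The sketch below assumes you have already collected (pending, reallocated) raw values in time order by some means of your own; the function name `pending_cycles` and the sample numbers are made up for illustration.

```python
def pending_cycles(samples):
    """Summarize episodes in which the pending count rose and later reset.

    samples: (pending, reallocated) raw values in time order.
    Returns one (peak_pending, reallocated_delta) tuple per episode.
    Episodes that start before any zero-pending sample are skipped,
    because the reallocated baseline is unknown for them.
    """
    cycles = []
    baseline = None  # reallocated count at the last sample with pending == 0
    peak = 0         # highest pending count seen in the current episode
    for pending, realloc in samples:
        if pending > 0:
            peak = max(peak, pending)
        else:
            if peak > 0 and baseline is not None:
                cycles.append((peak, realloc - baseline))
            baseline = realloc
            peak = 0
    return cycles


if __name__ == "__main__":
    # Hypothetical readings: pending climbs to 17, resets, remapped goes 0 -> 1.
    history = [(0, 0), (1, 0), (9, 0), (17, 0), (0, 1)]
    print(pending_cycles(history))
```

An episode whose remapped delta is much larger than 1, or a pending count that never returns to 0, would not fit the behaviour described in this post and deserves a closer look.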
     
  7. stormtronix

    stormtronix New Member

    Joined:
    Jul 23, 2014
    Messages:
    3
    Likes Received:
    0
    We have the same issue with the same SSDs, CT1000MX500SSD1 (Crucial 1TB).
    I also noticed that smartctl does not recognize many of the SSDs' attributes:

    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
      5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       561
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
    171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
    172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
    173 Unknown_Attribute       0x0032   098   098   000    Old_age   Always       -       34
    174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       6
    180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       42
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0022   074   055   000    Old_age   Always       -       26 (Min/Max 0/45)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
    202 Unknown_SSD_Attribute   0x0030   098   098   001    Old_age   Offline      -       2
    206 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
    210 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
    246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       4555280040
    247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       74855687
    248 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       991046634

    I did not find anything about what these unknown attributes could be - any idea?
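
For anyone who wants to list the unrecognized attributes automatically, here is a small sketch that parses rows in the layout of the table above from saved text. The helper names `parse_attributes` and `unknown_ids` are illustrative, and the parser assumes exactly the ten whitespace-separated columns shown here.

```python
# A few rows copied from the attribute listing above.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
173 Unknown_Attribute 0x0032 098 098 000 Old_age Always - 34
194 Temperature_Celsius 0x0022 074 055 000 Old_age Always - 26 (Min/Max 0/45)
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
202 Unknown_SSD_Attribute 0x0030 098 098 001 Old_age Offline - 2
"""


def parse_attributes(text):
    """Map attribute ID to its name, normalized values and raw value."""
    attrs = {}
    for line in text.splitlines():
        # At most 10 fields; the last one keeps the whole raw value,
        # including suffixes such as "(Min/Max 0/45)".
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header or unrelated line
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9].strip(),
        }
    return attrs


def unknown_ids(attrs):
    """IDs whose name smartctl could not resolve."""
    return sorted(i for i, a in attrs.items() if a["name"].startswith("Unknown"))


if __name__ == "__main__":
    print(unknown_ids(parse_attributes(SAMPLE)))
```

The resulting IDs are what you would look up in the vendor's SMART documentation or in a newer drive database.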
     
  8. 123paul

    I see I forgot to link the document I found in my previous reply. Seems I can't paste external links as my account is too new.

    Just Google this "tnfd22_client_ssd_smart_attributes.pdf"

    I'm still having this issue; it seems to be Crucial-specific.
     
  9. 123paul

    I found out that a firmware update (M3CR022) was released in June to fix this issue. I will try it and report back in a couple of days on whether it helped.
     
  10. stormtronix

    Hi Paul,
    any success with the new firmware?
     
  11. 123paul

    I haven't been able to update the firmware yet, as I didn't manage to boot from a USB drive with the firmware release from Crucial. I would need to attach the SSD to a Windows machine to try the firmware update there, but I only have Macs around for now.

    So if someone else manages to test it sooner, I would be curious to know the outcome.
     
  12. Paspao

    Paspao Member

    Joined:
    Aug 1, 2017
    Messages:
    32
    Likes Received:
    1
    I have been getting the same error for a couple of days, and it seems to clear itself after an hour.

    Apr 23 14:27:14 proxmox smartd[1495]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
    ...
    Apr 23 15:27:14 proxmox smartd[1495]: Device: /dev/sda [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 email

    My drive is an Intel SSD DC S3520 1.2TB with the latest firmware (N2010121).

    It is one of 10 Ceph OSD drives.

    I will keep monitoring it.
     
  13. Paspao

    Hello,

    I am still getting the same messages only for one of my Ceph OSDs.

    I ran SMART self-tests, which completed successfully:
    SMART Self-test log structure revision number 1

    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%      1707         -
    # 2  Short offline       Completed without error       00%      1705         -

    And I see 1 Reallocated_Sector_Ct:

    Code:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       1
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1793
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
    170 Unknown_Attribute       0x0033   099   099   010    Pre-fail  Always       -       0
    171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
    172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
    174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       3
    175 Program_Fail_Count_Chip 0x0033   100   100   010    Pre-fail  Always       -       69088392870
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   073   063   000    Old_age   Always       -       27 (Min/Max 16/37)
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3
    194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       27
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
    199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
    225 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       343954
    226 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       327
    227 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       19
    228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       107596
    232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail  Always       -       0
    233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
    234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
    241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       343954
    242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       85337
    243 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       698199

    As this is a new SSD, should I be worried and ask for a replacement, or would you suggest running other tests on it?

    Thank you.
    P.
     
  14. jjd

    jjd New Member

    Joined:
    May 17, 2019
    Messages:
    1
    Likes Received:
    0
    Hi,

    The drive database is out of date.

    Update it by grabbing the latest db file from here:-
    www-smartmontools-org/export/4914/trunk/smartmontools/drivedb.h
    Replace the dashes with dots. Stupid site won't let me post a URL.

    Stick the file in here:-
    /var/lib/smartmontools/drivedb/drivedb.h

    And restart your smartd service on proxmox vm or the proxmox server.

    systemctl restart smartd.service

    /var/lib/smartmontools/smartd.CT1000MX500SSD1-1912E1F3465D.ata.state

    for example on my box: smartctl -P show /dev/sdb
    shows:-

    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-12-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke

    Drive found in smartmontools Database. Drive identity strings:
    MODEL: CT1000MX500SSD1
    FIRMWARE: M3CR023
    match smartmontools Drive Database entry:
    MODEL REGEXP: Crucial_CT(128|256|512)MX100SSD1|Crucial_CT(200|250|256|500|512|1000|1024)MX200SSD[1346]|Crucial_CT(275|525|750|1050|2050)MX300SSD[14]|Crucial_CT(120|240|480|960)M500SSD[134]|Crucial_CT(128|256|512|1024)M550SSD[134]|CT(120|240|480)BX300SSD1|CT(120|240|480|960)BX500SSD1|CT(250|500|1000|2000)MX500SSD[14]|Micron_M500_MTFDDA[KTV](120|240|480|960)MAV|Micron_M500DC_(EE|MT)FDDA[AK](120|240|480|800)MBB|(Micron[_ ])?M500IT[_ ]MTFDDA[KTY](032|050|060|064|120|128|240|256)[MS]BD|(Micron_)?M510[_-]MTFDDA[KTV](128|256)MAZ|MICRON_M510DC_(EE|MT)FDDAK(120|240|480|800|960)MBP|(Micron_)?M550[_-]MTFDDA[KTV](064|128|256|512|1T0)MAY|Micron_M600_(EE|MT)FDDA[KTV](128|256|512|1T0)MBF[25Z]?|(Micron_1100_)?MTFDDA[KV](256|512|1T0|2T0)TBN|Micron 1100 SATA (256G|512G|1T|2T)B
    FIRMWARE REGEXP: .*
    MODEL FAMILY: Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
    ATTRIBUTE OPTIONS: 005 Reallocate_NAND_Blk_Cnt
                       170 Reserved_Block_Count
                       171 Program_Fail_Count
                       172 Erase_Fail_Count
                       173 Ave_Block-Erase_Count
                       174 Unexpect_Power_Loss_Ct
                       180 Unused_Reserve_NAND_Blk
                       183 SATA_Interfac_Downshift
                       184 Error_Correction_Count
                       195 Cumulativ_Corrected_ECC
                       202 Percent_Lifetime_Remain
                       206 Write_Error_Rate
                       210 Success_RAIN_Recov_Cnt
                       246 Total_Host_Sector_Write
                       247 Host_Program_Page_Count
                       248 FTL_Program_Page_Count



    I am still assessing whether this will stop the error.
    Am waiting in anticipation.
    But at least with the drive recognized you stand a better chance.
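
To illustrate why the drive is recognized once the database entry is present, here is a rough sketch of the two lookups involved: matching the model string against the entry's regular expression, and translating attribute IDs to the names listed above. It uses only the MX500 alternative from the MODEL REGEXP line, assumes the whole model string must match, and the names `model_matches` and `attribute_name` are made up for this example.

```python
import re

# One alternative taken from the MODEL REGEXP line above, not the full pattern.
MX500_MODEL = r"CT(250|500|1000|2000)MX500SSD[14]"

# Attribute names as listed under ATTRIBUTE OPTIONS above.
CRUCIAL_NAMES = {
    5: "Reallocate_NAND_Blk_Cnt",
    170: "Reserved_Block_Count",
    171: "Program_Fail_Count",
    172: "Erase_Fail_Count",
    173: "Ave_Block-Erase_Count",
    174: "Unexpect_Power_Loss_Ct",
    180: "Unused_Reserve_NAND_Blk",
    183: "SATA_Interfac_Downshift",
    184: "Error_Correction_Count",
    195: "Cumulativ_Corrected_ECC",
    202: "Percent_Lifetime_Remain",
    206: "Write_Error_Rate",
    210: "Success_RAIN_Recov_Cnt",
    246: "Total_Host_Sector_Write",
    247: "Host_Program_Page_Count",
    248: "FTL_Program_Page_Count",
}


def model_matches(model, pattern=MX500_MODEL):
    """True if the whole model string matches the pattern (an assumption)."""
    return re.fullmatch(pattern, model) is not None


def attribute_name(attr_id):
    """Name from the entry above, or None if the entry does not override it."""
    return CRUCIAL_NAMES.get(attr_id)


if __name__ == "__main__":
    print(model_matches("CT1000MX500SSD1"))
    print(attribute_name(202))
```

With an older database that lacks the MX500 alternative, the model string matches nothing, which would explain the Unknown_Attribute names seen earlier in the thread.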

    According to the PDF (Google tnfd22_client_ssd_smart_attributes.pdf), I have found that attribute 202 holds the wear value.

    In my case:

    202 Percent_Lifetime_Remain 0x0030 100 100 001

    The first number after the flag is the current value, so 100% remaining. When it drops to the threshold of 001, the attribute counts as failed and the device will go read-only.

    "This value gives the threshold inverted value of the raw data value below. That is, if 30% of the lifetime has been used, this value will report 70%. A value of 0% indicates that 100% of the expected lifetime has been used."

    Regards
    Joe.
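
The arithmetic behind attribute 202, as described in the PDF passage quoted above, can be sketched like this. The function name `summarize` is illustrative; the column positions assume the shortened line layout shown in this post (ID, name, flag, value, worst, threshold), and the failure test uses the usual SMART rule that an attribute fails when its value is at or below its threshold.

```python
# The line quoted above: ID, name, flag, value, worst, threshold.
LINE = "202 Percent_Lifetime_Remain 0x0030 100 100 001"


def summarize(line):
    """Return remaining and used lifetime in percent, plus the failure state."""
    fields = line.split()
    value = int(fields[3])   # normalized value = percent of lifetime remaining
    thresh = int(fields[5])  # threshold at or below which the attribute fails
    return {
        "remaining": value,
        "used": 100 - value,  # e.g. 30% used is reported as a value of 70
        "failed": value <= thresh,
    }


if __name__ == "__main__":
    print(summarize(LINE))
```

For the line above this gives 100% remaining and 0% used; a drive reporting a value of 070 would have used 30% of its expected lifetime.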
     
  15. Paspao

    Hello,

    Thank you. I updated the drive database and have not received any alerts for days.

    So should those alerts be considered false positives?

    Thanks,
    P.
     