Crucial 4TB SSD has 1 bad sector

charleslcso

My fairly new Crucial 4TB SSD, which runs Proxmox Backup Server (Debian), regularly triggers an email telling me there is 1 pending bad sector.

I'm not sure whether this is what causes the ext4 filesystem to become read-only and the server to hard-crash unpredictably.

Below is the output from smartctl.

Should I get this SSD replaced? It is under a 5-year warranty.



root@pbs:~# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.158-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: CT4000MX500SSD1
Serial Number: 2317E6CE78FA
LU WWN Device Id: 5 00a075 1e6ce78fa
Firmware Version: M3CR046
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Nov 12 19:46:59 2024 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       8134
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   086   086   000    Old_age   Always       -       190
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       12
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       174
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   060   026   000    Old_age   Always       -       40 (Min/Max 26/74)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0030   086   086   001    Old_age   Offline      -       14
206 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
210 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1027706144165
247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       8276981584
248 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2312853640

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Completed [00% left] (0-65535)
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
No, a pending bad sector is handled by the drive firmware, and you don't need a new disk yet.
You should run "smartctl -t short /dev/sda" weekly and check the values on SSD/NVMe with "-x" to see wear-out as well ("-x" gives extended output; for HDDs "-a" is fine).
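
The routine suggested above (start a short self-test, then read the attributes) could be sketched roughly as follows. The helper name smart_raw_value is hypothetical, and the field position assumes the ten-column attribute table layout that smartctl prints in this thread; treat it as a starting point rather than a finished monitoring script.

```shell
#!/bin/sh
# Hypothetical helper: print the RAW_VALUE column for one SMART attribute ID.
# Reads `smartctl -A` style output on stdin and assumes the layout
#   ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_raw_value() {
    # $1 = attribute ID, e.g. 197 (pending sectors)
    awk -v id="$1" '$1 == id { print $10; exit }'
}

# Typical use on the server itself (needs root and smartmontools):
#   smartctl -t short /dev/sda                     # start the short self-test
#   smartctl -l selftest /dev/sda                  # read the result a few minutes later
#   smartctl -A /dev/sda | smart_raw_value 197     # pending-sector raw value
```

Logging the value for attribute 197 over time makes it easier to see whether the pending count reported by email ever actually shows up in the attribute table.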
 
But it is causing hard crashes... I'll check "dmesg" the next time it happens, or when the fs becomes read-only.
 
The FS is becoming RO again... any idea why this happens regularly? This system has been up and running without problems for more than 1 year. How can I troubleshoot this?

It is either the bad block causing this, or something else entirely. Some ideas and help would be appreciated.

Code:
root@pbs:/usr/local/# journalctl -xe

Nov 21 07:25:33 pbs postfix/postdrop[413749]: warning: mail_queue_enter: create file maildrop/254905.413749: Read-only file system
Nov 21 07:25:34 pbs cron[413659]: postdrop: warning: mail_queue_enter: create file maildrop/318735.413659: Read-only file system
Nov 21 07:25:34 pbs postfix/postdrop[413659]: warning: mail_queue_enter: create file maildrop/318735.413659: Read-only file system
Nov 21 07:25:36 pbs proxmox-backup-[815]: pbs proxmox-backup-api[815]: authentication failure; rhost=[::ffff:192.168.1.252]:44630 user=root@pam msg=open "/etc/proxmox-backup/tfa.json.lock" failed - E>
Nov 21 07:25:39 pbs proxmox-backup-[874]: pbs proxmox-backup-proxy[874]: rrd_sync_journal failed - EROFS: Read-only file system
Nov 21 07:25:39 pbs proxmox-backup-[815]: pbs proxmox-backup-api[815]: POST /api2/json/access/ticket: 401 Unauthorized: [client [::ffff:192.168.1.252]:44630] permission check failed.
Nov 21 07:25:43 pbs cron[413749]: postdrop: warning: mail_queue_enter: create file maildrop/255156.413749: Read-only file system
Nov 21 07:25:43 pbs postfix/postdrop[413749]: warning: mail_queue_enter: create file maildrop/255156.413749: Read-only file system
Nov 21 07:25:44 pbs cron[413659]: postdrop: warning: mail_queue_enter: create file maildrop/318906.413659: Read-only file system
Nov 21 07:25:44 pbs postfix/postdrop[413659]: warning: mail_queue_enter: create file maildrop/318906.413659: Read-only file system
Nov 21 07:25:46 pbs proxmox-backup-[815]: pbs proxmox-backup-api[815]: authentication failure; rhost=[::ffff:192.168.1.252]:37664 user=root@pam msg=open "/etc/proxmox-backup/tfa.json.lock" failed - E>
Nov 21 07:25:49 pbs proxmox-backup-[874]: pbs proxmox-backup-proxy[874]: rrd_sync_journal failed - EROFS: Read-only file system
Nov 21 07:25:49 pbs proxmox-backup-[815]: pbs proxmox-backup-api[815]: POST /api2/json/access/ticket: 401 Unauthorized: [client [::ffff:192.168.1.252]:37664] permission check failed.
Nov 21 07:25:53 pbs cron[413749]: postdrop: warning: mail_queue_enter: create file maildrop/255376.413749: Read-only file system
Nov 21 07:25:53 pbs postfix/postdrop[413749]: warning: mail_queue_enter: create file maildrop/255376.413749: Read-only file system
Nov 21 07:25:54 pbs cron[413659]: postdrop: warning: mail_queue_enter: create file maildrop/319081.413659: Read-only file system
Nov 21 07:25:54 pbs postfix/postdrop[413659]: warning: mail_queue_enter: create file maildrop/319081.413659: Read-only file system
Nov 21 07:25:54 pbs rsyslogd[729]: action 'action-9-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2102.0>
Nov 21 07:25:54 pbs rsyslogd[729]: action 'action-9-builtin:omfile' resumed (module 'builtin:omfile') [v8.2102.0 try https://www.rsyslog.com/e/2359 ]
Nov 21 07:25:54 pbs rsyslogd[729]: action 'action-9-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2102.0>
Nov 21 07:25:54 pbs rsyslogd[729]: action 'action-9-builtin:omfile' suspended (module 'builtin:omfile'), next retry is Thu Nov 21 07:26:24 2024, retry nbr 0. There should be messages before this one >
Nov 21 07:25:57 pbs proxmox-backup-[815]: pbs proxmox-backup-api[815]: authentication failure; rhost=[::ffff:192.168.1.252]:46540 user=root@pam msg=open "/etc/proxmox-backup/tfa.json.lock" failed - E>
Nov 21 07:25:59 pbs proxmox-backup-[874]: pbs proxmox-backup-proxy[874]: rrd_sync_journal failed - EROFS: Read-only file system
Nov 21 07:26:00 pbs proxmox-backup-proxy[874]: lookup_datastore failed - open "/etc/proxmox-backup/.datastore.lck" failed - EROFS: Read-only file system
Nov 21 07:26:00 pbs proxmox-backup-[815]: pbs proxmox-backup-api[815]: POST /api2/json/access/ticket: 401 Unauthorized: [client [::ffff:192.168.1.252]:46540] permission check failed.
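
The journal excerpt above shows only the symptoms (services failing with EROFS). One way to confirm which filesystem has actually gone read-only is to look at the live mount table. The sketch below assumes the usual /proc/mounts column order (device, mount point, type, options); the function name readonly_mounts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print mount points of a given filesystem type that are
# currently mounted read-only. Reads /proc/mounts-style lines on stdin:
#   device mountpoint fstype options dump pass
readonly_mounts() {
    # $1 = filesystem type to check, e.g. ext4
    awk -v fs="$1" '$3 == fs {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }   # exact "ro", not "errors=remount-ro"
    }'
}

# Typical use on the affected server:
#   readonly_mounts ext4 < /proc/mounts
```

If the root filesystem is listed here while it is mounted with errors=remount-ro, ext4 most likely hit an error and remounted itself, which would point towards the kernel log for the original cause.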
 
Hi

Did you run update-smart-drivedb, and do you check the SMART values on a regular basis?

The following does not look good and may indicate a coming drive failure…

180 Unused_Rsvd_Blk_Cnt_Tot 0x0033 000 000 000 Pre-fail Always - 174
 
Just did that. Still at 7.2:

Code:
root@pbs:~# update-smart-drivedb
/var/lib/smartmontools/drivedb/drivedb.h updated from branches/RELEASE_7_2_DRIVEDB
root@pbs:~# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.158-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     CT4000MX500SSD1
Serial Number:    2317E6CE78FA
LU WWN Device Id: 5 00a075 1e6ce78fa
Firmware Version: M3CR046
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov 24 09:39:06 2024 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  30) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x0031)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       8414
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   084   084   000    Old_age   Always       -       214
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       13
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       174
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   034   026   000    Old_age   Always       -       66 (Min/Max 26/74)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   084   084   001    Old_age   Offline      -       16
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1188454344591
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       9564987072
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       2591091654

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Completed [00% left] (0-65535)
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@pbs:~#
 
The SMART info seems OK, but I would still suspect the SSD. What would be interesting are the log lines from before the filesystem went read-only … maybe there are some hints.
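
A rough sketch of how those earlier log lines might be pulled out: filter the kernel log for typical storage-related messages. The patterns below are assumptions based on common ext4, block-layer, and libata message formats, not an exhaustive list, and the function name storage_errors is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that look storage-related.
# The patterns are illustrative assumptions; adjust them to what the system logs.
storage_errors() {
    grep -E -i 'EXT4-fs (error|warning)|I/O error|Remounting filesystem read-only|(^|[^[:alnum:]])ata[0-9]+(\.[0-9]+)?: '
}

# Typical use while the filesystem is read-only (kernel ring buffer still works):
#   dmesg -T | storage_errors
# Or from the journal, for the current and the previous boot:
#   journalctl -k -b | storage_errors
#   journalctl -k -b -1 | storage_errors   # only if the journal is persistent
```

Note that once the root filesystem is read-only, new messages may not reach the log files on disk, so capturing dmesg before rebooting is probably the safer option.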