MX500 SSD SMART Errors

Valhera

New Member
Jun 26, 2016
22
0
1
45
Hi Guys,

I have a bunch of Proxmox servers all running new MX500 SSDs with ZFS; all is going well and performance is amazing.

That said, I keep getting SMART errors like the one below for every drive. I am not overly worried, as ZFS does its own checks and isn't seeing any errors on the drives, but it would be good to know what can be done about these alerts.

This message was generated by the smartd daemon running on:

host name: removed
DNS domain: removed.com.au

The following warning/error was logged by the smartd daemon:

Device: /dev/sdg [SAT], 1 Currently unreadable (pending) sectors

Device info:
CT1000MX500SSD1, S/N:1846E1D768FD, WWN:5-00a075-1e1d768fd, FW:M3CR023, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.
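For reference, the counters that smartd is complaining about can be pulled straight from the attribute table; a minimal sketch, assuming smartmontools is installed (attribute names differ slightly between firmware revisions, so the IDs are matched here):

Code:
# Full SMART attribute table for the drive from the warning
smartctl -A /dev/sdg

# Just the reallocation/pending counters (IDs 5, 196, 197)
smartctl -A /dev/sdg | awk '$1 == 5 || $1 == 196 || $1 == 197'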
 

LnxBil

Famous Member
Feb 21, 2015
4,439
442
103
Germany
Please also check the wearout. Those cheap, non-enterprise-grade SSDs tend to wear out very fast.
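A quick way to read the wear from the CLI as well as from the GUI (the device name is an example, and the exact attribute names vary by firmware):

Code:
# Wear-related attributes on the MX500 (percent lifetime remaining, average block erase count)
smartctl -A /dev/sdX | grep -Ei 'lifetime|erase'

# Proxmox also shows a Wearout column under Node -> Disks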
 

Valhera

New Member
Jun 26, 2016
22
0
1
45
Actually, that's not the case for this model of SSD. We have hundreds installed and the wear-out rate is not fast as you suggest; we have been using this model for a number of years now specifically for that reason. The drives in question are also brand new, so there is no wear at all yet.
 

tim

Proxmox Staff Member
Staff member
Oct 1, 2018
184
18
18
If you want to dig into this, there is a wiki article about it:
https://www.smartmontools.org/wiki/BadBlockHowto

In general I wouldn't be too worried about it if you are running hundreds of these and only one has reported a single unreadable sector.
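The howto essentially boils down to locating the suspect sector with a self-test and then forcing a write to it so the drive can remap it; as a first step, a long self-test plus a look at the counters usually tells you whether there is anything to chase. A sketch, using the device from the smartd mail:

Code:
# Start an extended self-test (reads the whole surface in the background)
smartctl -t long /dev/sdg

# Check the result once it has finished
smartctl -l selftest /dev/sdg

# Did the pending sector clear, or turn into a reallocation? (IDs 196/197)
smartctl -A /dev/sdg | awk '$1 == 196 || $1 == 197'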
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
2,298
240
63
AFAIK the MX500 has a firmware problem where the Pending Sectors count goes to 1 and then back to 0. There was a firmware update from Crucial, but they withdrew it again (probably because it caused other problems). However, it seems that as long as the Pending Sectors go back down without the reallocated sectors going up, it's nothing to worry about too much.
See e.g.:
https://forums.unraid.net/topic/79358-keep-getting-current-pending-sector-is-1-warnings-solved/
https://www.ixsystems.com/community/threads/what-does-critical-error-currently-unreadable-pending-sectors-mean.64309/

(Crucial's forums also have 1-2 threads, but the site seems down for me at the moment)
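One way to confirm that pattern for yourself (pending count bouncing to 1 and back while reallocations stay flat) is to log the two attributes periodically; a minimal sketch, assuming a cron-driven shell script and SATA device names:

Code:
#!/bin/sh
# Hypothetical /etc/cron.hourly/smart-pending: timestamped log of attributes 196/197
for d in /dev/sd?; do
    printf '%s %s ' "$(date -Is)" "$d"
    smartctl -A "$d" | awk '$1 == 196 || $1 == 197 {printf "%s=%s ", $2, $10}'
    echo
done >> /var/log/smart-pending.log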

Hope this helps!
 

dr0

New Member
Aug 29, 2019
4
0
1
Ukraine
A few weeks ago I bought a 1TB MX500 for my personal system [i7-3770/32GB RAM/H77/GTX1050Ti/Win 7 x64 SP1] to replace a combo of a smaller SSD and an HDD. My MX500 also shows this weird behavior where the drive occasionally reports attribute #197 (Current Pending Sector Count) going from 0 to 1, but within a few minutes it changes back to 0. Reallocation Event Count (#196) never changes and always stays at 0.



I have never seen anything like this with any SSD I've had a chance to use. My other SSDs have tens of thousands of hours of active use and tens of TB of lifetime writes, but not a single one of them has a pending sector in its logs, including my trusty Crucial M4 that I've been using since 2011. My MX500, on the other hand, has only 0.68 TB of lifetime writes and only 91 hours of active time but has already had 3 pending-sector events. I wonder what causes this? And if it is indeed just a firmware bug, why hasn't Crucial fixed it to this day, more than 18 months after the MX500 came to market?
 

Jarvar

Member
Aug 27, 2019
40
1
8
Just noticed, after putting an MX500 into use roughly two weeks ago (maybe a few days more), that the drive shows 94% life left. It is running Windows Server 2019 Standard. Is this normal?
The host also has an ADATA XPG SX8200 PRO as the boot drive, and the drives are set up with a local-zfs structure for VM storage. The ADATA shows 1% lifetime used, but then again it has a 600 TBW write endurance compared to around 360 TBW for the Crucial. Total writes show 7.47 TB for the ADATA and about 1 TB less for the Crucial at 6.48 TB. Are these numbers normal, or is something eating away at the writes?
I have been doing lots of backups, VM cloning and testing, and it has been running Windows Server 24/7 since then.
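If you want to see where the writes are coming from, a couple of starting points from the shell (iostat needs the sysstat package; device and pool names are examples):

Code:
# Per-device throughput, refreshed every 5 seconds
iostat -mx 5

# Per-vdev / per-pool write rates if the VM storage is on ZFS
zpool iostat -v 5

# Lifetime host writes as reported by the drive itself
smartctl -A /dev/sdX | grep -Ei 'lbas_written|host_writes'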
 

dr0

New Member
Aug 29, 2019
4
0
1
Ukraine
What is your wear-leveling (#173) raw value, in decimal? And how much free space was on the drive while it was being written to intensively?
 

Jarvar

Member
Aug 27, 2019
40
1
8
The value for #173 is 51. I'm not sure how much free space was on the disk, but right now it is 78% full, as shown in the Proxmox GUI under Summary. I wouldn't say the use is that intensive, just more intensive than normal for me: I created VMs, deleted VMs, cloned them, backed them up, and moved disks from local storage to the drive and vice versa.
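As an aside, the fill level can also be read from the CLI rather than the GUI; a minimal sketch, where rpool stands in for your pool name (it is the Proxmox default):

Code:
# Capacity, fragmentation and free space per pool
zpool list

# Space usage per dataset / zvol
zfs list -o name,used,avail,refer -r rpool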
I know these are not enterprise-level drives, but I do find it odd.
I have the old server, which was enterprise level, that I took out of use and now run as a home lab, and its drives are at 99% life left despite having been running since May 2018.
Here is a snapshot of the output from smartctl -a from the shell.
 

Attachments

dr0

New Member
Aug 29, 2019
4
0
1
Ukraine
maybe a few extra days and the drive shows 94% life left
Judging by the attachment you added, it's at 97% now.

I wouldn't worry too much about these numbers anyway. At least on the previous firmware version (M3CR022), the drive would reset this value to 0% after reaching 255%.

[Attachment: 202.png]

In a torture test it sustained more than 6000 P/E cycles, and yours is only at 51 now.

[Attachment: 173.png]

The first errors started to occur after 5300 P/E cycles; after 6400 P/E cycles the drive lost the ability to write new data.

[Attachment: 196.png]

PS: the more free space you have on your drive, the more data you'll be able to write per P/E cycle.
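To put rough numbers on that (purely illustrative, using the ~6000-cycle figure from the torture test above and the raw value of 51 from the attachment; the write-amplification factor of 2 is an arbitrary assumption):

Code:
# Share of the tested endurance already consumed (attribute 173 = average block erase count)
echo "51 / 6000 * 100" | bc -l    # ~0.85 %

# Host writes scale inversely with write amplification:
# 1 TB capacity * 6000 P/E cycles / WAF 2 -> ~3000 TB of host writes
echo "1 * 6000 / 2" | bc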
 

Jarvar

Member
Aug 27, 2019
40
1
8
Judging by the attachment you added, it's at 97% now.

I wouldn't worry too much about these numbers anyway. At least on the previous firmware version (M3CR022), the drive would reset this value to 0% after reaching 255%.

PS: the more free space you have on your drive, the more data you'll be able to write per P/E cycle.
Thanks, yeah, I mixed it up. The 94% left is on my home personal computer, which I have used extensively to rip DVDs and transcode them into MKV format. It has 1215 power-on hours; SMART 246 shows 17941354390 and 173 shows 103 erases, in comparison.

The MX500 inside the Proxmox server has only 466 power-on hours in comparison. The ADATA XPG SX8200 has 851 hours and 7.6 TB written now.
I hope it tapers off, but I'll continue to monitor it. I am just trying to figure out what is causing these high write numbers.

I have another homelab setup using the old office server hardware, which uses Intel S4500 and S4600 drives from Dell. I don't understand how those write numbers are so low, but then again they are enterprise-level drives with 12,615 power-on hours, yet the LBA writes on SMART 241 are very low.
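To compare those counters across vendors, the raw values have to be converted to bytes first; a sketch, assuming the raw value counts 512-byte LBAs (some Intel models report attribute 241 in 32 MiB units instead, so check the attribute name in the smartctl output before converting):

Code:
# Total LBAs written -> terabytes, assuming 512-byte sectors
echo "17941354390 * 512 / 10^12" | bc -l    # roughly 9.19 TB for the value quoted above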
 

Jarvar

Member
Aug 27, 2019
40
1
8
I checked, I have the latest firmware M3CR023 on all of my MX500s.

Another thing: I deleted some extra VMs that I wasn't using, to keep the free space at a better level, but I have heard of some people over-provisioning by partitioning only a smaller portion of the drive when installing. Do you think it makes a difference, or is simply not using up the whole drive okay?

I don't think I could let my drives reach 255%; I would have to swap the drives out way before that. This is in a small dental office at the moment, so they need the unit to be operational.
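On the over-provisioning question: the usual manual approach is to leave part of the drive unpartitioned (and trimmed) so the controller always has spare area, which is the same idea as never filling the drive, just enforced at partition time. A destructive sketch for an empty 1 TB drive (device name, size and partition type are examples only):

Code:
# Mark the whole drive as free first (wipes everything!)
blkdiscard /dev/sdX

# Partition only ~90% of the drive, leaving the rest as spare area
sgdisk --zap-all /dev/sdX
sgdisk -n 1:0:+900G -t 1:BF01 /dev/sdX    # BF01 = Solaris/ZFS partition type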
 

dr0

New Member
Aug 29, 2019
4
0
1
Ukraine
A different amount of written data per P/E cycle can be due to write amplification, which can differ significantly between SSD brands and models.

As long as TRIM is enabled and working correctly, your drive should treat free space on it the same way as space that is unallocated by the file system.

Nothing bad will happen after reaching 255%. This number does not represent real drive health and is used by the manufacturer for warranty purposes. Just do your backups and pay attention to #1, #180, #196.
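A few quick checks for the TRIM side of that (pool and device names are examples; rpool is the Proxmox default):

Code:
# Does the kernel consider the device discard-capable?
lsblk --discard /dev/sdX

# One-off TRIM of a conventional mounted filesystem
fstrim -v /

# ZFS on Linux 0.8 and later can TRIM natively
zpool get autotrim rpool
zpool set autotrim=on rpool
zpool trim rpool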
 

Jarvar

Member
Aug 27, 2019
40
1
8
A different amount of written data per P/E cycle can be due to write amplification, which can differ significantly between SSD brands and models.

As long as TRIM is enabled and working correctly, your drive should treat free space on it the same way as space that is unallocated by the file system.

Nothing bad will happen after reaching 255%. This number does not represent real drive health and is used by the manufacturer for warranty purposes. Just do your backups and pay attention to #1, #180, #196.
I don't know if it makes a difference, but I set up an NVMe and a 2.5" SATA SSD as a ZFS RAID 1 mirror on an Intel NUC7i7DNHE. ZFS does not have TRIM support. I know RAID is usually set up with devices that at least match in type, but there wasn't any more expansion capacity.
The ASRock A300 looks interesting; it has dual NVMe and dual SATA ports, but it lacks the out-of-band management function that I would prefer.
 
