Samsung 990 Evo SSD: FAILED status

simrsta

Member
Oct 25, 2022
Hello everyone,
The server uses a Samsung 990 SSD, but it shows FAILED status, and this keeps repeating: when the server is restarted the status returns to PASSED, but over time it fails again. Why do you think that is?

Is there a way to upgrade the SSD firmware via Proxmox?

Will upgrading the firmware lose the data on the SSD?
 

Attachments

  • proxmoxxxxxxxx.jpg (52.1 KB)
Firmware upgrades are always provided by the manufacturer.
You would have to find out for yourself whether Samsung offers a new firmware for download.
A firmware update never deletes data unless the SSD is defective and can no longer be read.
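
If it helps, you can at least check the currently installed firmware revision straight from the Proxmox shell. A minimal sketch, assuming your drive shows up as /dev/nvme0 (smartctl ships with PVE; nvme-cli may need an apt install nvme-cli first):

    # model, serial and firmware revision of the drive
    smartctl -i /dev/nvme0

    # alternative: list all NVMe drives with their FW revision
    nvme list

Compare the reported revision against whatever Samsung currently offers for download.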
 
> Firmware upgrades are always provided by the manufacturer.
> You would have to find out for yourself whether Samsung offers a new firmware for download.
> A firmware update never deletes data unless the SSD is defective and can no longer be read.
I mean, is there a direct way to upgrade the Samsung SSD via the Proxmox CLI?

Or do we have to turn off the server, remove the SSD, and upgrade the firmware?
 
> The server uses a Samsung 990 SSD, but it shows FAILED status, and this keeps repeating.

SMART data is to be treated as a warning indicator & not as a reliable analysis.
However, if you value your data, I would replace both those NVMe's.

It also appears like the PVE host OS runs on those drives with ZFS (a mirror of some sort?). I don't know how long they have been running, but it is not surprising that they are starting to fail, given the overheads of the OS itself (PVE is not light) coupled with the ZFS RAID.

YOU NEED TO MAKE SURE YOU HAVE BACKUPS OF ALL VMS & LXCS.
ALSO MAKE SURE YOU HAVE BACKUPS OF THE NECESSARY PVE OS FILES/DIRS IN CASE YOU NEED TO RECREATE THE PVE INSTANCE.
NOTES ON YOUR SETUP/S ARE ALSO GOING TO BE YOUR FRIENDS!

Good luck.

EDIT: If your PVE is a data-critical environment, you may want to invest in some enterprise-grade drive/s.

EDIT2: Re-looking at the provided image, it does not appear that you run the OS on that drive.
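
For the backup part above, a minimal sketch of what that can look like from the PVE shell; the vmid 100 and the storage name "local" below are placeholders for your own:

    # full snapshot-mode backup of guest 100 to the storage named "local"
    vzdump 100 --storage local --mode snapshot --compress zstd

    # copy the PVE config (pmxcfs) somewhere off the node
    tar czf /root/pve-etc-backup.tar.gz /etc/pve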
 
> I mean, is there a direct way to upgrade the Samsung SSD via the Proxmox CLI?
>
> Or do we have to turn off the server, remove the SSD, and upgrade the firmware?
This always depends on the manufacturer's upgrade package.
But even if the manufacturer allows an upgrade during operation, you will need to reboot afterwards for the new firmware to become active.
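
For completeness, since the question was about the Proxmox CLI: nvme-cli implements the generic NVMe firmware download/commit commands, so in principle the update can be done live from the shell, followed by the reboot mentioned above. The catch is getting a raw firmware image out of Samsung's package in the first place, since they distribute an ISO rather than a bare image; the firmware.bin name below is a placeholder. A hedged sketch:

    # show the firmware slots and the currently active revision
    nvme fw-log /dev/nvme0

    # transfer the image to the controller, then commit it
    # (action 1 = install and activate on next reset)
    nvme fw-download /dev/nvme0 --fw=firmware.bin
    nvme fw-commit /dev/nvme0 --slot=0 --action=1

If you cannot extract a raw image, the vendor's bootable ISO route discussed below is the safer option.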
 
> Hello everyone,
> The server uses a Samsung 990 SSD, but it shows FAILED status, and this keeps repeating: when the server is restarted the status returns to PASSED, but over time it fails again. Why do you think that is?
>
> Is there a way to upgrade the SSD firmware via Proxmox?
>
> Will upgrading the firmware lose the data on the SSD?
I have the exact same issues with 990 Pro's.
The whole 990 line is simply crap; I replace them every 2-3 months.
I have 4 of them for a ZFS metadata + small-files cache, and the 4 that are left are stable. They have now passed 4 months of runtime without failing.
But to get that far, I had to replace drives 7 or 8 times.
So there's your statistic: out of 12 990 Pro's total, 4 are okay now.

This never happened with the 980 or any previous consumer Samsung SSDs/NVMes.
Just the 990s are crap.

The thing with the restart is that they get a power cycle and start to work again for some time, but in my case a week at most, until they fail again and need another restart.
However, that's because they are simply crap.
No other SSDs/NVMes are as unreliable as the 990 Pro/Evo NVMe drives.
Put differently: basically no other NVMe has ever failed here, and I have a lot of servers, 12 in total, 2 of them NVMe-only servers, but with enterprise NVMe drives.

Cheers

PS: Most of the ones that failed did so at an early stage, at around 15 TB written. All failed before 50 TB written.
The ones that are left (4) have survived 40 TB so far, so I hope I now have the good ones out of those 12.
These are all 2 TB 990 Pro's, used in that one particular server.
I only use them in one server, since I found out they are the worst crap.
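
For anyone wanting to compare against these numbers: the TB-written figure comes straight out of the NVMe SMART log, where "Data Units Written" is counted in units of 1000 x 512 bytes. A quick sketch from the PVE shell (the unit count below is made up for the example):

    # dump the SMART log; look at data_units_written and critical_warning
    nvme smart-log /dev/nvme0

    # convert data units to TB written, e.g. for 30,000,000 units:
    echo $((30000000 * 512000 / 1000**4))   # prints 15, i.e. ~15 TB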
 
> I have the exact same issues with 990 Pro's.
> The whole 990 line is simply crap; I replace them every 2-3 months. [...]
All consumer SSDs are not suitable for use with ZFS.
 
Hello,

First I would explore the firmware question, for the following reasons (before throwing the 990 Pro's in the trash ...):
- the wearout indicator is only 1-2 % and the FAILED status is not permanent, which probably means the (supposed) failure condition is not permanent
- a temperature above some preset warning threshold after a long-lasting high load would fit this condition, especially for the non-heatsink version of the drives
- this hypothesis is stronger for recent PCIe gen 5 and high-end PCIe gen 4 drives, which the 990 Pro line belongs to
- you could check with the "Show S.M.A.R.T. values" button to maybe confirm this hypothesis (but it's optional, please read below)
- the Samsung 990 Pro (in your case, but also the 980 Pro) was sadly known in 2023 for various problems related to (older versions of) its firmware
- including misinterpretation of SMART data, temperature or wearout by the first firmware versions, leading to numerous "defective" SSDs
- it seems the latest firmware solved these problems, at least for the SSDs that were not nearly dead
- firmware from Samsung is available as an ISO that you can boot from (with Ventoy for example; I recommend it if you don't already know it)
- a quick search lands me on THIS page from Samsung, where you can download the ISO in the Firmware section (below the annoying Magician software)
- I have done a dozen firmware updates in the past with those ISOs on 800 and 900 series SSDs, which is way simpler than moving them to a Windows box

With a little touch of humor I would add: "It's not that consumer SSDs are unsuitable for ZFS; it's that ZFS is less suitable as local storage for PVE nodes" ... also based on my (humble) experience going back and forth between EXT4 and ZFS on consumer SSDs and some (simple) tests like CrystalDiskMark (up to 3x more write performance with EXT4 + qcow2 versus ZFS ...).

Hope it helps !
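
If someone wants to redo that EXT4-vs-ZFS comparison directly on a PVE host instead of with CrystalDiskMark in a guest, fio (apt install fio) is the usual tool. A minimal sketch, with the target path as a placeholder, to be run once on an EXT4 mount and once on a ZFS dataset (O_DIRECT is left out because not every ZFS version supports it):

    # 60-second sequential write test, 1M blocks, flushed at the end
    fio --name=seqwrite --filename=/mnt/test/fio.img --size=4G \
        --rw=write --bs=1M --ioengine=libaio --iodepth=8 \
        --runtime=60 --time_based --end_fsync=1 --group_reporting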
 
> All consumer SSDs are not suitable for use with ZFS.

Welllllll.... there's maybe some exceptions to "all" here. At least for homelab.

Can only speak for myself, but the Lexar 1TB NM790 NVMe that I've had running 24/7 in my Qotom PVE box since Feb is still at 0% wear, and reasonably fast with ZFS. There are some fairly easy mitigations for SSD wear: turn off cluster services for single nodes, turn off atime everywhere (including inside VM guests), zram, log2ram, and I recently heard about folder2ram. And you can leave extra unused space at the end of the disk.

But still, don't buy stuff like Crucial BX - or QLC crap with low TBW ratings.

You can also relegate vm disks that don't need ssd-speed to spinning disk. All depends on how good your sysadmin skillz are, and what you can afford.
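
To make two of those mitigations concrete, here is roughly what they look like on a single (non-clustered) PVE node; rpool is the installer's default pool name, adjust to yours:

    # stop the HA services, which cause steady background writes
    # (only do this on a standalone node that doesn't use HA)
    systemctl disable --now pve-ha-lrm pve-ha-crm

    # turn off access-time updates for the whole pool
    # (child datasets inherit the setting)
    zfs set atime=off rpool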
 
IMHO this type of mitigation is not of great interest for most "home" labbers (versus "DC" labbers).

By "home" labbers - with quote to emphasize - I mean those that use consumer grade hardware for their lab to learn skill mainly in the software area.
For me a "good portion" of them, for cost reasons and/or also KISS reasons, don't want to tinker with the low-level stack : server HW, SAS drive, hypervisor tuning, ... thus would not want to play with the tools you mentionned.
I guess there are of more interest for those I call "DC" labbers that are more prone to search any better option in the low-level area, going to benchmark different ZFS pool layout for a given load or playing with iSCSI pultipathing to test redudancy.

Maybe my opinion on @simrsta is biased solely by the fact he use a consumer SSD and asked for firmware instructions, but I guess he is more of what I call the "home" labbers.

With kind regards,
 
Hi,

I also have two of them in my workstation. Let me guess: it's the variant with the heat sink?
I had the same problem at the start: SMART reported a FAILURE because of high temperature (and once reported, the temperature reading stays high until a reboot).
It is a firmware problem; after upgrading to the latest firmware they have been running stable for a few months now. So don't panic, they are not dead, they just need a firmware update.

With kind regards

PS: Firmware 4B2QJXD7 is the latest for my models.
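
If you want to check for this on your own drives from the shell: the temperature and the flag that makes SMART report FAILED are both in the NVMe SMART log. A sketch, assuming the drive is /dev/nvme0:

    # critical_warning != 0 is what gets reported as FAILED;
    # bit 1 (0x2) means a temperature threshold was exceeded
    nvme smart-log /dev/nvme0 | grep -Ei "critical_warning|temperature"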
 
> SMART data is to be treated as a warning indicator & not as a reliable analysis.
> However, if you value your data, I would replace both those NVMe's. [...]
> EDIT2: Re-looking at the provided image, it does not appear that you run the OS on that drive.
I just installed the two NVMes in March 2024.

I read on the Proxmox forum that many people said there was a firmware bug and we just had to update the firmware, so I asked if there was a way other than having to remove the drives from the server and then update them on Windows.

Yes, the OS is installed on other drives, a separate RAID of 256 GB SSDs only.
 
> [...] It is a firmware problem; after upgrading to the latest firmware they have been running stable for a few months now. So don't panic, they are not dead, they just need a firmware update. [...]
How do I upgrade the firmware with the Proxmox CLI?
Or does the drive have to be removed and upgraded on Windows with Samsung Magician?

After upgrading the firmware, is the data safe?
 
With all due respect to others' experience and recommendations, which seem perfectly doable but overcomplicated compared to the ISO + Ventoy method ...

- take a free / empty USB key
- install Ventoy on it -> official SITE
- copy onto it any number of ISOs you may want to boot in the future: PVE install, Linux install, SSD firmware updates when available as an ISO like Samsung's
- boot from said USB key and select the Samsung ISO
- follow the on-screen menu
- Done!

Do this on your server with all storage unplugged except the 990 Pro's.

EDIT: as others pointed out, you can never be totally sure, so ... BACKUPS! :)
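
In case the Ventoy part is new to anyone, this is roughly what it looks like from a Linux box; the version number, /dev/sdX and the ISO file name below are placeholders, and the install step wipes the USB key:

    # unpack the Ventoy release and install it onto the USB key
    tar xzf ventoy-x.y.z-linux.tar.gz && cd ventoy-x.y.z
    sudo sh Ventoy2Disk.sh -i /dev/sdX        # WARNING: wipes /dev/sdX

    # then copy the Samsung ISO onto the key's first (exFAT) partition
    sudo mount /dev/sdX1 /mnt
    sudo cp Samsung_990_firmware.iso /mnt/ && sudo umount /mnt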
 
> With all due respect to others' experience and recommendations, which seem perfectly doable but overcomplicated compared to the ISO + Ventoy method ... [...]
> EDIT: as others pointed out, you can never be totally sure, so ... BACKUPS! :)
OK, thanks for the information.
I already back up every day to a Synology NAS.
 
> First I would explore the firmware question, for the following reasons (before throwing the 990 Pro's in the trash ...) [...]
> With a little touch of humor I would add: "It's not that consumer SSDs are unsuitable for ZFS; it's that ZFS is less suitable as local storage for PVE nodes" ...
You describe basically exactly the issues I have here with 990 Pro's.

"explore the firmware question"
-> The newest firmware doesn't change anything; they maybe fail slightly later. I thought updating would help, too.

"indicator is only 1-2 %"
-> Correct here as well; they still fail.

"temperature higher than some preset warning"
-> Impossible here, since they sit on a bifurcation card with its own fan and a gigantic heatsink for the SSDs. I even used Kryonaut paste between the chips and the heatsink.

As for everything else: 980 Pro's never had such issues and are rock-stable. 970 Evo Plus are rock-stable as well.
My favourites are the 870 EVO SSDs; performance-wise they are a bit slow, extremely slow to be honest, but none of them has ever died in our company. We use them in an Open-E JovianDSS storage cluster.
That is basically a ZFS HA storage solution over iSCSI. There are 48x 870 EVO 2TB drives in it.

So across all the servers I have, private or in the company, I have only stumbled over the 990 series and the Micron MX500 as being crap.
For the Microns it's even worse: when the MX500 first came out they were TLC-based and almost indestructible, but they changed to QLC mid-production, and the very same MX500 model became an absolutely terrible piece of crap.
You can only distinguish the good ones from the bad ones by the firmware version: the good ones have the 0010 firmware, the bad ones anything from 0024 up.

We encountered this while building a new server and wondering why the same Micron MX500 drives were dying weekly. It took forever to find the reason, but after disassembling an old drive and a new drive it was absolutely clear!

Here is a picture; on the left side are the old MX500's with the 0010 firmware, and on the right is a new MX500 with the 0024 firmware.

mx500.jpg

The worst part is that you don't know this beforehand. If you have had good experience with a product and you buy the same product again, you expect to get the same thing. But in this case Micron sells those new MX500 SSDs as the same model, and they are pure crap.

Cheers

PS: Both are genuine and both are MX500 2TB. The left ones have never died; the right ones die after max 2 weeks in our company.
But in the meantime we have exchanged all 48 of those MX500's with Samsung 870 EVO's, simply because you can't get the old MX500 anymore, so we couldn't replace them long-term and they had to go.
 
