Proxmox destroyed SSD?

Superspyi

New Member
Feb 21, 2023
Hey guys. I posted this on the Proxmox subreddit and someone suggested I post here as well. I set up my Proxmox server a few months ago, around December 2022, and I've started running into issues: I can't access the GUI, and my VMs, containers, and host all become unreachable. I looked into the logs and found IO errors, which led me to my 1TB Samsung 860 EVO 2.5" SSD. It is at 99% wear, which makes sense for the types of errors I was receiving, but I'm still unsure why it would be that high. This drive came out of my main PC, where it sat alongside 3 other 1TB SSDs; those three each have about 34k power-on hours and 5.5TB, 7.8TB, and almost 16TB written. I checked the SMART data on the Proxmox drive and it has almost 1.95PB of data written according to the "241 Total_LBAs_Written" attribute. I'm unsure how this could have happened, as it was only running for about 3 months. Any thoughts? Here's what we've looked at so far on Reddit:

  1. Not an enterprise drive - although Proxmox is only supported on enterprise drives, I still feel like something else is happening, as 1.95PB of data is insanity.
  2. I was running the pve-ha-lrm & pve-ha-crm services even though I don't use HA or clustering - I disabled these, as others said they had high disk wear while using them, but the damage is already done.
  3. I am not running ZFS as far as I'm aware - running zpool iostat gives me a "no pools available" message. I believe my main volume for my VMs was LVM.
  4. I am not running a RAID controller - just a single SSD.
  5. I believe I chose "ext4" as the filesystem when installing Proxmox.
  6. I installed Proxmox 7.3-3 but have since upgraded to 7.3-6.
  7. I have not set up any sort of metrics or monitoring systems, but I will, to see whether historical data shows anything.
  8. I only run 6-7 VMs along with 2-3 containers. The VMs are mostly Ubuntu Linux, with one Kali Linux and one Windows 11 VM. They all write to the same drive, as I was only using 1x 1TB SSD (bad practice, I know, but this was mostly homelab-type stuff), but anything I needed to keep (mostly from the Linux VMs) was sent to my Synology NAS.
  9. I am running Docker in one of my Linux servers, with Portainer, Prowlarr, Overseerr, Radarr, Sonarr, Qbittorrent, Plex, Nginx Reverse Proxy, Bazarr, HomeBridge, and a few other containers for testing.
  10. While all my VMs and containers were running I did not notice anything hitting swap, but that doesn't mean it wasn't.
I can provide logs, but right now I am limited in what I can provide, as the machine is off as much as possible while I wait for the new SSDs to come in. The system still functions but is giving me the IO errors more frequently than before. I am also planning to use 2x 1TB Samsung 980 (non-Pro due to firmware issues) NVMe SSDs for Proxmox, one as the boot drive and the other holding VMs and containers, to see if that makes any difference.
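For scale, a rough calculation (a sketch, not a diagnosis) shows what writing 1.95PB in about three months would require. The 90-day window and the 512-byte unit for Total_LBAs_Written are assumptions; the unit in particular can vary by vendor, so it is worth checking before trusting any converted figure.

```python
# Back-of-the-envelope check of the reported write volume.
# Assumptions (not from the drive's documentation): one LBA = 512 bytes,
# and the drive was in service for 90 days.
SECTOR_BYTES = 512
SECONDS_PER_DAY = 86_400


def lbas_to_bytes(total_lbas_written: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Convert a raw Total_LBAs_Written count into bytes."""
    return total_lbas_written * sector_bytes


def sustained_write_rate_mb_s(total_bytes: float, days: float) -> float:
    """Average write rate in MB/s needed to write total_bytes within `days`."""
    return total_bytes / (days * SECONDS_PER_DAY) / 1e6


reported_bytes = 1.95e15  # "almost 1.95PB" as read from SMART
rate = sustained_write_rate_mb_s(reported_bytes, days=90)
print(f"{rate:.0f} MB/s")  # prints "251 MB/s"
```

That is about 251 MB/s written around the clock, roughly half of the 500 MB/s or so that a SATA SSD can write sequentially at best. Either the workload was far heavier than expected or the counter (or its unit) is not what it seems.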
 
Not really sure what you're after. "Proxmox" doesn't write anything, the workload does. But in any case, did you ever look at the remaining write endurance 3 months ago? Occam's razor suggests you were probably already at 5% remaining write endurance.
 
I did not check the wear % when I installed Proxmox. While I suppose it is possible, I find it extremely unlikely that I would've had anything higher than 16TB written to the drive already. Let alone anything that would get me close to the almost 2PB mark.
 
So, are you suggesting it's more likely you wrote 2PB in 3 months? You might want to reexamine your assumptions.
No I'm not. I'm trying to figure out how this happened. I can guarantee that I had not written Petabytes of data to that drive as it was just sitting in my PC as an extra storage drive for years. Should I have checked the drive stats before and after proxmox installation as a baseline? Sure. Did I? No because I didn't know I needed to as all of my other drives have been fine.
 
It's not always the flash that fails from too much wear. I've had a lot of SSDs here (consumer and enterprise) that died completely and weren't even recognized by the BIOS anymore, and some that started throwing IO errors while 95-99% of their life was still left. It's often the SSD's controller that fails, without any warning, while SMART keeps reporting that everything is fine with basically no wear.
 
Defective proxmox SSD comes back to life.

My SSD always dies after a year and a half to two years. New SSD in the NUC, install everything again and then on to the next round.

After the last time, I examined the defective SSD further in Windows. No life in sight, as expected. I got no further with diskmgmt and diskpart, until I wrote a random ISO to the SSD with balenaEtcher and manually interrupted the write partway through. Then I could create a new partition with diskpart and format it. Now the SSD works without any problems (in Windows). I haven't tried it in the NUC with Proxmox yet. I now wonder what 'wear' means in Proxmox and why the SSD was no longer usable in Proxmox.
 

That's just the value from the summary of a tool in the smartmontools package [1], which reports verbatim the "Percentage Used" value defined in the NVMe spec [2] (Admin command set, p. 122):

Percentage Used: Contains a vendor specific estimate of the percentage of NVM subsystem
life used based on the actual usage and the manufacturer’s prediction of NVM life. A value of
100 indicates that the estimated endurance of the NVM in the NVM subsystem has been
consumed, but may not indicate an NVM subsystem failure. The value is allowed to exceed
100. Percentages greater than 254 shall be represented as 255. This value shall be updated
once per power-on hour (when the controller is not in a sleep state).
Refer to the JEDEC JESD218A standard for SSD device life and endurance measurement
techniques.

So you may want to tell us the brand and model and show the other values. In some respects, the nvme-cli [3] package is better for this.

[1] https://pve.proxmox.com/wiki/Disk_Health_Monitoring
[2] https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf
[3] https://packages.debian.org/bookworm/nvme-cli
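As a small illustration of the spec text quoted above, the sketch below turns a raw "Percentage Used" byte into a readable statement. The function name and the wording of the messages are made up for this example; only the thresholds (100 and 255) come from the quoted definition.

```python
def describe_percentage_used(raw: int) -> str:
    """Interpret the NVMe 'Percentage Used' field (a single byte, 0..255)."""
    if not 0 <= raw <= 255:
        raise ValueError("Percentage Used is a one-byte field (0..255)")
    if raw == 255:
        # Per the quoted text, anything above 254 is reported as 255.
        return "more than 254% of the estimated endurance used"
    if raw >= 100:
        # Estimated endurance consumed, which is not the same as a failure.
        return f"{raw}% used: estimated endurance consumed, not necessarily failed"
    return f"{raw}% of the estimated endurance used"


for value in (4, 100, 255):
    print(value, "->", describe_percentage_used(value))
```

The point of the middle branch is that a value of 100 or more is a vendor estimate running out, not proof that the drive has stopped working.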
 
These are the values of the previous SSD (Samsung 980 PRO 250GB M.2) after it stopped functioning. I do not have this information for the latest Samsung 980 PRO 1TB. The previous SSD was replaced by Samsung free of charge.

What particularly interests me is that the SSD will function normally again after it has stopped working in Proxmox. I think I saw somewhere that the SSD had entered read-only mode. This may have nothing to do with wear, but it does have the same effect: no longer usable in the current Proxmox configuration.

[Two attached images showing the drive's SMART values; not preserved]
 

The FAIL is due to the 0x09 Critical Warning in the SMART output.

Blame Samsung for this one:
https://www.guru3d.com/story/samsung-issues-new-firmware-to-prevent-dying-980-pro-ssds/

990s started on the same note if I remember well.
 
And if you want to dig deeper, you need to look at the spec. Quick guess: 0x09 = 0b1001, so bits 0 and 3 are set:

Bit 0: the available spare space has fallen below the threshold.
Bit 3: the media has been placed in read-only mode.
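The bit arithmetic above can be sketched in a few lines. The descriptions for bits 0 and 3 follow the post; those for bits 1, 2 and 4 are paraphrased from memory of the NVMe spec's Critical Warning field and should be checked against the spec itself. The function and table names are made up for this example.

```python
# Paraphrased meanings of the Critical Warning bits (verify against the spec).
CRITICAL_WARNING_BITS = {
    0: "available spare space has fallen below the threshold",
    1: "temperature is outside its threshold range",
    2: "NVM subsystem reliability has been degraded",
    3: "media has been placed in read-only mode",
    4: "volatile memory backup device has failed",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the description of every known bit set in `value`."""
    return [
        description
        for bit, description in sorted(CRITICAL_WARNING_BITS.items())
        if value & (1 << bit)
    ]


print(bin(0x09))  # prints "0b1001"
for line in decode_critical_warning(0x09):
    print("-", line)
```

For 0x09 this lists exactly the two conditions named above: low spare space and read-only media.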
 
Yes, does sound like a firmware issue. What is the firmware version of the failed drives?

The details are literally in the linked article. For 980 PROs it's the earlier versions starting with 3xxx. For the others I don't know by heart; it's all a shame really.

But your takeaway is very simple: if you RMA'd your SSD and got a replacement, you can be sure it was not because you exceeded the TBW. :) The worst thing is that, as far as I know, there was no transparency about how exactly that firmware update "fixed" it. So one has to assume it came at the expense of e.g. performance.

If I had an existing 980 PRO on the 3xxx firmware that had already been running, I would literally prefer it to die than to get new firmware that delays that death until after the warranty. If you get a new one out of the box as a replacement, it will be on the new firmware. If you buy old stock and discover it's on the old firmware (Samsung's fw update tool is abysmal), you'd better upgrade it before you put it into use.

I gave up on Samsungs after this double fiasco. The 990s were reported to "maybe" have a similar issue, etc. Who wants to keep sourcing from a product line like that?
 
I was wondering about the firmware version of the OP's failed drive, and whether it matched the versions indicated in the Samsung article...

Samsung does have over double the market share of the 2nd largest SSD maker, so statistically it's not surprising they are in the news more... Considering how many models they have, it's not surprising some have issues. I am more concerned about how difficult it is to get issues corrected, and how proactive they are at reporting issues before I experience one myself.
 
Oh I see, sorry. :)

I have had different brands over time and am not a fan of any particular one, but considering that for Samsung these models are PROs ... it was terrible PR. I suspect it's the price to pay for an in-house controller. I mean, competition is all good, but why do the 980 Pro and then the 990 Pro have virtually the same problem that a "firmware update" miraculously fixes?
 
Samsung Magician does not support the SSD in a USB adapter. I ordered an internal adapter. Hopefully I can read the firmware version and update the SSD (if necessary). I will then report the firmware version.
 

If you are referring to the previously failed SSD that went into read-only mode, it's too late for a firmware update there.

You can read the firmware version from smartctl -a /dev/nvme.... in the first section.
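If you would rather not read the text output by eye, smartctl can also emit JSON when -j is added. The sketch below parses such output; the sample is made up for illustration (reusing values mentioned in this thread), and the key names are as I understand smartctl's JSON format, so verify them against your own output.

```python
import json

# Made-up sample shaped like smartctl's JSON output for an NVMe drive.
# Key names are assumed from smartctl's JSON format; check your own output.
SAMPLE = """
{
  "model_name": "Samsung SSD 980 PRO 1TB",
  "firmware_version": "5B2QGXA7",
  "nvme_smart_health_information_log": {
    "critical_warning": 0,
    "percentage_used": 4
  }
}
"""


def summarize(smartctl_json: str) -> dict:
    """Pick the model, firmware and health fields out of smartctl JSON text."""
    data = json.loads(smartctl_json)
    health = data.get("nvme_smart_health_information_log", {})
    return {
        "model": data.get("model_name"),
        "firmware": data.get("firmware_version"),
        "critical_warning": health.get("critical_warning"),
        "percentage_used": health.get("percentage_used"),
    }


print(summarize(SAMPLE))
```

Missing keys simply come back as None, so the same function works on drives or adapters that report only part of the data.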
 
My second SSD stopped working in the same way as the first, but I didn't collect any details about the second one after it stopped working. I immediately replaced the SSD so as not to stand still for too long.

In the meantime, I do not have a Linux environment available for the second SSD and I am therefore waiting for the adapter to examine the SSD in Windows.
 

If it stopped working, it's likely in read-only mode as well. Didn't you try to RMA it?


You can live-boot: https://itsfoss.com/create-live-usb-of-ubuntu-in-windows/

But just to be clear, I cannot guarantee the USB adapter will pass the SMART values through to smartctl either.
 
My broken 980PRO turned out to be only 4% worn according to SMART. Unfortunately, I can no longer find out exactly what the reason for the malfunction was, because I immediately replaced it. It was very similar to the failure of the previous SSD, which certainly encountered the read-only problem. The firmware is now 5B2QGXA7, it may have been a different version during the crash. I assume that this SSD can be used again in Proxmox. I hope that my Core Parts Single Level Cell SSD will perform better (in terms of usage time and no read-only problems).
 
