Incorrect drive readings?

MeeM_Kade

New Member
Aug 3, 2024
Yesterday while I was doing something, I got an email notification from my Proxmox cluster saying that my second node, an R720 filled to the top with 2.5" SSDs, had failing drives. Upon opening said email, I saw this:

[Screenshot: SMART error notification email]
This wasn't the only email I got, either; others flagged drives like /dev/sdg, /dev/sda, /dev/sdh, and a lot of the 240 GB drives (I'm not sure if any of my 940 GB SSDs showed up), and the sector counts it reported seemed absurdly high (216 trillion sectors??).
Is Proxmox reading my drives' SMART data correctly? Are all of my drives failing? This server was bought used, as were the drives, but the wearout shown in Proxmox is 100% and hasn't dropped, so I assume they're fine. I feel like this is a reading error, because I recently put 2 brand-new IronWolf hard drives into a different server and they immediately showed as bad, but it turned out Proxmox was reading their data wrong. I don't know if that's the case here; if it is, is there a way to make Proxmox read it correctly?
I also ran multiple SMART self-tests on the drives yesterday and they all reported back as passed.
Here is a video of what one of the drives' S.M.A.R.T. data looks like in Proxmox:
https://cdn.meemkade.com/u/36378530-eb42-499a-b6f2-d8d6df5408ad.mp4
The ZFS pool the drives are in also shows them all as OK and it hasn't degraded.
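(For anyone who wants to check the same things from the shell, these are the sort of commands involved; /dev/sdb and "tank" are placeholders for one of the affected drives and my pool name.)
Code:
# overall SMART verdict plus the self-test log
smartctl -H /dev/sdb
smartctl -l selftest /dev/sdb
# full attribute table (reallocated sectors, offline uncorrectable, etc.)
smartctl -A /dev/sdb
# pool health as ZFS sees it
zpool status -v tank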
 
No, if Proxmox says 100% wear-out, I believe they are "worn out", as you would expect from 11-year-old hardware. The problem with SMART is that the values it returns are not necessarily standardized, so presumably that class of error did happen, but the raw values themselves are close to useless. Especially on these early cheap SSDs, the details you get can be all over the place; the vendor's goal is for you to download their ad-laden crapware, which I'm sure only works on Windows 7, to 'decode' that your disk is broken. For Seagate there are online decoders; for "Edge" with a Sandforce controller, probably not.

SMART tests are useless; I have spinning drives that won't load and are ticking their heads, yet they still pass SMART. Do a ZFS scrub to make sure your data is readable. As long as it still has functional blocks, an SSD will attempt to relocate and repair the data to a block that hasn't worn. All the SMART value says is that the number of usable blocks has been reduced. Eventually you'll run out completely, and then the entire disk will go read-only, lock itself, or start returning garbage, depending on firmware.
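A minimal sketch, assuming your pool is named "tank":
Code:
# start a scrub; it re-reads every block in the background
zpool scrub tank
# watch progress and any read/checksum errors it finds
zpool status -v tank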
 
My Proxmox wearout values for SSDs are inconsistent; for example, some start at 0 for a new SSD and count up, and some count down from 100.
If they were actually at 100% wearout, I'm pretty sure all of my data would be corrupted by now.
Also, how come this is happening to every single SSD all at once?
 
Again, the values come from SMART. If an 11-year-old SSD shows 0% wearout, then the wearout levels are untrustworthy.
Just because an SSD is worn out doesn't mean the data gets corrupted. Wearout is just the vendor's 'warranty has expired' marker, like 10 years/100k miles on your car, whichever comes first; an SSD like yours may be rated for something like 3 years or 100 TB written, whichever comes first. The percentage is just how much has been written over the lifetime of the SSD in relation to its total warranty TBW. Does that mean failure at 100% or after 3 years? Not necessarily, but the chances go up exponentially.

Why do the values show up now? Again, likely because they have exceeded some internal counter which the firmware is now reporting. Given that you are seeing those reports, and the age of the devices, my suggestion would be to replace them if you care about your data/uptime. Again, without vendor input on what these numbers mean, we don't really know how much life you have 'left', if any.
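If you want to sanity-check that percentage yourself, something like this works, assuming the drive exposes a host-writes attribute (often Total_LBAs_Written, attribute 241, though the name and units vary by vendor) and you know or estimate the rated TBW:
Code:
# raw host-writes counter; attribute name varies per vendor
smartctl -A /dev/sdb | grep -i lbas_written
# if the raw value counts 512-byte LBAs:
#   TB written  ~ raw_value * 512 / 10^12
#   wearout %   ~ TB written / rated TBW * 100   (rated TBW is an assumption, e.g. ~70 TBW for a small 240 GB drive)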
 
I think the main issue with your explanation, from what I'm understanding, is that you're saying the drives are so old that the SMART data is bad now. Yes, from what I'm getting from your explanation the model was released 11 years ago, but that's not when mine were made. Mine were made sometime in 2019 according to the sticker, so I really doubt I have 11-year-old SSDs (in terms of their manufacture date).
Also, once again, why would all of them, specifically the 240 GB ones, fail at once? One or two failing would make sense, but all eight??? This server, once again, was used by some sort of company before I got it, so I really doubt they would have bought SSDs that old, and they would have thrown them out if they got the same errors that I did.
I really doubt this is a drive issue and think it's more of an SSD SMART data reading issue, since, once again, those 2x 4 TB IronWolf drives were brand new yet their SMART data was screwed up too (and ~8 months later they still work perfectly fine; Proxmox was just reading their data wrong).
(P.S.: Said hard drives and SSDs read perfectly fine in CDISK info because it understands how to read them.)
 
No, what I'm saying is that the drives have been written to so much that the vendor can't provide any further warranty of their proper function.

Those drives are basically re-badged OCZ Vertex 3s. I used to have a set of those; they were absolutely horrid to work with, threw SMART errors, dropped out of arrays, etc. After many firmware releases they got somewhat usable.

If they have a label on them from 2019, that does not mean they were manufactured then; they may have been refurbished in 2019, which happens often with those off-brands. E.g., Water Panther does that too: they get old disks, test them and reflash the firmware, sometimes reset the SMART values, and put a new label over them, but they are really "used" Kioxia or WD or something else under the hood. A Sandforce controller "new" in 2019? Sandforce ceased to exist in 2012 when it was taken over by LSI (which became Avago, which became Broadcom), so it would be hard to say you have a "new" Sandforce-controller SSD 8 years after that model came to market. According to all the documentation, that specific brand and model was released in 2011 and discontinued a few years later; by 2016 it was already listed on their website (Boost Server Pro with Power Fail) as discontinued.

As to why you get the errors now: again, maybe your alerts weren't working before, or maybe the drives have been written to roughly equally and all went over their "counters" around the same time. Perhaps the previous owner did get the same errors and chucked the server, or maybe you flashed firmware and reset some counters all at the same time. The errors themselves do not seem to indicate any connectivity issues (Offline uncorrectable means bad blocks), but that is assuming the drive has proper SMART reporting. I have 21*7 NVMe drives; they are all within 5-10% of each other when it comes to writes (~160 TB/piece), and that is after ~24 months in operation. If I have my SMART reporting e-mail set to weekly, those counters will all go over their 'maximum' at about the same time.

Now, as I said, you cannot stare yourself blind at SMART error reporting. Something triggered it, and it's not the Proxmox software; all Proxmox knows is what the drive is reporting. That does not mean your drives will 'die' tomorrow, or even all at the same time. All it means is that the manufacturer is communicating that something is going on that exceeded 'their' threshold. What that 'something' is, is as good a guess as any; I tend to trust that a SMART error means the disk is going to go at some point in the future.

If you have connectivity or controller issues across all the drives simultaneously, that could potentially be an issue, but then you should also see 'other' issues like crashes, dmesg errors, timeouts, etc. If you see other disk-related issues and you don't think it's your SSDs, then you should investigate your log files and see if you can find anything else.
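Something along these lines is usually enough to spot link or controller trouble:
Code:
# kernel messages about ATA resets, timeouts or I/O errors
dmesg | grep -iE 'ata[0-9]|i/o error|link reset|timeout'
# kernel-only journal entries at warning level or worse
journalctl -k -p warning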
 
So what do I do from here? Do I continue to use the SSDs? Do I buy new ones? I'm not really sure what I should do.
 
Remove one of them, put it into another computer and try to examine/confirm the actual status there.
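For example, with smartmontools installed on that machine (replace /dev/sdX with whatever the disk shows up as there):
Code:
# everything the drive reports: attributes, error logs, self-test history
smartctl -x /dev/sdX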
 
I personally wouldn't trust them on age alone; whether it is 2019 or 2011, that's at least 5 years ago, and the Sandforce wasn't trustworthy to begin with (it bankrupted OCZ). 256 GB SATA disks aren't expensive: even 512 GB is just $25, and a 512 GB DC edition from Intel is $75.

Again, if you're a hobbyist, have backups, and don't have uptime requirements, you may say "meh", but for business use you might even suggest replacing the whole server to your boss.
 
I would replace them, but I'm not really in a situation right now to be spending money on disks. I don't have an income at the moment, so I want to see if these SSDs are fine; there are 8 disks, and at around $25 each, replacing the whole array would be about $200 worth of disks.
The server that has these isn't even my primary server; it's a backup server that I got with the disks already in it, so uptime isn't that big of a priority for this one in particular. It only has 2.5" bays, so sadly I can't fit any bigger hard drives, which would be much better for long-term data storage.