ZFS rpool degraded from I/O write errors

pgesting

I have been running Proxmox on a home lab for about 4 years, but I am by no means an expert. I have two consumer-grade Samsung 860 EVOs in a mirror for my rpool. (I know that ZFS will wear them out; I'm just not sure if this is that happening or just a fluke.)

One of them appears to have been faulted from a write error. Here is what I get from zpool status:
Code:
 pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 0B in 0 days 00:35:12 with 0 errors on Sun Mar  9 00:59:15 2025
config:

    NAME                                                   STATE     READ WRITE CKSUM
    rpool                                                  DEGRADED     0     0     0
      mirror-0                                             DEGRADED     0     0     0
        ata-Samsung_SSD_860_EVO_1TB_S4X6NF0M840750T-part3  FAULTED      0    21     0  too many errors
        ata-Samsung_SSD_860_EVO_1TB_S4X6NF0MC07932E-part3  ONLINE       0     0     0

errors: No known data errors

I have attached the SMART output for the disk as well as the dmesg log from around the time I think the failure occurred.

I have looked through the forums; most advice has been "look at dmesg", but I am not sure what to look for. Should I try to clear the error and scrub to see if it goes away? Or is that a bad idea on a ZFS mirror that is also the boot device (rpool)?
 

Attachments

90%, that's a hardware failure - reported in 2 of 3 places you've shown. Time to order a new drive.

If any of that data is important to you, I recommend that you back it up immediately, or turn off that machine.

Since you have two of the same drive, and they've been in a mirror getting about the same number of writes, they're likely to fail at about the same time.

A good strategy when you're using consumer drives is to:
  1. buy different brands - or at least different models - so that they don't fail within days or hours of each other
  2. don't just look at the Linus Tech Tips benchmarks - they're great for gaming machines, but not focused on servers
    instead look for IOPS, DWPD, and MLC or TLC (avoid QLC, which wears out quickly)
  3. buy MUCH bigger than you'll need - so that you get more wear-leveling
    enterprise drives are more expensive (and last longer) because they have more, lower-density chips
  4. Consider NAS drives (e.g. WD Red)
  5. Run smartctl tests on a weekly schedule or so - that way you get more accurate data when you run a summary (a minimal cron sketch is below)
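Here's roughly what that schedule can look like with plain cron - a sketch only, the device names are just examples, adjust them for your system:
Code:
# /etc/cron.d/smart-selftests  (example devices - use your own)
# daily short self-test at 02:00
0 2 * * *  root  for d in /dev/sda /dev/sdb; do /usr/sbin/smartctl -t short "$d" >/dev/null; done
# weekly long self-test, Sundays at 03:00
0 3 * * 0  root  for d in /dev/sda /dev/sdb; do /usr/sbin/smartctl -t long "$d" >/dev/null; done
smartd can do the same thing with its -s schedule directive in /etc/smartd.conf, which is arguably the cleaner way.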
This is the site I recommend using for spec'ing out consumer drives, because it lists the stuff that consumers generally don't consider or know about: https://www.techpowerup.com/

For example, here's their review of a WD Red NAS drive and, unlike almost any other site, it actually covers DWPD (0.7 is VERY good for a consumer drive), and IOPS (not how "fast" the drive is but how "parallel" it is - kinda sorta - useful when running VMs):
https://www.techpowerup.com/ssd-specs/western-digital-red-sn700-4-tb.d1621
(caveat: I'm assuming it's accurate and they're not just making up numbers, but some of that info is hard to verify)
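As a rough sanity check (my own arithmetic, not something from the review), a DWPD figure turns into total endurance like this:
Code:
# 0.7 drive-writes-per-day on a 4 TB drive over a 5-year warranty period
echo "0.7 * 4 * 365 * 5" | bc    # ~5110 TB written (TBW)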

Also, here's what the wear-level info looks like:
(screenshots of the wear-level SMART attributes from two of my drives attached)
Those consumer drives sure do go quick...

Another pro tip: if a drive isn't an enterprise drive, and it doesn't show that the wear-level is increasing after 2-3 months, it's probably a time bomb. Swap it for a better, name-brand drive as soon as you can. The larger the better.
 
enterprise drives are more expensive (and last longer) because they have more, lower-density chips
"Enterprise class" (with PLP) gives us two features:
  1. it guarantees that data written asynchronously will actually be written if power fails - this enhances data integrity
  2. it allows the drive to use its local write-buffer even for sync writes, which would otherwise have to be committed to disk immediately - which can be slow
For a reliable and fast system I really want to have both features.
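If you want to see what point 2 means in practice, a quick fio run like this shows the sync-write rate a drive can actually sustain (the target path and sizes are only an example - point it at scratch space, not something you care about); consumer drives without PLP usually collapse here:
Code:
# 4k synchronous writes for 60 seconds against a 1 GiB test file
fio --name=syncwrite --filename=/rpool/data/fio.test --size=1G \
    --rw=write --bs=4k --fsync=1 --runtime=60 --time_based --group_reporting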
 
FAULTED 0 21 0 too many errors
I have seen this on several hosts. The first time I just look at SMART and usually go with "scrub + clear" - after physically checking cabling and connectors. The second time I prepare for a replacement.

The decision to actually replace the disk depends on the situation, of course...
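For completeness, the "scrub + clear" itself is just this (device name taken from the zpool status above):
Code:
zpool clear rpool ata-Samsung_SSD_860_EVO_1TB_S4X6NF0M840750T-part3
zpool scrub rpool
zpool status rpool    # watch whether the write errors come back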

To actually replace a ZFS disk that is a member of "rpool" and contains the boot mechanism, you need some extra steps besides "zpool replace" to establish the expected partition table and the bootloader:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_zfs_administration --> "Changing a failed bootable device"
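For reference, that section boils down to roughly the following sequence; the HEALTHY_DISK/NEW_DISK names are placeholders and this assumes the default proxmox-boot-tool layout, so double-check against the guide before running anything:
Code:
# copy the partition layout from the healthy mirror member to the new disk
sgdisk /dev/disk/by-id/HEALTHY_DISK -R /dev/disk/by-id/NEW_DISK
sgdisk -G /dev/disk/by-id/NEW_DISK       # give the copy new random GUIDs

# resilver onto partition 3 of the new disk
zpool replace -f rpool \
    ata-Samsung_SSD_860_EVO_1TB_S4X6NF0M840750T-part3 \
    /dev/disk/by-id/NEW_DISK-part3

# make the new disk bootable (partition 2 is the ESP in the default layout)
proxmox-boot-tool format /dev/disk/by-id/NEW_DISK-part2
proxmox-boot-tool init /dev/disk/by-id/NEW_DISK-part2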
 
100%, that's a hardware failure - reported in all three places you've shown. Time to order a new drive.
Can you show me what I should be looking for to see this? I don't see anything out of the ordinary in the SMART results.

buy MUCH bigger than you'll need - so that you get more wear-leveling
I am only using about 30% on these; I did that on purpose. They have lasted about 4 or 5 years.

Consider NAS drives (e.g. WD Red)
This is not my "tank". I have WD Red NAS drives for my data. This is my rpool, which is my boot device and has all my LXCs and VMs, etc.
Run smartctl tests on a weekly schedule or so - that way you get more accurate data when you run a summary
I run short tests daily and long tests weekly. You can see that in the smart data I uploaded. There are no failed tests. This is why I was asking for help as to what I should be looking for to see for sure that it is a hardware failure.
Also, here's what the wear-level info looks like:
I have been watching the wear info. It is at about 70%, so this is not unexpected.
The first time I just look at SMART and usually go with "scrub + clear" - after physically checking cabling and connectors. The second time I prepare for a replacement.

The decision to actually replace the disk depends on the situation, of course...
Okay, thanks. What do you look at in SMART? I can't see anything out of the ordinary, but I could be missing something....

To actually replace a ZFS disk that is a member of "rpool" and contains the boot mechanism, you need some extra steps besides "zpool replace" to establish the expected partition table and the bootloader:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_zfs_administration --> "Changing a failed bootable device"
Yeah, I saw this in a previous post, thanks. One thing I was also wondering: if I swap out this drive with an enterprise drive, they will be different sizes. So if I add it, will ZFS be able to reduce the size of the other drive so they can mirror? I would then replace the good one with an enterprise drive as well, but I would prefer to do them one at a time so that the mirror stays running.

I guess alternatively I could make a bunch of ZFS snapshots of all my data, send them to a backup, and start from scratch with two new drives? I would prefer not to do that, I think.
 
P.S. I may have misread the SMART log. I had skipped down to the sections that typically indicate failure and saw high numbers:

Code:
177 Wear_Leveling_Count     PO--C-   028   028   000    -    1301
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0

However, those readouts are similar-but-different on different drives. On mine, for example:

Code:
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       47

And these numbers don't actually mean much without brand-specific context - sometimes a threshold of 10 means "10% left", other times it means "10% used", and other times it means "0x10, see the manual for whatever that means".

It also says that it's passing, but depending on the drive, a SMART "Passed" may or may not mean anything at all. I've had drives that were simultaneously completely unusable and still registering as "Passed".

I would expect Samsung to report a correct status, however, so that may point to the drive actually being okay.
 
I would expect Samsung to report a correct status, however, so that may point to the drive actually being okay.
Okay, thanks. I will look up and see what those numbers indicate.

EDIT: Looks like for Samsung, the Wear_Leveling_Count normalized value starts at 100 and counts down to 0. So 28 means something like 28% of life left. This makes sense, because the wearout in Proxmox is 72% (maybe the Proxmox wearout is calculated from this value?).

Looks like my Total LBAs written works out to about 27 TB.
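(That figure is just the Total_LBAs_Written raw value multiplied by the 512-byte sector size these drives report - something like this, with /dev/sda only as an example path:)
Code:
LBAS=$(smartctl -A /dev/sda | awk '/Total_LBAs_Written/ {print $NF}')
echo "$(( LBAS * 512 / 1000000000 )) GB written"    # bytes / 10^9 -> GB, ~27000 GB here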

It hasn't used the reserved blocks yet, but from what I have read, that doesn't mean much; a consumer-grade device might not touch them until it's already failing. So you might be right, I might be living on borrowed time.
 
What do you look at in SMART? I can't see anything out of the ordinary, but I could be missing something....
Your "smart.txt" look fine - for me. Look at 5, 177, 179, 187.

However, those readouts are similar-but-different on different drives. On mine, for example:
Unfortunately they are not really well defined and are effectively vendor-specific.

Your "179 Used_Rsvd_Blk_Cnt_Tot PO--C- 100 100 010 - 0" tells me (without re-reading the documentation) there is 100 % still available. That 100 shall get lower when some Blocks are used. When it goes below 10 it counts as a potentially problematic usage.
 
I run short tests daily and long tests weekly. You can see that in the smart data I uploaded. There are no failed tests. This is why I was asking for help as to what I should be looking for to see for sure that it is a hardware failure.

I skimmed too quickly. I was seeing

Code:
Offline data collection status: (0x80)  Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:     (   0)  The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.

And I read that as "never started" and "0 runs, never been run" (when it's actually 0-is-success), and just skipped ahead to the raw values section.

On the server I typically use a script I wrote that pulls out the most relevant stuff for me, and when I want more detail I have a graphical tool that shows the raw values and interpreted values inline.
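It's nothing fancy - the idea is basically just a loop like this (rough sketch, not the actual script):
Code:
#!/bin/sh
# one short health summary per SATA disk
for d in /dev/disk/by-id/ata-*; do
    case "$d" in *-part*) continue ;; esac     # skip partition links
    echo "== $d =="
    smartctl -H -A "$d" | grep -E 'overall-health|^ *(5|177|179|187|241) '
done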

I got overconfident in thinking I remembered how to read the raw data... but I don't. Sorry about that.

See two failures, expect to see one in the third, skim over it and see things that aren't there. Confirmation bias strikes again.
 
if I swap out this drive with an enterprise drive, they will be different sizes. So if I add it, will ZFS be able to reduce the size of the other drive so they can mirror?
Shrinking a pool is somewhat difficult. You cannot attach a smaller device to an existing vdev; a vdev here means a single drive or an existing mirror.

Of course there are ways to replace a large rpool with smaller devices. But that's tedious and requires several intermediate steps.

In an incomplete nutshell: add the new devices as a new, additional mirrored vdev. Nowadays you can simply remove the old vdev - but only if the pool consists solely of mirrors. The "remove" will transfer all data to the new mirrored vdev. The "this-must-be-bootable" details still have to be handled correctly...
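In command form that would look roughly like this - the device names are placeholders, and the partitioning/bootloader steps from the admin guide still have to be done for the new disks:
Code:
# add the two new (smaller) disks as a second mirrored vdev
zpool add rpool mirror /dev/disk/by-id/NEW_DISK_A-part3 /dev/disk/by-id/NEW_DISK_B-part3

# evacuate and remove the old mirror (possible because only mirrors are involved)
zpool remove rpool mirror-0
zpool status rpool     # shows the evacuation/removal progress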
 
Shrinking a pool is somewhat difficult. You cannot attach a smaller device to an existing vdev; a vdev here means a single drive or an existing mirror.

Of course there are ways to replace a large rpool with smaller devices. But that's tedious and requires several intermediate steps.

In an incomplete nutshell: add the new devices as a new, additional mirrored vdev. Nowadays you can simply remove the old vdev - but only if the pool consists solely of mirrors. The "remove" will transfer all data to the new mirrored vdev. The "this-must-be-bootable" details still have to be handled correctly...
Thanks, this is what I was afraid of. I have been searching online but haven't found a good "here's how to do that" guide. Do you have any suggestions? I am not sure if I physically have two spare SATA ports to do this; I will have to check.
 
I have been searching online but haven't found a good "here's how to do that" guide. Do you have any suggestions?
No, not specific ones, only a generic one: have a (validated) backup before starting that adventure.

I am not sure if I physically have two spare SATA ports to do this; I will have to check.
My mentioned approach of additionally adding two drives is so valuable (from my point of view) that it may be worth buying two USB-to-SATA adapters. They are cheap nowadays. While I cannot recommend something like that for production use, they are very valuable for migration tasks like this. And this makes you realize why adding devices via "/dev/disk/by-id/..." instead of using "sdc" is good advice...
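A quick way to check which stable names you have before and after plugging the adapters in:
Code:
ls -l /dev/disk/by-id/ | grep -v -- -part    # whole-disk links only
zpool status -P rpool                        # -P prints full vdev paths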

Actually I have done exactly that (replacing a mirrored "rpool" vdev with a smaller pair) a year ago, with exactly this approach. I don't remember the reason...

It worked as it should. One of the important final steps is to actually test the bootability of both devices.