[SOLVED] ZPOOL DEGRADED - is disk really faulty?

otoman

Member
Mar 25, 2022
Hi all,

today I found out that one of my HDDs in a zpool might be faulty. The scrub is running as we speak. The status shows 12 read and 9 checksum errors. I'd already replaced one of the drives a year ago because of this. Drives are Western Digital GOLD 8004FRYZ. SMART says passed and dmesg with messages up to "notice" level shows a few errors I don't understand. Is this an actually faulty drive or a transient error? The dmesg output is attached below.



Thanks in advance!
 

Attachments

Here's the output:

  pool: GOLDR10
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Tue Jul 25 13:51:30 2023
        4.82T scanned at 332M/s, 4.50T issued at 310M/s, 5.43T total
        336K repaired, 82.78% done, 00:52:45 to go
config:

        NAME                                     STATE     READ WRITE CKSUM
        GOLDR10                                  DEGRADED     0     0     0
          mirror-0                               ONLINE       0     0     0
            ata-WDC_WD8004FRYZ-01VAEB0_VYG3W91M  ONLINE       0     0     0
            ata-WDC_WD8004FRYZ-01VAEB0_VYG1G1YM  ONLINE       0     0     0
          mirror-1                               DEGRADED     0     0     0
            ata-WDC_WD8004FRYZ-01VAEB0_VRKWNDEK  ONLINE       0     0     0
            ata-WDC_WD8004FRYZ-01VAEB0_VYG0KJ5R  FAULTED     12     0     9  too many errors

errors: No known data errors


I'll try clearing the messages after the scrub finishes and running another long SMART test, but the dmesg output worries me. As you can see, the replaced drive was from the same batch as the other 3, so that might be it.
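For anyone who wants to pick out rows like the FAULTED one above without reading the whole table, here is a minimal sketch. The function name `zpool_error_rows` is made up for illustration. It assumes the usual five-column config layout (NAME STATE READ WRITE CKSUM) from the output above and only does a plain comparison against "0".

```shell
# Hypothetical helper: reads `zpool status` text on stdin and prints
# every config row whose READ, WRITE or CKSUM counter is not "0".
zpool_error_rows() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }   # header of the config table
        $1 == "errors:"               { in_cfg = 0 }         # the table ends before this line
        in_cfg && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, $2, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Usage on a live system (pool name taken from the post above):
#   zpool status GOLDR10 | zpool_error_rows
```

Comparing the output of this before and after a scrub should make it easier to see whether the counters are still climbing.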
 
SMART tests aren't always 100% reliable when it comes to faulty drives or cables.
At least in my experience.

I've had situations where we replaced Samsung 850 Pro drives every 3 months.
But that was in a 24-drive all-SSD SAN.

However, it might just be a faulty cable, or the drive.
When you replaced the HDD last time, was it in the same bay or on the same cable?

Does a scrub increase the errors?

Does a clear + reboot + scrub produce errors again?
If so, then it's indeed probably the drive or the cable.
 
Also keep in mind that SMART runs self-tests: the disk's firmware tests itself without transferring data over the cable, backplane, or disk controller. So a SMART self-test can tell you that everything is fine while the problem is somewhere up the line, which it doesn't test.
If you clear the errors and the same disk errors again, I would switch the cables, slots, and ports to see whether it's really the disk or something else.
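One thing the drive does record about the link is interface CRC errors. On many SATA drives this is SMART attribute 199 (usually named UDMA_CRC_Error_Count), and a raw value that keeps growing usually points at the cable or backplane rather than the platters. A minimal sketch, assuming the standard `smartctl -A` table layout with the raw value in the tenth column; the function name `crc_error_count` is made up for illustration, and not every drive reports this attribute.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the raw
# value of attribute 199 (interface CRC errors), if the drive reports it.
crc_error_count() {
    # Column 1 is the attribute ID, column 10 the raw value.
    awk '$1 == 199 { print $10 }'
}

# Usage on a live system (device path is a placeholder):
#   smartctl -A /dev/sdX | crc_error_count
```

Noting the value before swapping cables or ports and checking it again after the next scrub gives a rough before/after comparison.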
 
Before I started the scrub, status said "One or more applications experienced an unrecoverable error..." And I believe the read error count was 0 but checksum was about 9. Scrubbing produced the current numbers and as it is not complete yet, it's still at that level. The faulted disk is next to the one that had been replaced so the cable might be fine. I'll try clearing and rescrubbing after this one finishes. Is there any other way to test the disk for faults other than what ZFS provides?
 
Sure, you can do a simple
dd if=/dev/urandom of=/mystorage/destination/trashfile bs=1M count=10240 status=progress
That writes 10 GiB of random data into the pool; adjust the count as needed. Without a count, dd keeps writing until the pool is full.

But it will be the same as writing files to it.
The problem is that your drive could have only a few damaged blocks, all close to each other somewhere near the beginning.
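Since bad blocks may cluster in one region, a full read of the raw device touches every block, and unlike a SMART self-test the data has to cross the cable and controller. A minimal sketch, assuming GNU dd; the function name `read_whole_disk` and the `RUN` switch are made up for illustration. Reading to /dev/null doesn't modify the disk, but on an 8 TB drive it will take many hours.

```shell
# Hypothetical helper: read every block of a device and throw the data away.
# Set RUN=echo to print the command instead of executing it.
read_whole_disk() {
    dev=$1
    # if= is the raw disk, of=/dev/null discards the data; read errors show
    # up in dd's own output and in dmesg.
    ${RUN:-} dd if="$dev" of=/dev/null bs=1M status=progress
}

# Dry run with the faulted disk's id from the status output:
#   RUN=echo read_whole_disk /dev/disk/by-id/ata-WDC_WD8004FRYZ-01VAEB0_VYG0KJ5R
```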

However, tbh, my instinct tells me that your drive is indeed damaged, especially because you said that one of those already died. So it's probably just the second one, and the WD Golds are just not as good as the name suggests.

Wish you good luck, hopefully you can exchange it through warranty.
At least here in Germany it's still possible after one year.

Cheers
 
Thanks for the advice. I also think the drive is the issue. Will let you know what happens when I do all the testing.
 
SMART tests aren't always 100% reliable when it comes to faulty drives or cables.
This is true, but only in one direction.

If a disk passes a SMART long test, it doesn't mean the disk is faultless, just that the internal CRC checks pass. If the disk FAILS a SMART long test, it's hosed and you should replace it.

ata-WDC_WD8004FRYZ-01VAEB0_VYG0KJ5R FAULTED 12 0 9 too many errors
Did you perform a SMART long test on this drive? Did it pass?
 
Hey guys,

so the weirdest thing just happened. My friend came by today to replace the faulted disk, and when I ran zpool status to check the serial number, all of a sudden there were NO ERRORS AT ALL! Status says it resilvered 1.37G on July 30th with 0 errors. There are no more read, write, or checksum errors.

Could it really have been a false positive? I didn't clear the errors after my last post, and the rescrub did increase the error count, as I said.

This is really perplexing and worrying. How reliable is my pool?

Thanks in advance!
 
Zpool status errors only go away if you clear them manually!
Otherwise there would be no point in showing historical errors, if they disappeared by themselves.

Another point: if it resilvered without errors, that only means the blocks holding your files are okay.
As for a faulty disk, I wouldn't put my hand in the fire and say that all disks are fine.

However, resilvering errors and zpool status errors are 2 separate things.
If the resilver succeeded without errors, it only means that the resilver didn't add to the zpool status error counts.

So previous zpool status errors can't disappear without a manual clear anyway, whether it resilvered with or without errors.

In the end, the clean resilver tells you nothing, and you can't draw any conclusion from it.

So you actually have 2 options:
- replace the disk now
- wait until zpool status errors appear again to be sure, and replace the disk then.

As others said, it's not easy to tell whether it's the disk, the memory, or the cables.
With SSDs at least, if they die, they usually die completely, so it's easier to tell whether it's the SSD or something else.

With HDDs, especially when errors come and go, it's harder to tell whether it's the HDD or something else.

Cheers :-)
 
I know errors shouldn't disappear on their own... but they did. Literally nobody but me uses this machine, and I had no reason to clear the errors. Before, it showed about 30 read and 40 checksum errors, and now all columns say 0. The disk that was marked as faulted twice in a row now says online. The only thing that happened was the automatic scrub every other Sunday.

The zpool history only shows my scrub, then clear and another scrub. All other entries are zpool import -c path -aN...
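To cut through the import noise in a long history, a filter along these lines might help. It's a minimal sketch: the function name `history_actions` is made up for illustration, and it assumes the usual `zpool history` line format of a timestamp followed by the command. `zpool history -i` additionally includes internally logged events, and `-l` adds the user and host for each entry.

```shell
# Hypothetical helper: reads `zpool history` output on stdin and keeps only
# the entries that change device/error state or start a scrub.
history_actions() {
    grep -E ' zpool (clear|scrub|replace|online|offline) '
}

# Usage on a live system (pool name taken from the posts above):
#   zpool history GOLDR10 | history_actions
```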

I should clarify that I don't keep the machine running 24/7. Sometimes I shut it down for the night.
My manual scrubs and clears happened a week before this last resilver.
 
