How Do I Troubleshoot If I'm At Fault, Or The Hardware Is?

DR4GON · Mar 4, 2022

My WD Red drives keep giving the following error after approximately a year of use. I have tried replacing drives, cables, backplanes, and HBA cards. I'm at a loss.

Code:

zpool status NAS
  pool: NAS
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Fri Mar  4 17:46:27 2022
        22.5T scanned at 1.17G/s, 17.2T issued at 913M/s, 29.0T total
        1.04M repaired, 59.17% done, 0 days 03:46:40 to go
config:
        NAME                                          STATE     READ WRITE CKSUM
        NAS                                           DEGRADED     0     0     0
          raidz2-0                                    DEGRADED     0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFZX-68AWUN0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  FAULTED    114     0     0  too many errors  (repairing)
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W****  ONLINE       0     0     0
errors: No known data errors

Also, I don't understand whether this is a real error, or how to test whether the drive is actually faulty or something else has caused the error. I've tried zpool clear before and had no faults for two months, only to have a different drive give me the same "too many errors" instead. The drives pass SMART and will scrub with no faults. I replaced both drives and then had a third drive give the same error. I replaced that drive and had a fourth one do it. All the drives are less than 2 years old, some less than 1 year old, and all from different batches. People estimate I should be able to get 4-5 years, but I can barely get 1. Help?

LnxBil · Mar 4, 2022

Is it always the same slot and only one disk that fails?

apoc · Mar 4, 2022

Are you using a SAS-Expander?
Those have given me trouble, so I replaced them with pure SAS HBAs.
Maybe a software update has broken your setup over time, or after a year.

DR4GON · Mar 5, 2022

LnxBil said:
Is it always the same slot and only one disk that fails?

No. I've tried different ports in the backplane, different drives, different cables (first SAS to SATA, now SAS to Mini-SAS), and different HBAs. I've replaced the RAM, going from 24GB to 96GB. I've swapped out the two CPUs, and I even swapped cases, upgrading from a 16-bay to a 24-bay. The only consistent thing is that it's always one drive, so running RAIDZ2 is keeping my data safe, but what else can I try?

apoc said:
Are you using a SAS-Expander?
Those have given me trouble, so I replaced them with pure SAS HBAs.
Maybe a software update has broken your setup over time, or after a year.

No, I'm using 2 LSI SAS HBA cards, though with the new chassis I will have to look for a solution that allows connecting more than 16 drives to 2 HBA cards. Any suggestions if SAS expanders aren't the way to go? I don't have any spare PCIe slots, so taking out an HBA to put in something new is the only option. I don't recall updating anything that would mess with ZFS, and I don't even have chassis vibrations that I can pinpoint it to.

Today I woke up to a completed scrub. Could this file be the bane of my existence? I've never had it tell me there is actually an error, so [zpool status NAS -v] has never shown me any information before:

Code:

zpool status NAS -v
  pool: NAS
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 1.04M in 0 days 09:28:14 with 32 errors on Sat Mar  5 03:14:41 2022
config:
        NAME                                          STATE     READ WRITE CKSUM
        NAS                                           DEGRADED     0     0     0
          raidz2-0                                    DEGRADED     0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFZX-68AWUN0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  FAULTED    114     0     0  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
            ata-WDC_WD40EFRX-68N32N0_WD-W*  DEGRADED     0     0    82  too many errors
errors: Permanent errors have been detected in the following files:
        /mnt/NAS/Videos/TV Shows/Show Name/Season 03/Totally Legit File.mkv

I'm going to remove that file and do zpool clear NAS and then zpool scrub NAS. I'll let you know how it goes.
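When reading output like the above, it can help to separate "one disk reports read errors" from "every disk shows the same checksum count". The following is a minimal sketch, not an official ZFS tool: it parses the config table of a saved `zpool status` output and reports both patterns. The sample rows use shortened placeholder device names, and the interpretation in the comments (an identical non-zero CKSUM count on many members may mean the errors were charged vdev-wide rather than to one disk) is an assumption, not something this thread confirms.

```python
import re
from collections import Counter

# Rows shaped like the config section above; device names are placeholders.
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        NAS         DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            disk-a  DEGRADED     0     0    82  too many errors
            disk-b  DEGRADED     0     0    82  too many errors
            disk-c  FAULTED    114     0     0  too many errors
"""

# Matches "<indent><name> <state> <read> <write> <cksum>".
# Limitation: counters abbreviated by zpool (e.g. "1.2K") are not handled.
ROW = re.compile(
    r"^(\s+)(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def parse_config(text):
    """Return one dict per pool/vdev/device row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            indent, name, state, read, write, cksum = m.groups()
            rows.append({
                "indent": len(indent), "name": name, "state": state,
                "read": int(read), "write": int(write), "cksum": int(cksum),
            })
    return rows

def summarize(rows):
    """Report devices with read errors and checksum counts shared by several devices."""
    if not rows:
        return {"read_error_devices": [], "shared_cksum_counts": []}
    deepest = max(r["indent"] for r in rows)
    leaves = [r for r in rows if r["indent"] == deepest]  # the actual disks
    counts = Counter(r["cksum"] for r in leaves if r["cksum"] > 0)
    return {
        "read_error_devices": [r["name"] for r in leaves if r["read"] > 0],
        # The same non-zero count on several disks is a hint (assumption)
        # that a shared component is involved, not each disk on its own.
        "shared_cksum_counts": sorted(c for c, n in counts.items() if n > 1),
    }

if __name__ == "__main__":
    print(summarize(parse_config(SAMPLE)))
```

To try it on a real pool, save the output of `zpool status NAS` to a file and pass the file contents to `parse_config` instead of `SAMPLE`.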

DR4GON · Mar 5, 2022

Well the result was the same. I've pulled the faulted drive, and have replaced it. This time it didn't tell me a specific file contained errors, but that there was still "too many".

What else could be at fault?

apoc · Mar 5, 2022

DR4GON said:
Any suggestions if SAS Expanders aren't the way to go?

I had a ton of trouble when moving my well-working setup with an LSI 9211-8i and a SAS expander from MDADM to ZFS.
Issues were similar to yours. Very inconsistent, moving, nasty.
In the end I was tired and just tried it without expander, using a second 9211-8i and all my issues were gone. I have to admit that this was a few years ago but I am not keen to go back and retry things.

DR4GON said:
I've never had it tell me there is actually an error, so [zpool status NAS -v] has never shown me any information:

If there was no scrub to detect that issue and the file wasn't read, errors still could be on disk. That's the reason why regular scrubs are so important. Doing mine once a week on my primary system.
I don't think this is the source of your problem. I think this is a symptom, like the disks jumping around.

Sadly I can only do guesswork. I have no pointers at all :/ This is similar to my "25days uptime issue" which I just can't get around - other than doing a reboot.

I'd start documenting in detail what happens and when. Maybe you'll find a pattern. Everything I'd normally point out, you have already swapped.
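One way to keep such a log is a small CSV file with one row per fault, which can then be tallied by bay, HBA port, or any other column to look for a pattern. This is only a sketch: the column names and the example rows are made up for illustration and do not come from this thread.

```python
import csv
import io
from collections import Counter

# Hypothetical columns; adjust to whatever you can actually observe.
FIELDS = ["date", "serial", "bay", "hba_port", "error"]

def append_event(stream, event):
    """Append one fault record, writing the header if the stream is empty."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if stream.tell() == 0:
        writer.writeheader()
    writer.writerow(event)

def tally(stream, column):
    """Count how often each value of `column` appears in the log."""
    stream.seek(0)
    return Counter(row[column] for row in csv.DictReader(stream))

if __name__ == "__main__":
    # In-memory demo; for a real log use open("faults.csv", "a+", newline="").
    log = io.StringIO()
    append_event(log, {"date": "2022-01-01", "serial": "EXAMPLE-1",
                       "bay": "3", "hba_port": "hba0-p1", "error": "READ"})
    append_event(log, {"date": "2022-02-01", "serial": "EXAMPLE-2",
                       "bay": "7", "hba_port": "hba0-p1", "error": "CKSUM"})
    append_event(log, {"date": "2022-03-01", "serial": "EXAMPLE-3",
                       "bay": "12", "hba_port": "hba1-p0", "error": "READ"})
    print(tally(log, "hba_port").most_common())
```

If one value dominates a column (the same port, the same bay, the same time of day), that column is the first place to look.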

DR4GON · Mar 6, 2022

apoc said:
[…] In the end I was tired and just tried it without expander, using a second 9211-8i and all my issues were gone. […]

I’m currently using two SAS HBAs (the model eludes me at the moment, but they look remarkably similar to the 9211-8i), but at 4 drives per port and only 4 ports, that’s my bottleneck. I need a single card with more than two ports, and the original plan was a SAS expander. Google isn’t really helping me search for this, but I’m assuming there are 3-4 port SAS HBAs?

apoc said:
[…] That's the reason why regular scrubs are so important. Doing mine once a week on my primary system.
I don't think this is the source of your problem. I think this is a symptom, like the disks jumping around.
[…]
I'd start documenting in detail what happens and when. Maybe you'll find a pattern. Everything I'd normally point out, you have already swapped.

It’s good to know that I’ve identified everything that could be at fault. I know at some point the motherboard should go, but while it’s still kicking, it seems like a waste to toss it, considering I’ll need a new CPU and ECC RAM when I make that jump, which will be a hit to the old wallet.

My last option is the power supply. It is the same age as Westmere, so 2010, and is a 2+1 redundant 750W (3x 375W). Is it worth replacing it for the lulz, or is it a case of “it works so it works”?

apoc · Mar 6, 2022

DR4GON said:
I know at some point the motherboard should go, but while it’s still kicking, it seems like a waste to toss it.

Same here

My Opteron board is reliable (aside from these reboots).

DR4GON said:
My last option is the power supply.

That could indeed be a good idea. I thought I had read that on your list.
If it is a server-grade PSU it is unlikely, but I have had a consumer PSU fail on me because the 12 V rail failed. But that should then also relate to the load of your system (the pattern I mentioned).
If your system has an IPMI board (i.e. a management interface) you can monitor the voltages on the mainboard.
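As a rough sketch of that check: `ipmitool sensor` prints pipe-separated rows (name, reading, unit, status, thresholds), so the voltage rows can be pulled out and compared against their nominal values. The sensor names and readings in the sample are hypothetical and vary by board, and the 5% tolerance is an assumption based on the commonly cited ATX limit, not a value from this thread.

```python
# Hypothetical `ipmitool sensor` excerpt; real sensor names differ per board.
SAMPLE = """\
12V              | 12.096     | Volts      | ok    | 10.144    | 10.272    | 10.784    | 12.960    | 13.280    | 13.408
5VCC             | 4.608      | Volts      | ok    | 4.246     | 4.298     | 4.480     | 5.390     | 5.546     | 5.598
CPU1 Temp        | 41.000     | degrees C  | ok    | 0.000     | 0.000     | 0.000     | 84.000    | 87.000    | 89.000
VBAT             | na         | Volts      | na    | na        | na        | na        | na        | na        | na
"""

def parse_voltages(text):
    """Return {sensor name: volts} for every row whose unit is Volts."""
    readings = {}
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) >= 3 and cols[2] == "Volts":
            try:
                readings[cols[0]] = float(cols[1])
            except ValueError:
                continue  # sensor present but not readable ("na")
    return readings

def out_of_tolerance(readings, nominal, tolerance=0.05):
    """Return sensors deviating from their nominal voltage by more than `tolerance`."""
    flagged = {}
    for name, volts in readings.items():
        if name in nominal and abs(volts - nominal[name]) > tolerance * nominal[name]:
            flagged[name] = volts
    return flagged

if __name__ == "__main__":
    volts = parse_voltages(SAMPLE)
    # Mapping of sensor name to nominal rail voltage is supplied by the user.
    print(out_of_tolerance(volts, {"12V": 12.0, "5VCC": 5.0}))
```

Sampling this periodically, especially during a scrub when all disks are busy, would show whether a rail sags under load rather than only at idle.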

DR4GON said:
but I’m assuming there are 3-4 port SAS HBA’s?

In my research I have come across such cards. However, they were rare and expensive.
Here is an example
https://www.amazon.de/LSI-MegaRAID-SAS-9201-16i-LSI00244/dp/B003UNP05O

Some people have reported that disabling NCQ for ZFS helps and even improves performance.
Others recommended to set the disk scheduler in Linux to "none". Nothing helped on my end.
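For reference, on Linux both settings are exposed under sysfs: `/sys/block/<dev>/queue/scheduler` lists the available schedulers with the active one in square brackets, and `/sys/block/<dev>/device/queue_depth` holds the queue depth, where a value of 1 effectively turns NCQ off. The sketch below only reads and interprets those files; the device name `sda` is a placeholder, and whether changing either value helps is, as noted above, unproven.

```python
import re
from pathlib import PurePosixPath

def active_scheduler(contents):
    """Return the scheduler shown in [brackets], or None if there is none."""
    m = re.search(r"\[([^\]]+)\]", contents)
    return m.group(1) if m else None

def tuning_paths(device):
    """Return the sysfs paths for the scheduler and queue depth of a block device."""
    base = PurePosixPath("/sys/block") / device
    return {
        "scheduler": base / "queue" / "scheduler",
        "queue_depth": base / "device" / "queue_depth",
    }

def ncq_effectively_disabled(queue_depth_contents):
    """A queue depth of 1 means only one command is outstanding at a time."""
    return int(queue_depth_contents.strip()) == 1

if __name__ == "__main__":
    # Example strings in the format the kernel uses; not read from a real system.
    print(active_scheduler("[mq-deadline] kyber bfq none"))
    print(tuning_paths("sda")["scheduler"])
    print(ncq_effectively_disabled("32\n"))
```

Changing the values means writing to the same files as root (for example writing `none` to the scheduler file), and the change does not persist across reboots unless it is reapplied, for instance by a udev rule.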

DR4GON · Mar 6, 2022

apoc said:
[...] If it is a server-grade PSU it is unlikely, but I have had a consumer PSU fail on me because the 12 V rail failed. But that should then also relate to the load of your system (the pattern I mentioned).
If your system has an IPMI board (i.e. a management interface) you can monitor the voltages on the mainboard. [...]

It is, and it's reading normal so I guess I'm not too worried about it then.

apoc said:
[...] In my research I have come across such cards. However, they were rare and expensive.
Here is an example
https://www.amazon.de/LSI-MegaRAID-SAS-9201-16i-LSI00244/dp/B003UNP05O [...]

I could have totally gotten away with something like this if I was using a SATA backplane:
https://www.amazon.de/MZHOU-Ports-PCIE-SATA-Card-20SATA-16x/dp/B09K3KWZ54/

I guess I'll keep looking. I'm sure there is a niche card out there that can work for a 24-bay chassis without taking up more than two PCIe slots.
