ZFS: How Common Are Errors?

Consonant1022 · Jul 25, 2023

I'm new to ZFS. This is the first pool I'm working with and it's on a server that I set up just recently. I decided to dive into ZFS rather than RAID5 based on what I had read online about data errors, so my knowledge is really basic.

I have a raidz1-0 and it's been up a month or so with some data running to it, and I see a number of READ, WRITE and CKSUM errors pop up. I've set up a cronjob to run a scrub every week, but I noticed that the errors had always been on drives 1 - 4. Those 4 are plugged into the SATA ports on my motherboard that face 3 o'clock, and the other 2 drives (with no read/write errors) are on the SATA plugs that face 6 o'clock. I thought perhaps the cables on 1 - 4 were screwed up, so I replaced them all. I even changed the power cable from the PSU for those drives. I then scrubbed and things checked out (the files that zpool status -v listed as errored seemed to be fine after these steps). After a bit more data reading and writing, I noticed a few more errors popping up.

I also noticed all 4 of those drives were the same model number (ST14000NM001G), vs the other 2 being different (ST14000NM0018). I wondered if there was some issue between that and my board, so I ordered a new one of the 0018 and did a zpool replace on one of the 001G drives. I'm now running a badblocks -svwb 4096 /dev/sda on the drive I replaced in the pool as a test (note, the drive marked FAULTED went to a FAULTED state after I had decided to replace the first drive, otherwise I would have selected that one first). The first drive in this series is now attached to a USB > SATA converter, while /dev/sda is being badblock'ed. After badblocks is done, if the drive passes, I'll replace the ZL293B4B drive.

zpool status:

Code:

  pool: nas
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Jul 24 06:03:48 2023
    20.0T scanned at 207M/s, 18.7T issued at 192M/s, 39.2T total
    0B repaired, 47.59% done, 1 days 07:06:57 to go
config:

    NAME                                   STATE     READ WRITE CKSUM
    nas                                    DEGRADED     0     0     0
      raidz1-0                             DEGRADED    55     0     0
        ata-ST14000NM0018-2H4101_ZHZ38XLQ  ONLINE       0     0    12
        ata-ST14000NM001G-2KJ103_WL202XJ9  ONLINE      28     6 10.6K
        ata-ST14000NM001G-2KJ103_ZL293B4B  FAULTED     15     8     0  too many errors
        ata-ST14000NM001G-2KJ103_ZTM089LF  ONLINE      48     6 10.6K
        ata-ST14000NM0018-2H4101_ZHZ32TWF  ONLINE       0     0 10.6K
        ata-ST14000NM0018-2H4101_ZHZ3WLKC  ONLINE       0     0 10.6K

errors: 8 data errors, use '-v' for a list
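
As a rough cross-check of the scan lines above, the "to go" figure can be reproduced from the total, the issued amount and the issue rate. This is only a sketch: it assumes that the T and M suffixes are binary multiples (TiB, MiB/s), and the function names are made up for illustration.

```python
MIB_PER_TIB = 1024 * 1024  # assumption: zpool prints binary multiples


def scrub_eta_seconds(total_tib, issued_tib, rate_mib_s):
    """Seconds left if the remaining data is issued at the current rate."""
    remaining_mib = (total_tib - issued_tib) * MIB_PER_TIB
    return remaining_mib / rate_mib_s


def format_eta(seconds):
    """Format seconds in the 'N days HH:MM:SS' style that zpool status uses."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days} days {hours:02d}:{minutes:02d}:{secs:02d}"


# Figures from the status output above: 39.2T total, 18.7T issued at 192M/s.
eta = scrub_eta_seconds(39.2, 18.7, 192)
print(format_eta(eta))
```

With those inputs the result lands within a couple of minutes of the 1 days 07:06:57 that zpool printed; the small gap is expected because the displayed figures are rounded.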

I guess my question is, what is normal? I've done SMART checks on all of these drives and they check out just fine. I have all the data on these drives backed up, so I truly am using this as a learning experience, but I've written roughly 40TB to it. Should I expect to see some READ/WRITE errors pop up every now and then and get cleaned up with a scrub? Or should it always be 0's? Thanks!

LnxBil · Jul 25, 2023

Consonant1022 said:
I guess my question is, what is normal?

Normally you have 0 errors.

Consonant1022 said:
Should I expect to see some READ/WRITE errors pop up every now and then and get cleaned up with a scrub? Or should it always be 0's?

Over the years I have run into only a few read errors (silent data corruption), except when a drive was failing; then the errors skyrocketed.

Check the SATA cables or switch to another HBA for those 4 ports that trouble you.
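
Since the expected value for every counter is 0, the check can be automated. The following is a minimal sketch, not an official tool: it assumes the column layout of the zpool status output posted above, and the function name is hypothetical. Counters are kept as the raw strings zpool prints (such as 10.6K) so that no unit conversion has to be guessed.

```python
import re

# One row of the config table: NAME STATE READ WRITE CKSUM [optional note].
_ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a non-zero counter."""
    bad = {}
    for line in status_text.splitlines():
        match = _ROW.match(line)
        if not match:
            continue  # header lines, scan lines, blank lines
        counts = match.group(3, 4, 5)
        if any(count != "0" for count in counts):
            bad[match.group(1)] = counts
    return bad
```

Feeding it the text of zpool status (for example captured with subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout) gives a dictionary that is empty on a healthy pool.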

Dunuin · Jul 25, 2023

If only a few disks are causing IO errors, it's usually bad disks or bad cabling. If it's all disks of an HBA/backplane, then it's usually the controller/backplane. If all disks show IO errors, I would run memtest86+ to check for bad RAM.
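
That rule of thumb can be written down as a small decision function. This is only a sketch of the heuristic as stated, not a diagnostic tool; the function name, the return strings and the disk labels d1 to d6 are hypothetical, and the controller names in the usage note are taken from later in this thread.

```python
def likely_culprit(errored, disks_by_controller):
    """Apply the rule of thumb: all disks -> RAM, a whole controller ->
    controller/backplane, anything else -> individual disks or cabling."""
    all_disks = set().union(*disks_by_controller.values())
    errored = set(errored) & all_disks
    if not errored:
        return "no errors"
    if errored == all_disks:
        return "RAM (run memtest86+)"
    for disks in disks_by_controller.values():
        # Every disk behind this controller or backplane is affected.
        if set(disks) <= errored:
            return "controller/backplane"
    return "disk or cabling"
```

For example, with {"intel": ["d1", "d2", "d3", "d4"], "asmedia": ["d5", "d6"]}, errors on d1 to d4 only point at the controller, while errors on all six disks point at RAM first.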

Consonant1022 · Jul 25, 2023

LnxBil said:
Check the SATA cables or switch to another HBA for those 4 ports that trouble you.

Right now I'm using the built-in SATA ports on the motherboard, and would like to keep it that way if possible, just for simplicity.

Dunuin said:
If only a few disks are causing IO errors, it's usually bad disks or bad cabling. If it's all disks of an HBA/backplane, then it's usually the controller/backplane. If all disks show IO errors, I would run memtest86+ to check for bad RAM.

I looked to see if there were driver updates for the SATA ports on my motherboard model, but didn't find anything. The 4 that are giving me problems are physically separated from the 2 that are not.

I'm also not positive this is the problem, which is why I ordered a different model drive to make sure: until I ran the recent replace command, all 4 ST14000NM001Gs were plugged into the group of SATA ports on the right.

I'm currently running a scrub and it says I have just over a day left, and the badblocks looks like it'll take another 7 hours. After that, I'll shut down the server and run a memtest.

Thanks for the input!

alexskysilk · Jul 25, 2023

Consonant1022 said:
ata-ST14000NM001G-2KJ103_ZL293B4B FAULTED 15 8 0 too many errors

Did you run a SMART long test on this drive? Did it pass?

Consonant1022 · Jul 25, 2023

alexskysilk said:
Did you run a SMART long test on this drive? Did it pass?

Not on this one yet. I wanted to let badblocks finish on the replaced ST14000NM001G that I took out of slot 1. If it passes everything, then I'll use that to replace ZL293B4B. Does that make sense, or is there no reason to wait? If that one disk is bad, would that cause the read/write errors across others in the pool?

alexskysilk · Jul 26, 2023

Consonant1022 said:
would that cause the read/write errors across others in the pool?

Maybe, but that's not all that important. You can run a SMART test in parallel with everything else; there's no reason to put it off.

Consonant1022 · Jul 26, 2023

I interpreted the badblocks progress wrong: it's now in the reading and comparing stage at 33% done after 24 hours with no errors (this is running on /dev/sda, which is currently not in the pool).

smartctl -t long /dev/sdc (the faulted drive) seems fine:

Code:

Serial Number:    ZL293B4B
LU WWN Device Id: 5 000c50 0c738207e
Firmware Version: SN03
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 26 07:07:25 2023 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241)    Self-test routine in progress...
                    10% of test remaining.
Total time to complete Offline
data collection:         (  567) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (1256) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x70bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   044    Pre-fail  Always       -       82390216
  3 Spin_Up_Time            0x0003   090   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   061   045    Pre-fail  Always       -       51437896
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       872
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       8
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   099   099   000    Old_age   Always       -       154621181988
190 Airflow_Temperature_Cel 0x0022   064   048   040    Old_age   Always       -       36 (Min/Max 27/39)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       3004
194 Temperature_Celsius     0x0022   036   052   000    Old_age   Always       -       36 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       329h+40m+15.544s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       14338705647
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       84174027898

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 10%       872         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

niteshadow · Jul 26, 2023

The first 4 SATA ports are Intel B660: good.
The other 2 are ASMedia ASM1061: not the best on Linux, avoid if possible.

alexskysilk · Jul 26, 2023

Consonant1022 said:
# 1 Extended offline Self-test routine in progress 10% 872 -

Your test wasn't finished.
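
Whether a self-test has actually finished can be read from the first row of the self-test log, which is the most recent test. A minimal sketch, assuming the log layout shown in the smartctl output above; the function names are hypothetical.

```python
import re

# "# 1  Extended offline    <status>   <remaining>%   <hours>   <first error LBA>"
_ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def selftest_rows(smart_text):
    """Parse the rows of the SMART self-test log, most recent first."""
    rows = []
    for line in smart_text.splitlines():
        match = _ROW.match(line)
        if match:
            rows.append({
                "num": int(match.group(1)),
                "test": match.group(2),
                "status": match.group(3),
                "remaining_pct": int(match.group(4)),
            })
    return rows


def latest_test_passed(smart_text):
    """True only if the most recent self-test completed without error."""
    rows = selftest_rows(smart_text)
    return bool(rows) and rows[0]["status"] == "Completed without error"
```

A row that still reads "Self-test routine in progress" makes latest_test_passed return False, which is the situation in the output quoted here.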

Consonant1022 · Jul 26, 2023

niteshadow said:
The first 4 SATA ports are Intel B660: good.
The other 2 are ASMedia ASM1061: not the best on Linux, avoid if possible.

Ironically... the first 4 are attached to the drives giving me issues.

alexskysilk said:
Your test wasn't finished.

Ha. Too much going on in my head this morning. Will repost that when it's done. Thanks!

Consonant1022 · Jul 26, 2023

SMART finished on that faulted drive with no errors:

Code:

Serial Number:    ZL293B4B
LU WWN Device Id: 5 000c50 0c738207e
Firmware Version: SN03
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 26 10:08:54 2023 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  567) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (1256) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x70bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   044    Pre-fail  Always       -       82390216
  3 Spin_Up_Time            0x0003   090   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   061   045    Pre-fail  Always       -       52647110
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       875
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       8
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   099   099   000    Old_age   Always       -       154621181988
190 Airflow_Temperature_Cel 0x0022   067   048   040    Old_age   Always       -       33 (Min/Max 27/39)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       3004
194 Temperature_Celsius     0x0022   033   052   000    Old_age   Always       -       33 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       332h+38m+24.243s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       14338705647
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       84174027898

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       875         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Ramalama · Jul 26, 2023

Consonant1022 said:

SMART finished on that faulted drive with no errors:

From your post with zpool status, it looks like all drives have some sort of errors, at least checksum errors.

I would do what @Dunuin said and check your memory.
If you had ECC memory, for example, it would be as easy as checking the EDAC utilities for reported errors, and you would at least know instantly whether it's the RAM or not.
Memtest86 just takes forever :-(

And I had a case where memtest86 reported no errors even though I did have memory errors, which I fixed by downclocking. (To be exact, the memory controller on my i9-10990xe couldn't handle speeds above 4000 MHz, which was fixed after some BIOS updates.)
In my experience it's all painful to debug.

Did you reboot once with the replaced drive and check whether the errors increase, on a scrub for example?
I would tbh replace the drive, clear the errors, reboot to also clear the memory, and do a scrub to check whether any errors appear again.

If they do, then it's likely something with the mainboard itself or the memory.

If they don't appear right away but start to appear again after some time, then it's indeed just the memory, and the memory cycle/flush helped temporarily.
If they don't appear again at all, then it was just the drive, but I doubt that tbh.

It's all guessing, but it's probably the best we can do right now, unless some Genius Guy appears from nowhere and tells us exactly what it is xD

Consonant1022 · Jul 26, 2023

I'll run the memtest after the current processes are done (70% of badblocks after 30 hours on /dev/sda, and scrub at 94.8% with 3 hours to go).

The board I have does not support ECC memory. It says you can use ECC memory in non-ECC mode, which I figure defeats the purpose. There is a recent BIOS update to the board that I haven't applied yet, I will try that next as well.

I have rebooted the server at times, and a few errors always seem to trickle back here and there.

Ramalama said:
unless some Genius Guy appears from nowhere and tells us exactly what it is xD

I thought that's what this forum was for?!

Ramalama · Jul 26, 2023

Yeah, I meant that if you had ECC, it would be a lot easier to debug...
So the next time you get new hardware, consider it as a factor xD
There are devices with in-band ECC coming up, which seems to be a very nice alternative.
You sacrifice 1/32 of the memory size plus a small performance penalty for cheap, fully working ECC (meaning detection, correction and reporting).

But yeah, using ECC memory in non-ECC mode is surely pretty useful

Wish you good luck finding your root cause!
Cheers

Consonant1022 · Aug 4, 2023

Quick update. badblocks finally finished with no errors (took something like 160 hours). I then took that drive and replaced the FAULTED one in the pool. 2 days later, I was still getting plenty of errors, so I took the server down and ran memtest. The server randomly shut down overnight while running memtest, which seemed odd. After a bit of googling, I decided to change some settings in my BIOS and specifically set DDR4-3200 for my RAM instead of the auto settings. I brought the server back up and am running a scrub. So far only 2 checksum errors (which I think are valid because I had 8 data errors, and I replaced 6 of the files from backups in the interim).

So far it has scrubbed over 2 TB and I'm feeling like my RAM settings were the issue. Scrub says it has 2 days to go, so I'll try to post a followup after that has completed. Thanks for the advice so far!
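
For context on the 160 hours: badblocks in write mode (-w) writes four test patterns and reads each one back, so the whole disk is traversed eight times. A quick sanity check of the implied average throughput, using the capacity reported by smartctl earlier in the thread (the function name is made up for illustration):

```python
# User Capacity from the smartctl output in this thread.
CAPACITY_BYTES = 14_000_519_643_136
# badblocks -w: four patterns, each written and then read back.
PASSES = 4 * 2


def average_throughput_mb_s(hours, capacity=CAPACITY_BYTES, passes=PASSES):
    """Average MB/s (decimal megabytes) needed to finish in the given time."""
    return capacity * passes / (hours * 3600) / 1e6


print(round(average_throughput_mb_s(160)))  # prints 194
```

An average of roughly 194 MB/s sustained over a full week suggests the drive was not stalling on I/O errors during the test, although that says nothing about the motherboard ports the pool drives were attached to.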

Consonant1022 · Aug 7, 2023

I spoke too soon. During the scrub, the errors shot back up, one of the discs went to FAULTED, and now 80 new files say they are corrupted. I have new RAM (different brand, but still on the motherboard's compatibility list). I'll swap that out when the scrub is finished and rescrub.

Consonant1022 · Sep 6, 2023

I spent a lot of time trying to track this down and am still not sure I have the answer, but will continue to provide information in case it's of use to someone else doing the same thing. I have swapped all SATA cables, all power cables, RAM, updated the motherboard BIOS, and now all of the ST14000NM001G HDDs. Yet I was still getting plenty of read/write errors (almost exclusively on SATA 1 - 4, to stay consistent with the numbering I used above). I found a dmesg error that correlated with the read/write errors in my pool:

Code:

sd 4:0:0:0: [sda] tag#19 Sense Key : Illegal Request [current]
sd 4:0:0:0: [sda] tag#19 Add. Sense: Unaligned write command

A bit of googling with that led me to the following:

https://github.com/openzfs/zfs/issues/4873#issuecomment-269074579

I have the grub boot-loader, so I added libata.force=1.5 to /etc/default/grub and rebooted. I'm currently running another scrub to see if my errors jump back up.

I've not previously issued kernel boot parameters. Is there a way to verify they worked? I found a number of different options about what the kernel text file would be, but I think that the grub file I edited was the correct one. Thanks!
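
One way to check is to look at the command line the running kernel was actually booted with, which Linux exposes in /proc/cmdline. The sketch below parses that string; the function names are hypothetical. Note that on GRUB systems an edit to /etc/default/grub normally only takes effect after the configuration has been regenerated (update-grub on Debian-based systems) and the machine has been rebooted.

```python
def kernel_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_running_cmdline(path="/proc/cmdline"):
    """Return the command line of the currently running kernel."""
    with open(path) as handle:
        return handle.read().strip()
```

On the server itself, kernel_params(read_running_cmdline()).get("libata.force") returns the value that was passed, or None if the parameter never reached the kernel.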

Consonant1022 · Sep 7, 2023

Another quick update: this didn't fix it. I verified via dmesg that I was running at 1.5 Gbps:

Code:

[    1.290468] ata5: FORCE: PHY spd limit set to 1.5Gbps
[    1.290469] ata5: SATA max UDMA/133 abar m2048@0x70b02000 port 0x70b02300 irq 125
[    1.290470] ata6: FORCE: PHY spd limit set to 1.5Gbps
[    1.290472] ata6: SATA max UDMA/133 abar m2048@0x70b02000 port 0x70b02380 irq 125
[    1.290474] ata7: FORCE: PHY spd limit set to 1.5Gbps
[    1.290474] ata7: SATA max UDMA/133 abar m2048@0x70b02000 port 0x70b02400 irq 125
[    1.290476] ata8: FORCE: PHY spd limit set to 1.5Gbps
[    1.290477] ata8: SATA max UDMA/133 abar m2048@0x70b02000 port 0x70b02480 irq 125

However, I'm still getting tons of errors:

Code:

  pool: nas
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 524K in 23:30:42 with 2 errors on Wed Sep  6 16:35:50 2023
config:

    NAME                                   STATE     READ WRITE CKSUM
    nas                                    DEGRADED     0     0     0
      raidz1-0                             DEGRADED    57     0     0
        ata-ST14000NM0018-2H4101_ZHZ38XLQ  FAULTED     10     0     0  too many errors
        ata-ST14000NM0018-2H4101_ZHZ67X6X  ONLINE      17     0     4
        ata-ST14000NM0018-2H4101_ZHZ3VJCH  ONLINE      26     0     4
        ata-ST14000NM0018-2H4101_ZHZ6HXSM  ONLINE      19     0     4
        ata-ST14000NM0018-2H4101_ZHZ32TWF  ONLINE       5     0     4
        ata-ST14000NM0018-2H4101_ZHZ3WLKC  ONLINE       3     0     4

I've completed the RMA process for my motherboard, but I assume that will be a royal PITA. So I've also ordered an HBA. Going to try that next..

Consonant1022 · Sep 8, 2023

Hopefully the final update. Got the HBA and installed it. Put all 6 drives on it and have been running plenty of read/writes:

Code:

$ zpool status nas
  pool: nas
 state: ONLINE
  scan: resilvered 119M in 00:00:09 with 0 errors on Thu Sep  7 16:16:23 2023
config:

    NAME                                   STATE     READ WRITE CKSUM
    nas                                    ONLINE       0     0     0
      raidz1-0                             ONLINE       0     0     0
        ata-ST14000NM0018-2H4101_ZHZ38XLQ  ONLINE       0     0     0
        ata-ST14000NM0018-2H4101_ZHZ67X6X  ONLINE       0     0     0
        ata-ST14000NM0018-2H4101_ZHZ3VJCH  ONLINE       0     0     0
        ata-ST14000NM0018-2H4101_ZHZ6HXSM  ONLINE       0     0     0
        ata-ST14000NM0018-2H4101_ZHZ32TWF  ONLINE       0     0     0
        ata-ST14000NM0018-2H4101_ZHZ3WLKC  ONLINE       0     0     0

errors: No known data errors

I found plenty of similar discussions online about B660M motherboards, so I'm guessing something is just not right between all of these components. I'm not thrilled about having to use an HBA: I specifically bought this board because it had 6 SATA ports, and an HBA uses more watts. But if it makes my pool reliable, then it's worth it.
