ZFS HDD fail, errors and data corruption issues

DemonChicken1111

New Member
Jun 5, 2025
3
0
1
Hi everyone, I run a small home lab for myself. My applications consist of a firewall, game servers, and web apps. I've just had a drive start to fail (see SMART values below), and I've also been getting I/O errors in my VMs (I presume they have files in the bad blocks). How do I go about fixing this? I plan on replacing the drive, but I'm not sure if that will fix the ZFS errors. Any help is much appreciated. I will also provide any logs or outputs if you require anything else from me. Thank you!

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   203   178   021    Pre-fail  Always       -       2825
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6125
  5 Reallocated_Sector_Ct   0x0033   129   129   140    Pre-fail  Always   FAILING_NOW 561
  7 Seek_Error_Rate         0x002e   200   185   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   051   051   000    Old_age   Always       -       36221
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   094   094   000    Old_age   Always       -       6124
192 Power-Off_Retract_Count 0x0032   196   196   000    Old_age   Always       -       3736
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1935126
194 Temperature_Celsius     0x0022   121   085   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       266
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
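
(For reference, attribute tables like the one above come out of smartctl; /dev/sdX below is just a placeholder for whichever disk you are checking.)

Code:
smartctl -A /dev/sdX    # vendor-specific attribute table, like the one posted above
smartctl -a /dev/sdX    # full report, including the overall health assessment and error log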

Code:
root@homelab:~# zpool status -x -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Jun  5 16:02:50 2025
        67.0G / 67.0G scanned, 7.37G / 67.0G issued at 11.8M/s
        0B repaired, 11.00% done, 01:25:53 to go
config:

        NAME                                                STATE     READ WRITE CKSUM
        rpool                                               ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            ata-WDC_WD10TPVT-00HT5T0_WD-WX51C10X9276-part3  ONLINE       0     0     0
            ata-ST91000640NS_9XG4G240-part3                 ONLINE       0     0     0
          mirror-1                                          ONLINE       0     0     0
            ata-WDC_WD10TPVT-00HT5T0_WD-WXF1AB0W3930-part3  ONLINE       0     0     0
            ata-WDC_WD10TPVT-00HT5T1_WD-WXH1AC0K4108-part3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        rpool/data/vm-100-disk-0:<0x1>
        rpool/data/vm-101-disk-0:<0x1>
 

I ran zpool scrub and here is the result; it apparently repaired some of the errors, but there are quite a few left:

Code:
root@homelab:~# zpool status -x -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 8K in 01:28:15 with 21 errors on Thu Jun  5 17:31:05 2025
config:

        NAME                                                STATE     READ WRITE CKSUM
        rpool                                               ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            ata-WDC_WD10TPVT-00HT5T0_WD-WX51C10X9276-part3  ONLINE       0     0     0
            ata-ST91000640NS_9XG4G240-part3                 ONLINE       0     0     0
          mirror-1                                          ONLINE       0     0     0
            ata-WDC_WD10TPVT-00HT5T0_WD-WXF1AB0W3930-part3  ONLINE       0     0   346
            ata-WDC_WD10TPVT-00HT5T1_WD-WXH1AC0K4108-part3  ONLINE       0     0   344

errors: Permanent errors have been detected in the following files:

        rpool/data/vm-100-disk-0:<0x1>
        rpool/data/vm-101-disk-0:<0x1>
 
A mirror protects you from one failing drive. A 3-way mirror protects you from two failing drives.
You went with mirrors, and in one mirror you put two drives of the same model.
No matter if it's RAIDZ2, a mirror or any other pool layout, I always recommend using different drives.
This is because of something I call the "bad batch problem".

Let me give an example of what the bad batch problem is.
Samsung had Pro SSDs (I think it was the 990 series) that overheated. Instead of slowing down to combat the heat, a firmware bug made them simply turn off.
Now imagine you have two of these Samsungs in a mirror: both turn off and your pool is gone.
Another problem is aging. Imagine the WDC_WD10TPVT HDDs have some technical weak point, let's say the motor often fails after 4 years.
Now, if after 4 years one drive fails and you put in a replacement, there is a high chance that the second drive (the one we are restoring from) also fails.
There is a way-too-long text about the bad batch problem if you are interested.

Anyway, back to your situation. The drives did not fail outright, but there were some checksum errors. Unfortunately this happened not on one but on both drives of mirror-1.
Both your VM disks, vm-100-disk-0 and vm-101-disk-0, now have corrupted data!
If you have backups of them from before the errors occurred, you should restore from those backups.
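
If it helps, the rough restore sequence (assuming your backups are normal Proxmox vzdump archives; the archive paths below are placeholders and the VM IDs are the ones from your pool) would look something like this:

Code:
# Restore the affected VMs from backup, overwriting the corrupted disks
# (adjust the archive paths to wherever your vzdump files live):
qmrestore /path/to/vzdump-qemu-100-backup.vma.zst 100 --force 1
qmrestore /path/to/vzdump-qemu-101-backup.vma.zst 101 --force 1
# Then reset the old error counters and re-scrub so ZFS can confirm the pool is clean:
zpool clear rpool
zpool scrub rpool
zpool status -v rpool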
 
Replace all the SATA cables with SATA 3 rated cables that have latching clips. And yes, you must delete those corrupted disks from VMs 100 and 101.
The SATA cables must be routed through your server without sharp edges or tight bends on the way to the drives.
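
For what it's worth, cable and link problems usually show up in the CRC counters; a quick way to check (with /dev/sdX standing in for each of your drives) is something like:

Code:
smartctl -A /dev/sdX | grep -i crc       # UDMA_CRC_Error_Count above 0 hints at a bad cable
dmesg | grep -iE 'ata[0-9]+|i/o error'   # link resets and bus errors from the kernel side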
 
Thanks for the response! I will be reading more into the bad batch problem lol. What are the SMART values indicating, then, if the drives haven't failed? I will be restoring from backups where possible. I also no longer have checksum errors in the zpool status:


Code:
root@homelab:~# zpool status -x -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 8K in 01:28:15 with 21 errors on Thu Jun  5 17:31:05 2025
config:

        NAME                                                STATE     READ WRITE CKSUM
        rpool                                               ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            ata-WDC_WD10TPVT-00HT5T0_WD-WX51C10X9276-part3  ONLINE       0     0     0
            ata-ST91000640NS_9XG4G240-part3                 ONLINE       0     0     0
          mirror-1                                          ONLINE       0     0     0
            ata-WDC_WD10TPVT-00HT5T0_WD-WXF1AB0W3930-part3  ONLINE       0     0     0
            ata-WDC_WD10TPVT-00HT5T1_WD-WXH1AC0K4108-part3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        rpool/data/vm-100-disk-0:<0x1>
        rpool/data/vm-101-disk-0:<0x1>
 
Not sure if I understand your question about SMART. It does show you bad sectors. But either way, it doesn't matter.

SMART isn't precise; it is mostly a tool that sometimes (I remember around 70%, but I could be wrong) can detect that a drive is going bad.

ZFS with its checksums, on the other hand, is the real deal.
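
If you want to cross-check both, a long SMART self-test plus a scrub is the usual combination (again, /dev/sdX is just a placeholder for the suspect drive):

Code:
smartctl -t long /dev/sdX   # start an extended self-test (takes hours on an HDD)
smartctl -a /dev/sdX        # read the result later under "Self-test execution status"
zpool scrub rpool           # have ZFS read and verify every block against its checksums
zpool status -v rpool       # watch progress and see which files, if any, are affected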
 
Hey! Sounds like you're on the right track by planning to replace the failing drive. Once you swap it out, ZFS should let you resilver the new disk if your pool is redundant (like in a mirror or RAIDZ). If it's not redundant, you might need to try zpool scrub to see what can be repaired, but unfortunately, data in bad blocks may be unrecoverable. Happy to take a look at your SMART output and zpool status if you want to share!
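
To sketch the replacement itself (the device names below are placeholders; take the real ones from zpool status and /dev/disk/by-id, and note that extra bootloader steps apply if rpool is also your Proxmox boot pool):

Code:
# Copy the partition table from a healthy mirror member onto the new disk,
# then give the copy fresh GUIDs (sgdisk --replicate copies TO its argument):
sgdisk --replicate=/dev/disk/by-id/NEW_DISK /dev/disk/by-id/HEALTHY_DISK
sgdisk --randomize-guids /dev/disk/by-id/NEW_DISK
# Swap the failing partition for the new one; ZFS resilvers from the healthy side:
zpool replace rpool ata-FAILING_DISK-part3 /dev/disk/by-id/NEW_DISK-part3
# Watch the resilver until it finishes:
zpool status -v rpool
# If rpool is the boot pool, also set up the new disk's ESP (see proxmox-boot-tool).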
 