Insufficient Replicas, Data loss on a single drive failure on RAIDZ2

DR4GON

Member
Sep 7, 2021
I don't know how, but with no other drive failures, I've experienced "Insufficient Replicas" on a drive failure.

This is the result from:
Code:
zpool clear NAS
zpool status NAS

Code:
  pool: NAS
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Feb  1 17:29:29 2022
        26.2T scanned at 1.29G/s, 22.0T issued at 1.09G/s, 26.5T total
        631G resilvered, 83.07% done, 0 days 01:10:25 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS                                             DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-01    ONLINE       0     0     0
            replacing-1                                 UNAVAIL      0     0     0  insufficient replicas
              ata-WDC_WD40EFRX-68N32N0_WD-02  UNAVAIL      0     0     0
              ata-WDC_WD40EFRX-68N_WD-02-NEW  FAULTED      0    34     0  too many errors  (resilvering)
            ata-WDC_WD40EFRX-68N32N0_WD-03    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-04    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-05    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-06    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-07    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-08    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-09    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-10    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-11    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-12    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-13    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-14    ONLINE       0     0     0
            ata-ST4000DM004-2CV-15            ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-16    ONLINE       0     0     0

errors: No known data errors
 
By the way, your ST4000DM004 is an SMR disk and shouldn't be used with ZFS. Once its cache fills up it can become so slow to respond that ZFS considers the drive dead and the pool degrades. I have one here and regularly see average response times of over 1 minute and write performance of a few KB/s as soon as I try to write more than several GB at once.

You should think about replacing that too.

Did you maybe have a power outage? In that case all disks would lose their cached data at the same time, so data on all disks could be corrupted at once and parity won't help.
 
Sorry, I forgot to post my question. Considering the server is a media server and the data on it is 100% volatile, is there a way to tell Proxmox to delete all the errored data so that the VM can get to work re-downloading the missing video files? If I use zpool clear NAS, it just goes back to attempting the resilver and then fails, leaving my setup in a degraded state.

You'll notice that I am in the process of replacing the failed drive.
 
By the way, your ST4000DM004 is an SMR disk and shouldn't be used with ZFS. Once its cache fills up it can become so slow to respond that ZFS considers the drive dead and the pool degrades. I have one here and regularly see average response times of over 1 minute and write performance of a few KB/s.
Okay cool I'll look at replacing it. Does that explain the data loss with RAIDZ2?
 
Did you maybe have a power outage? In that case all disks would lose their cached data at the same time, so data on all disks could be corrupted at once and parity won't help.
I'm on a UPS and have 80 days of uptime, with at least 70 days without an error. The last time I performed a manual check of the system was before I went away for a week. If power had been lost, the server would have logged a restart, and my VM would be off because it's not set to restart automatically. The UPS can run the server for 60 minutes.
 
Okay cool I'll look at replacing it. Does that explain the data loss with RAIDZ2?
Resilvering puts heavy load on your pool. It's possible that the ST4000DM004 can't handle that, but normally that would show up as read and write errors on the ST4000DM004 itself. I'm not sure what that would look like during a resilver, though.
To me it looks more like "ata-WDC_WD40EFRX-68N_WD-02-NEW" is causing the write errors. Did you verify that the disk wasn't dead on arrival? Rechecking the SATA cable might help too. Maybe the connection isn't good after replacing the disk.
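
A rough sketch of how the new disk and its cabling could be checked with smartctl before retrying the resilver. It assumes smartmontools is installed; /dev/sdX is a placeholder for the replacement disk, and the helper name crc_error_count is made up for this example. A UDMA_CRC_Error_Count (SMART attribute 199) that keeps climbing usually points at the cable or backplane rather than the disk itself.

```shell
# Print the raw value of SMART attribute 199 (UDMA_CRC_Error_Count)
# from `smartctl -A` output read on stdin.
crc_error_count() {
  awk '$1 == 199 { print $10 }'
}

# Typical use on the host (as root), with /dev/sdX as a placeholder:
#   smartctl -H /dev/sdX                    # overall health self-assessment
#   smartctl -t long /dev/sdX               # start an extended self-test
#   smartctl -A /dev/sdX | crc_error_count  # CRC errors seen on the link
```

Comparing the count before and after a resilver attempt is more telling than a single reading, since the raw value never resets.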
 
Resilvering puts heavy load on your pool. It's possible that the ST4000DM004 can't handle that, but normally that would show up as read and write errors on the ST4000DM004 itself. I'm not sure what that would look like during a resilver, though.
Interesting... In the morning I'll pull the ST out and retry the resilver. Thanks for the info.

For academic purposes though, is there a way to tell the system to purge the missing data?
 
Interesting... In the morning I'll pull the ST out and retry the resilver. Thanks for the info.
Be careful. Then you have two missing disks and your pool can't handle any additional errors.
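
To put numbers on that warning: RAIDZ2 carries two disks' worth of parity per vdev, so a rough way to see how much redundancy is left is to count the member slots that are not ONLINE. The sketch below does that from zpool status output. The function name is made up, it assumes a pool with a single raidz2 vdev like the one in this thread, and it relies on the usual indentation of the config section (a tab or 8 spaces, plus 2 spaces per nesting level).

```shell
# Count the direct members of the raidz vdev that are not ONLINE,
# reading `zpool status` output on stdin. A "replacing-N" group counts
# as one slot; the disks nested inside it are ignored.
count_unhealthy_slots() {
  awk '
    /^config:/ { in_config = 1; next }
    /^errors:/ { in_config = 0 }
    in_config {
      line = $0
      sub(/^\t/, "        ", line)   # real output indents with a tab
      match(line, /^ */)             # RLENGTH = number of leading spaces
      if (RLENGTH == 12 && $2 != "ONLINE") n++
    }
    END { print n + 0 }
  '
}

# On the host: zpool status NAS | count_unhealthy_slots
```

A result of 2 on a raidz2 vdev means there is no redundancy left, and any further read error during the resilver turns into data loss.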
For academic purposes though, is there a way to tell the system to purge the missing data?
Not that I know of. But if you run a scrub and it finds checksum errors, it will tell you exactly which files are corrupted so you can delete them manually.
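
For what it's worth, the file list comes from zpool status -v once a scrub (or resilver) has recorded permanent errors. A minimal sketch for pulling just the paths out of that output, assuming the usual "errors: Permanent errors have been detected in the following files:" header; list_damaged_files is a made-up helper name.

```shell
# Print the entries listed under "Permanent errors" in
# `zpool status -v` output read on stdin, one per line.
list_damaged_files() {
  awk '
    /^errors: Permanent errors/ { listing = 1; next }
    listing && NF { sub(/^[ \t]+/, ""); print }
  '
}

# On the host (as root), NAS being the pool name from this thread:
#   zpool status -v NAS | list_damaged_files
```

Entries can also show up as dataset:<0x...> object IDs instead of paths when the file name can't be resolved, so it is worth reviewing the list by eye before deleting anything.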
 
Resilvering puts heavy load on your pool. It's possible that the ST4000DM004 can't handle that, but normally that would show up as read and write errors on the ST4000DM004 itself. I'm not sure what that would look like during a resilver, though.
To me it looks more like "ata-WDC_WD40EFRX-68N_WD-02-NEW" is causing the write errors. Did you verify that the disk wasn't dead on arrival? Rechecking the SATA cable might help too. Maybe the connection isn't good after replacing the disk.
Yeah, it was verified good. I'm in the process of looking for a new box, as the 16-bay was my first secondhand server and I have my eye on a new 24-bay. If it is a SATA connection problem, could my original drive not be faulty after all? This whole endeavor started when the 02 drive originally threw an "Insufficient Replicas", so if it turns out to be that ST drive, I'm going to be pretty happy I didn't just burn 2 drives.
 
Be careful. Then you have two missing disks and your pool can't handle any additional errors.

Not that I know of. But if you run a scrub and it finds checksum errors, it will tell you exactly which files are corrupted so you can delete them manually.
Thanks for the concern. If it fails right now, it's just a job for Sonarr and Radarr to fix by themselves, so I'm legitimately not concerned. It would be nice if it didn't fail, but the VM running on separate SSDs in the system is operation-critical, so shutting down to be safe isn't an option.

Silly question, but how do I either see the results from a failed scrub, or run a scrub manually? I can get the end result "shit is broken" but not a detailed list of what files are broken.

Edit: I've pulled the ST drive and restarted the resilver. Best case I'm still down one drive, worst case I'm still basically down two drives on a non-critical pool.
 
If I remember right, zpool status shows a command you can type in to get additional details about corrupted files. Something like "you got checksum errors... type in XYZ to get more details".

You can start a scrub with zpool scrub YourPool.
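
Spelled out for the pool in this thread, it would look roughly like the sketch below. zpool scrub and zpool status -v are standard commands; scan_summary is only a made-up helper that cuts the output down to the scan section. As far as I know a scrub can't be started while a resilver is still running, so this is for after the rebuild has finished.

```shell
# Print only the "scan:" section (scrub/resilver progress) of
# `zpool status` output read on stdin.
scan_summary() {
  awk '/^ *scan:/ { show = 1 } /^config:/ { show = 0 } show'
}

# On the host (as root), NAS being the pool name from this thread:
#   zpool scrub NAS                   # start the scrub
#   zpool status NAS | scan_summary   # check progress
#   zpool status -v NAS               # full status, incl. files with errors
```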
 
If I remember right, zpool status shows a command you can type in to get additional details about corrupted files. Something like "you got checksum errors... type in XYZ to get more details".

You can start a scrub with zpool scrub YourPool.
Ah awesome, thanks.

Lol so simple. I've just always used zpool clear NAS. Thanks
 
Ah awesome, thanks.

Lol so simple. I've just always used zpool clear NAS. Thanks
zpool clear just resets your error counters. It won't fix anything; it only hides the errors so you can no longer see them.
 
zpool clear just resets your error counters. It won't fix anything; it only hides the errors so you can no longer see them.
Yeah, but it resilvers every time I clear the errors because a drive has failed and I'm in the rebuild process. Lol
 
By the way, your ST4000DM004 is an SMR disk and shouldn't be used with ZFS. Once its cache fills up it can become so slow to respond that ZFS considers the drive dead and the pool degrades.
I had to google the difference between SMR and CMR, and it turns out the whole "WD Red" line is now SMR drives, with the new "WD Red Plus" designation reserved for the CMR drives. How do I know that not all of my drives are SMR? The only drives I know are CMR are the two new ones that just came in the mail, because they're WD Red Plus, and they will be the next replacement drives.
 
"WD40EFRX" should be CMR. But you can check that using the smartctl command. If the disk doesn't support TRIM it is CMR.
 
"WD40EFRX" should be CMR. But you can check that using the smartctl command. If the disk doesn't support TRIM it is CMR.
Ah, thanks, I got the two codes mixed up. EFAX are the DM-SMR drives, now designated as WD Red. The EFRX are the CMR drives, being re-designated as WD Red Plus.

Cheers, that was confusing. "Marketing; confusing consumers since the invention of trade."
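
On the smartctl/TRIM check mentioned above, a hedged sketch: newer smartmontools versions print a "TRIM Command:" line in smartctl -i output for disks that advertise TRIM, which on a spinning disk is a strong hint for drive-managed SMR. reports_trim is a made-up helper and /dev/sdX a placeholder. Not every SMR disk advertises TRIM, so a missing line is a hint rather than proof of CMR; checking the model number against the vendor's list is still worthwhile.

```shell
# Succeeds (exit 0) if `smartctl -i` output on stdin reports TRIM support.
reports_trim() {
  grep -q 'TRIM Command: *Available'
}

# On the host (as root), with /dev/sdX as a placeholder:
#   smartctl -i /dev/sdX | reports_trim && echo "TRIM advertised, possibly SMR"
# hdparm can be used for the same check:
#   hdparm -I /dev/sdX | grep -i trim
```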
 
