Scrub won't complete on degraded ZFS pool

headband

New Member
Jun 10, 2025
Good morning!

My media server had been acting funky, so I decided to rsync the most recent data to my backup server in case something was wrong with a drive. When I started working with a particular drive I noticed it was rsyncing very, very slowly. I pulled up the Proxmox web interface and navigated to Node > Disks, but that page timed out after 30 seconds. After that I went to the shell and ran zpool status -v, which informed me that the troublesome drive was DEGRADED, that 8 particular files were corrupted, and that a scrub had been running since the day before and was at 98.98%. I ran zpool status a few more times throughout the day, but the scrub didn't appear to budge from 98.98%.

Eventually I rebooted the entire system, and right after reboot the Node > Disks page in the web interface loaded great. Surprisingly, the drive that is giving me trouble shows that it passes SMART tests. The files themselves were loading as expected. I ran zpool status -v again and it had the same output, except this time the scrub completion percentage had started over, and I could see by running the command repeatedly that it was progressing nicely. I figured I'd let it finish and see how things looked then.

Unfortunately it seemed to get slower as it approached that 98% again, and the Node > Disks page stopped responding around the same time. It again appears to be stuck above 98%, and the data on the disk is responding very, very slowly, if at all.

Any ideas as to what is going on here? I'm not sure if it's the physical drive that's bad or the ZFS pool or what. I'm a bit out of my depth here so I would really appreciate any guidance anyone is willing to offer.

Thanks so much!
 
Quick update: it seems like the whole server gets slower and slower as it tries to continue the scrub, up until Proxmox itself becomes unresponsive. For two days in a row I've rebooted first thing in the morning; everything worked fine at first, then got slower and slower, then the questionable drive stopped responding, and sometime in the night Proxmox wouldn't load the web interface or even respond to SSH. The next morning I have to hard reboot again and the cycle repeats.
 
To me it seems that the drive that ends up in the DEGRADED state is dying in some funky way that causes the behavior you see. I would make sure you have a backup, remove the failing drive, connect a new one, and use zpool replace to resilver it. You could even add a third drive if it were a mirror, but given the behavior you see, the resilver to that third drive may get stuck too at some point.

If the backup doesn't finish with both drives connected, remove the failing drive and try again.
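Roughly, the replace step looks like this (a sketch only; "tank" and the device names are placeholders for your actual pool and disks):

Code:
zpool status -v tank                                        # identify the failing device
zpool replace tank ata-OLD_DISK_SERIAL ata-NEW_DISK_SERIAL  # swap it; the resilver starts automatically
zpool status -v tank                                        # watch resilver progress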

The files themselves were loading as expected
There might be bit corruption that doesn't affect the file content, or the scrub detected that the two disks don't hold the same bits, which makes sense if one is failing.
 
Thanks a ton for the advice; I've been stuck on this for a couple of days now. The pool itself is a single drive, so if I replace the drive I'll have to recover from my backup copy and not via resilver, is that right? I'm not super familiar with the resilver process, but I imagine it's similar to replacing a failed drive in a RAID array, and if that's the case I don't think it applies to me since I only have a single disk.

Is the fact that this drive is passing SMART enough to indicate that the problem isn't the physical disk itself? I'm wondering if I should wipe and reuse the drive or replace it.
 
In ZFS you create vdevs and then build pools on them. A pool distributes the data across the devices in its vdevs. So if one of the disks is beginning to fail and data is corrupted, ZFS detects it and heals the damaged data from the other devices.
In your case, with a pool of only one disk, ZFS has no other copies to heal from.
When creating pools with a single disk (a stripe), it is imperative to have backups. That is the only way to recover.
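To illustrate the difference (placeholder device paths, adjust to your hardware):

Code:
# single-disk pool (stripe): ZFS can detect corruption but has nothing to repair it from
zpool create tank /dev/disk/by-id/ata-DISK_A
# two-way mirror: ZFS can heal a bad block from the other copy
zpool create tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B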
 

Ok cool that's kind of what I figured but it's good to hear confirmation on that. Thanks!

I'm going to go ahead and try to rsync the most recent data (which wasn't backed up) off of the failing drive onto the backup server. Then I'm going to completely wipe the drive, create a new pool on it, and copy everything over to it again. Does that all sound rational to you all?
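Concretely, I'm picturing something like this for the wipe-and-recreate part (names are just what I expect to use, so treat them as placeholders; I understand zpool destroy wipes everything, so only after the rsync finishes):

Code:
zpool destroy TeeVeeStorage                              # destroys all data on the pool!
zpool labelclear -f /dev/disk/by-id/ata-OLD_DISK         # wipe old ZFS labels off the disk
zpool create TeeVeeStorage /dev/disk/by-id/ata-NEW_DISK  # recreate the pool on the new/wiped disk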

What do you all suppose is causing my whole server to slow to a crawl? The stuck scrub? It's really weird that everything feels snappy and responsive right after a reboot but becomes basically unusable after a few hours.

In the future, is it a better idea to use multiple disks for my pools? What is the minimum number of disks I should use for the kind of disk-failure redundancy you were referring to?

Again thank you so much for the help, I know just enough to get myself in trouble here so your guidance is really appreciated.
 
Ok this is really weird.

My pool is attached to turnkey-fileserver, then an UbuntuServer VM mounts that via fstab.

To manually backup, I rsync from turnkey-fileserver to another instance of turnkey-fileserver on a different (unrelated) server.
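The manual step is basically just (paths and hostname here are placeholders, not my exact ones):

Code:
# archive mode, preserve hard links/ACLs/xattrs, show overall progress
rsync -aHAX --info=progress2 /mnt/pool/ root@backup-server:/mnt/backup/pool/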

I'm rsyncing the latest files off the bad drive to the backup server before I wipe it, but it is going very slowly (I assumed because of the failing drive). I shut down the UbuntuServer VM since it isn't being used, and I noticed that as soon as it shut down the rsync speed shot WAY up, which is very odd considering the UbuntuServer doesn't have anything to do with the rsync.

The scrub is now also unstuck, moving along at a nice pace. Now I'm even more confused but I'm excited to see what happens once the scrub finally completes here in about 15 minutes.
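(For reference, the standard way to watch whether a scrub is actually moving and what the disk is sustaining; pool name as in the status output below:)

Code:
zpool iostat -v TeeVeeStorage 5   # per-device I/O statistics, refreshed every 5 seconds
zpool status TeeVeeStorage        # the 'scan:' line reports scrub progress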
 
Code:
  pool: TeeVeeStorage
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 3 days 10:30:45 with 22 errors on Wed Jun 11 10:55:47 2025
config:

        NAME                                STATE     READ WRITE CKSUM
        TeeVeeStorage                       DEGRADED     0     0     0
          ata-WDC_WUH721414ALN6L4_9RHJR6WC  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        TeeVeeStorage/subvol-101-disk-0:<0x11309>
        TeeVeeStorage/subvol-101-disk-0:<0x1120d>
        TeeVeeStorage/subvol-101-disk-0:<0x1122f>
        TeeVeeStorage/subvol-101-disk-0:<0x11095>
        TeeVeeStorage/subvol-101-disk-0:<0x11299>
        TeeVeeStorage/subvol-101-disk-0:<0x110a1>
        TeeVeeStorage/subvol-101-disk-0:<0x9dac>
        TeeVeeStorage/subvol-101-disk-0:<0x110c8>
        TeeVeeStorage/subvol-101-disk-0:<0x8cfe>
Here's the output of zpool status -v now that the scrub is done. It says under action that I can restore the files in question if possible. The 7 files that were originally listed as corrupted aren't listed anymore because I deleted them, since they're easy to replace. Is there some command I can run to let the pool know that I'm good without those?
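My best guess from the OpenZFS docs that output links to (so treat this as a guess, not gospel) is to clear the error counters and re-scrub; apparently it can take a scrub, sometimes two, before deleted files drop off the list:

Code:
zpool clear TeeVeeStorage        # reset the pool's error counters (repairs nothing)
zpool scrub TeeVeeStorage        # re-scrub; deleted files should drop off the error list
zpool status -v TeeVeeStorage    # confirm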
 
That drive is dying in a quite peculiar way, although I've seen other weird behaviors like that. Simply back up all data, buy a new drive, and ditch the old one. I wouldn't use it for anything besides practicing with broken drives in a lab.

In the future, is it a better idea to use multiple disks for my pools? What is the minimum number of disks I should use for the kind of disk-failure redundancy you were referring to?
At the very least, use a mirror of two drives (RAID1), preferably drives of the same model and firmware but from different batches, to avoid the edge case of a bad batch affecting both drives. Alternatively, use different makers/models of the same capacity, and try to get enterprise drives instead of consumer-grade ones, even if budget forces you to buy second hand.
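For example (device ids are placeholders):

Code:
# a new two-way mirror from scratch
zpool create tank mirror /dev/disk/by-id/ata-DRIVE_A /dev/disk/by-id/ata-DRIVE_B
# or convert an existing single-disk pool into a mirror by attaching a second disk
zpool attach tank /dev/disk/by-id/ata-DRIVE_A /dev/disk/by-id/ata-DRIVE_B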
 
I've been backing my data up to a separate Proxmox + turnkey-fileserver instance via rsync. Once I have the new drive installed and ready to roll, should I just rsync everything back over to it from the backup server, or is there a better way to copy the whole pool via Proxmox?
 
Post the content of smartctl --all /dev/disk/by-id/ata-WDC_WUH721414ALN6L4_9RHJR6WC

Code:
root@proxmox:~# smartctl --all /dev/disk/by-id/ata-WDC_WUH721414ALN6L4_9RHJR6WC
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-3-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC  WUH721414ALN6L4
Serial Number:    9RHJR6WC
LU WWN Device Id: 5 000cca 264d5b0dc
Firmware Version: LDGNW2L0
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Size:      4096 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jun 11 15:50:23 2025 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1524) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   083   083   001    Pre-fail  Always       -       339 (Average 334)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   086   086   001    Pre-fail  Always       -       725
  7 Seek_Error_Rate         0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       15190
 10 Spin_Retry_Count        0x0013   100   100   001    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       652
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       652
194 Temperature_Celsius     0x0002   045   045   000    Old_age   Always       -       47 (Min/Max 22/53)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       725
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       828
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 394 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 394 occurred at disk power-on lifetime: 15185 hours (632 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 1c 38 44 b8 bb 40 08      00:01:04.249  READ FPDMA QUEUED
  47 00 01 30 08 00 a0 08      00:01:01.390  READ LOG DMA EXT
  47 00 01 30 00 00 a0 08      00:01:01.390  READ LOG DMA EXT
  47 00 01 00 00 00 a0 08      00:01:01.389  READ LOG DMA EXT
  47 00 01 12 00 00 a0 08      00:01:01.387  READ LOG DMA EXT

Error 393 occurred at disk power-on lifetime: 15185 hours (632 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 1c 28 44 b8 bb 40 08      00:01:01.016  READ FPDMA QUEUED
  60 1c 30 44 b8 bb 40 08      00:00:58.159  READ FPDMA QUEUED
  47 00 01 30 08 00 a0 08      00:00:58.158  READ LOG DMA EXT
  47 00 01 30 00 00 a0 08      00:00:58.157  READ LOG DMA EXT
  47 00 01 00 00 00 a0 08      00:00:58.157  READ LOG DMA EXT

Error 392 occurred at disk power-on lifetime: 15185 hours (632 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 1c 10 44 b8 bb 40 08      00:00:58.099  READ FPDMA QUEUED
  60 1c 20 44 b8 bb 40 08      00:00:55.236  READ FPDMA QUEUED
  60 1c 18 44 b8 bb 40 08      00:00:55.236  READ FPDMA QUEUED
  47 00 01 30 08 00 a0 08      00:00:55.235  READ LOG DMA EXT
  47 00 01 30 00 00 a0 08      00:00:55.234  READ LOG DMA EXT

Error 391 occurred at disk power-on lifetime: 15185 hours (632 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 1c b8 44 b8 bb 40 08      00:00:55.174  READ FPDMA QUEUED
  60 1c 08 44 b8 bb 40 08      00:00:52.333  READ FPDMA QUEUED
  60 1c 00 44 b8 bb 40 08      00:00:52.333  READ FPDMA QUEUED
  60 1c f8 44 b8 bb 40 08      00:00:52.333  READ FPDMA QUEUED
  47 00 01 30 08 00 a0 08      00:00:52.332  READ LOG DMA EXT

Error 390 occurred at disk power-on lifetime: 15185 hours (632 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 1c 70 44 b8 bb 40 08      00:00:51.941  READ FPDMA QUEUED
  60 1c 00 84 f7 bb 40 08      00:00:48.883  READ FPDMA QUEUED
  60 1c a8 c4 b7 bb 40 08      00:00:48.883  READ FPDMA QUEUED
  60 1c a0 c4 b7 bb 40 08      00:00:48.883  READ FPDMA QUEUED
  60 1c 98 c4 b7 bb 40 08      00:00:48.883  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@proxmox:~#

If the disk passes a long test, you have problems with your SATA host/cabling.
 
Oooh, good info, thank you! What's the best way to run a long test on that drive, and should I wait until after my rsync is complete to do it?
 
Code:
  5 Reallocated_Sector_Ct   0x0033   086   086   001    Pre-fail  Always       -       725
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       725
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       828
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       3
These are the main indicators of a dying drive. It is unhealthy.

The test shows PASSED but the indicators are there that it is unhealthy and it will continue giving problems. Best to replace it.
I've been backing my data up to a separate Proxmox + turnkey-fileserver instance via rsync. Once I have the new drive installed and ready to roll, should I just rsync everything back over to it from the backup server, or is there a better way to copy the whole pool via Proxmox?
rsync copies files and directories; it works at the filesystem level.
ZFS provides another way to transfer the data, and it is super quick. If you have another server with sshd, i.e. one you can ssh into, then you can snapshot your ZFS datasets and "zfs send" them. It can also be done at the pool level.
I just finished a backup setup for a machine that runs on ZFS. I do it as a pull, not a push: in other words, the backup server connects to the machine to be backed up, snapshots the whole pool (also a single-disk stripe) and pulls all datasets recursively.
A run like this took about 10 seconds or less over a gigabit link:
Code:
$ zfs list -rt all Deimos/firewall-backups/zroot
NAME                                                            USED  AVAIL     REFER  MOUNTPOINT
Deimos/firewall-backups/zroot                                  44.6G  5.05T      128K  /mnt/Deimos/firewall-backups/zroot
Deimos/firewall-backups/zroot/ROOT                             44.6G  5.05T      128K  /mnt/Deimos/firewall-backups/zroot/ROOT
Deimos/firewall-backups/zroot/ROOT/24.7                        44.6G  5.05T     34.0G  /mnt/Deimos/firewall-backups/zroot/ROOT/24.7
Deimos/firewall-backups/zroot/ROOT/24.7@2024-08-06-10:39:43-0  3.10G      -     9.61G  -
Deimos/firewall-backups/zroot/ROOT/24.7@2024-08-06-12:33:42-0  3.81G      -     10.3G  -
These are FreeBSD machines, but this should work for Linux too. I think we can help you set it up, but it is a bit complex; maybe something for after all this is done, since you can do rsync and are using tools familiar to you.
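The push-style equivalent looks roughly like this (a sketch only; the hostname, target dataset, and snapshot name are placeholders, and you'd switch to incremental sends after the first full one):

Code:
# take a recursive snapshot of the pool, then stream it to the backup box over ssh
zfs snapshot -r TeeVeeStorage@backup-2025-06-11
zfs send -R TeeVeeStorage@backup-2025-06-11 | \
  ssh root@backup-server zfs receive -u Deimos/backups/TeeVeeStorage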
Can you show the results of $ zfs list?
 
Woah, I'm definitely interested in learning more about this. You're probably right that it's best to use the tools I'm familiar with for now, but I'll definitely be digging more into zfs send later... Thanks!

Here's the output of $ zfs list:
Code:
root@proxmox:~# zfs list
NAME                              USED  AVAIL     REFER  MOUNTPOINT
MediaStorage                     4.76T  4.21T       96K  /MediaStorage
MediaStorage/subvol-101-disk-0   4.76T  4.21T     4.76T  /MediaStorage/subvol-101-disk-0
TeeVeeStorage                    11.3T  1.28T       96K  /TeeVeeStorage
TeeVeeStorage/subvol-101-disk-0  11.3T  1.28T     11.3T  /TeeVeeStorage/subvol-101-disk-0
root@proxmox:~#
 
...and try to get enterprise drives instead of consumer grade ones, even if budget forces you to buy second hand.

Would this fit the bill in your opinion? It's a refurbished WD Ultrastar which they bill as their 'enterprise grade'.

https://serverpartdeals.com/product...-7-2k-rpm-sata-6gb-s-512e-3-5-refurbished-hdd

I believe it's an Ultrastar I'm replacing so if anyone has a suggestion otherwise I'm all ears!

Again, thank you everyone for all of the help - I would've been pretty lost otherwise.
 
The test shows PASSED
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing

There is no record of a test being performed.
What's the best way to run a long test on that drive

smartctl --test=long /dev/disk/by-id/ata-WDC_WUH721414ALN6L4_9RHJR6WC

IF the disk is good, the test should take about a day to complete, but do you see all those errors trapped by your HDD firmware? Chances are it will fail in the first 10 minutes. You can check progress with smartctl --all /dev/disk/by-id/ata-WDC_WUH721414ALN6L4_9RHJR6WC
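To poll just the self-test log instead of the full output (standard smartctl options; the temperature check is worth adding given the 47 °C reading in the attributes above):

Code:
# self-test progress and the eventual pass/fail result
smartctl -l selftest /dev/disk/by-id/ata-WDC_WUH721414ALN6L4_9RHJR6WC
# keep an eye on attribute 194 (temperature) while the test runs
smartctl -A /dev/disk/by-id/ata-WDC_WUH721414ALN6L4_9RHJR6WC | grep -i temp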
 
One Important Thing (I don't see it mentioned by the OP): IF Cooling is an Issue in the Server, running a LONG SMART Test can basically cook the Drive (in Terms of Temperature!).

We agree that it should NOT happen if the Fans are correctly spinning and have enough static Pressure, but just a Thing to keep in Mind.

Ask me how I know :rolleyes: .
 
running a LONG Smart Test can basically cook the Drive (in Terms of Temperature !).
A SMART test doesn't put any significant load on the drive and will typically not impact disk performance at all. A disk under SMART test would not meaningfully increase heat generation. You can cook a disk drive AT IDLE without adequate cooling. Don't believe me? Pull the data sheet.
 
There is no record of a test being performed.
Post #12, unless I'm reading it wrong and that one is for another disk/system.