ZFS Degraded

Dexter23

Hi everyone,
ZFS sent me an email about a degraded pool, but when I check the disk, SMART reports OK. Do I need to replace the disk anyway, or is this only a software problem on the drive and not a hardware fault?
1696835216242.png
1696835236471.png
 
The drive reported errors to ZFS in 22 separate read actions. It did not silently return corrupted data (that would cause checksum errors), so it's probably not a cable/connector issue.
Be aware that your data is at risk (in at least those 22 places) because there are no redundant copies anymore. Maybe replace the drive quickly and investigate the old drive separately?

Does the output of smartctl -a /dev/disk/by-id/scsi-35000c500c21a9a13-part1 show information about the errors (at the end)? You could do a long SMART self-test (which will probably take hours) with smartctl -t long /dev/disk/by-id/scsi-35000c500c21a9a13-part1.
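
As a rough illustration of the read-error versus checksum-error distinction above, here is a minimal shell sketch that lists the pool devices with non-zero error counters. It assumes the usual zpool status column layout (NAME STATE READ WRITE CKSUM). The function name zfs_error_devices is made up for this example, and zpool01 is the pool name mentioned later in this thread.

```shell
# Sketch only: reads `zpool status` text on stdin and prints every device
# line whose READ, WRITE or CKSUM counter is not "0".
# zfs_error_devices is a hypothetical helper name, not part of ZFS.
zfs_error_devices() {
    awk '
        NF >= 5 &&
        $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") {
            # Counters are compared as text because zpool status may
            # abbreviate large values (for example 1.2K).
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage (assumed pool name from this thread):
#   zpool status zpool01 | zfs_error_devices
```

A non-zero READ or WRITE count means the drive itself reported failed I/O, while a non-zero CKSUM count means data came back without an error but did not match its checksum.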
 
This is the output of the smartctl command you suggested:
https://termbin.com/0jiu
 
You did not run a (long) test according to that report.

Did you see that there were 8 uncorrected read errors?
I launched the test. Where do I see the results?
1699437545724.png
Yes, I see the 8 uncorrected read errors. What do I need to do?
 
I launched the test. Where do I see the results?
Check again with smartctl -a after 85 minutes.
Yes, I see the 8 uncorrected read errors. What do I need to do?
If the SMART long test does not succeed, then maybe return the drive to the store under warranty? Otherwise replace it with a good drive and dispose of the old one as e-waste.

If SMART claims the drive itself is fine, then maybe replace the cables or connect it to another drive controller. However, since the drive itself already recorded 8 errors (which happened before the data was sent to the controller over the cable), the problem is probably inside the drive. Maybe the drive can replace the bad sectors with spare ones. Keep an eye on it and see if it gets worse? Maybe contact the manufacturer about how to deal with this?
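
The advice above can be read as a small decision rule. The following is only a sketch of that rule, assuming the SAS/SCSI layout of smartctl -a output shown in this thread: a "read:" row whose last column is the total of uncorrected errors, and self-test log rows containing "Failed in segment". The function name smart_verdict and its three labels are made up for this example.

```shell
# Sketch only: reads the text of `smartctl -a` (SAS/SCSI layout) on stdin
# and prints a rough verdict. smart_verdict is a hypothetical helper name.
smart_verdict() {
    report=$(cat)

    # Number of self-test log entries that failed.
    failed=$(printf '%s\n' "$report" |
        awk '/Failed in segment/ { n++ } END { print n + 0 }')

    # Last column of the "read:" row = total uncorrected read errors.
    uncorrected=$(printf '%s\n' "$report" |
        awk '$1 == "read:" { v = $NF } END { print v + 0 }')

    if [ "$failed" -gt 0 ]; then
        echo "replace: $failed failed self-test(s), $uncorrected uncorrected read error(s)"
    elif [ "$uncorrected" -gt 0 ]; then
        echo "watch: $uncorrected uncorrected read error(s), no failed self-test logged"
    else
        echo "ok: no failed self-test and no uncorrected read errors logged"
    fi
}

# Usage (device path as used in this thread):
#   smartctl -a /dev/disk/by-id/scsi-35000c500c21a9a13-part1 | smart_verdict
```

Note that a drive with no self-test in its log has simply never been tested, so "watch" or "ok" here does not mean a long test would pass.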

None of this is specific to Proxmox, so you'll have the whole of the internet (and other support forums) you can ask for help.
 
Then I would replace the disk.
Hi, I don't know how, but I powered the server off and on again and the resilvering process started automatically. It finished with 0 errors, and this is now the state of ZFS:
1700553564081.png
So the problem was only logical and not a hardware fault of the disk?
Thanks
 
You should take the error messages seriously, even if things look better now. A restart can always bring an improvement. The cause could also be a cold solder joint or something else. Just because things look better now doesn't mean the problem has been solved.

Please post the current SMART values here again.
 
This is the output of: smartctl -a /dev/disk/by-id/scsi-35000c500c21a9a13-part1

1700554201289.png
Uncorrected errors are still at 8. Do I need to run the long test?
 
Please post the entire output here in code tags, including the SMART values themselves. As posted, the drive's condition can't really be assessed.
 
Here:

Code:
root@pve02:~# smartctl -a /dev/disk/by-id/scsi-35000c500c21a9a13-part1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-4-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST900MP0026
Revision:             KT39
Compliance:           SPC-4
User Capacity:        900,185,481,216 bytes [900 GB]
Logical block size:   512 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        15000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500c21a9a13
Serial number:        WAG0PH80
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Nov 21 09:13:32 2023 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 0
Power on minutes since format <not available>
Current Drive Temperature:     43 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 35200:38
Manufactured in week 28 of year 2019
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  23
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1483
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 372877709
  Blocks received from initiator = 1400384429
  Blocks read from cache and sent to initiator = 304073325
  Number of read and write commands whose size <= segment size = 924217086
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 35200.63
  number of minutes until next internal SMART test = 44

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1749055958        9         0  1749055967         17     950121.552           8
write:         0        0        24        24         24      47762.680           0
verify: 2871997389        0         0  2871997389          0     175697.373           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->     104   34892         253395618 [0x3 0x11 0x0]
# 2  Background long   Failed in segment -->     104   34870         253395618 [0x3 0x11 0x0]

Long (extended) Self-test duration: 5100 seconds [85.0 minutes]
 
Ahh, it's a SAS drive, not an SSD like I thought. The hard drive is having to correct a massive number of errors. Please take a look at the other drives; if their values differ significantly, you should replace this hard drive urgently.
 
Here are all the drives of zpool01:

Code:
root@pve02:~# smartctl -a /dev/disk/by-id/scsi-35000c500c21a9903-part1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-4-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST900MP0026
Revision:             KT39
Compliance:           SPC-4
User Capacity:        900,185,481,216 bytes [900 GB]
Logical block size:   512 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        15000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500c21a9903
Serial number:        WAG0PH8Y
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Nov 21 09:23:08 2023 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 0
Power on minutes since format <not available>
Current Drive Temperature:     43 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 35200:39
Manufactured in week 28 of year 2019
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  23
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1481
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 17926953
  Blocks received from initiator = 1012056319
  Blocks read from cache and sent to initiator = 76398904
  Number of read and write commands whose size <= segment size = 946376542
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 35200.65
  number of minutes until next internal SMART test = 34

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1331488635       18         0  1331488653         21     956633.673           3
write:         0        0         4         4          4      49786.528           0
verify: 2878565156        0         0  2878565156          0     175700.827           0

Non-medium error count:        0

No Self-tests have been logged

Code:
root@pve02:~# smartctl -a /dev/disk/by-id/scsi-35000c500c2177a77-part1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-4-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST900MP0026
Revision:             KT39
Compliance:           SPC-4
User Capacity:        900,185,481,216 bytes [900 GB]
Logical block size:   512 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        15000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500c2177a77
Serial number:        WAG0N5L8
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Nov 21 09:24:28 2023 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 2
Power on minutes since format <not available>
Current Drive Temperature:     44 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 35200:44
Manufactured in week 27 of year 2019
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  23
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1484
Elements in grown defect list: 2

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2789590099
  Blocks received from initiator = 676903217
  Blocks read from cache and sent to initiator = 95464912
  Number of read and write commands whose size <= segment size = 944710217
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 35200.73
  number of minutes until next internal SMART test = 33

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   4060806362       15         0  4060806377         17     953599.587           2
write:         0        0         5         5          5      49636.701           0
verify: 2877908889        4         0  2877908893         17     175709.411          11

Non-medium error count:        0

No Self-tests have been logged

Code:
root@pve02:~# smartctl -a /dev/disk/by-id/scsi-35000c500c21a9a13-part1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-4-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST900MP0026
Revision:             KT39
Compliance:           SPC-4
User Capacity:        900,185,481,216 bytes [900 GB]
Logical block size:   512 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        15000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500c21a9a13
Serial number:        WAG0PH80
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Nov 21 09:25:40 2023 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 0
Power on minutes since format <not available>
Current Drive Temperature:     43 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 35200:50
Manufactured in week 28 of year 2019
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  23
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1483
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 372928322
  Blocks received from initiator = 1400653111
  Blocks read from cache and sent to initiator = 304074109
  Number of read and write commands whose size <= segment size = 924218833
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 35200.83
  number of minutes until next internal SMART test = 32

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1749104563        9         0  1749104572         17     950121.577           8
write:         0        0        24        24         24      47762.820           0
verify: 2871999397        0         0  2871999397          0     175697.374           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->     104   34892         253395618 [0x3 0x11 0x0]
# 2  Background long   Failed in segment -->     104   34870         253395618 [0x3 0x11 0x0]

Long (extended) Self-test duration: 5100 seconds [85.0 minutes]

Code:
root@pve02:~# smartctl -a /dev/disk/by-id/scsi-35000c500c21a3a2f-part1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-4-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST900MP0026
Revision:             KT39
Compliance:           SPC-4
User Capacity:        900,185,481,216 bytes [900 GB]
Logical block size:   512 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        15000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500c21a3a2f
Serial number:        WAG0PL0S
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Nov 21 09:26:32 2023 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 0
Power on minutes since format <not available>
Current Drive Temperature:     40 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 35200:58
Manufactured in week 28 of year 2019
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  23
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1481
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 1363059
  Blocks received from initiator = 3167551529
  Blocks read from cache and sent to initiator = 13024261
  Number of read and write commands whose size <= segment size = 42639564
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 35200.97
  number of minutes until next internal SMART test = 31

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1706722085        0         0  1706722085          0        887.584           0
write:         0        0         2         2          2       3943.002           0
verify: 2589258053        0         0  2589258053          0     175550.382           0

Non-medium error count:        0

No Self-tests have been logged
 
That also does not look good:

Code:
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->     104   34892         253395618 [0x3 0x11 0x0]
# 2  Background long   Failed in segment -->     104   34870         253395618 [0x3 0x11 0x0]
 
Yes, but this test ran before the resilvering process, so if the number does not increase I think it's more or less OK.
 
So the total corrected value doesn't seem to be the problem. But you have uncorrected errors on almost all of the hard drives. Depending on how important the data is to you, I would replace it.
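
To compare the drives side by side, a minimal sketch like the following could pull the serial number and the "Total uncorrected errors" column out of several smartctl -a reports. It assumes the SAS/SCSI layout posted above, with the "verify:" row last in each error counter log. The function name summarize_uncorrected is hypothetical.

```shell
# Sketch only: reads one or more concatenated `smartctl -a` reports
# (SAS/SCSI layout) on stdin and prints one summary line per drive.
# summarize_uncorrected is a hypothetical helper name.
summarize_uncorrected() {
    awk '
        BEGIN { r = "?"; w = "?" }
        /^Serial number:/ { serial = $3 }
        # The last column of each row is "Total uncorrected errors".
        $1 == "read:"     { r = $NF }
        $1 == "write:"    { w = $NF }
        $1 == "verify:"   {
            printf "%s read=%s write=%s verify=%s\n", serial, r, w, $NF
            r = "?"; w = "?"
        }
    '
}

# Usage (device names as posted in this thread):
#   for d in scsi-35000c500c21a9903 scsi-35000c500c2177a77 \
#            scsi-35000c500c21a9a13 scsi-35000c500c21a3a2f; do
#       smartctl -a "/dev/disk/by-id/$d-part1"
#   done | summarize_uncorrected
```

Fed with the four reports posted above, this should print read=3 for WAG0PH8Y, read=2 and verify=11 for WAG0N5L8, read=8 for WAG0PH80, and all zeros for WAG0PL0S.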
 
Yes, but this test ran before the resilvering process, so if the number does not increase I think it's more or less OK.
This SMART long test was run internally by your disk, and it failed. I would immediately check my backups and plan to replace all drives with failed internal SMART self-tests.
 
