ZFS degraded state

manfer

Hi,

I have a cluster with two machines, pve1 and pve2. The pve2 machine is only used to hold replicas of the virtual machines from pve1; it never runs any virtual machines itself.

Yesterday I received this message about the ZFS status on pve2:

ZFS has finished a scrub:

eid: 19814
class: scrub_finish
host: pve2
time: 2022-02-13 00:54:35+0100
pool: rpool
state: DEGRADED
status: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
scan: scrub repaired 1M in 0 days 00:30:33 with 0 errors on Sun Feb 13 00:54:35 2022
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             DEGRADED     0     0     0
          raidz2-0                        DEGRADED     0     0     0
            scsi-35000c5005a4355e3-part3  FAULTED    257     0     0  too many errors
            scsi-35000c5005459c80b-part3  ONLINE       0     0     0
            scsi-35000c50054e48e43-part3  ONLINE       0     0     0
            scsi-35000c5005a752c7f-part3  ONLINE       0     0     0

errors: No known data errors

What should I take from this? That one of the disks on the pve2 machine I use for replicas has many read errors and needs to be replaced? And if I have to replace it, what do I have to take into account when choosing the replacement hardware? How do I pick the correct drive for the replacement? Same capacity? What should I look at in the device specs?

Is it normal that the pve2 machine, which I would expect to have many more writes than reads, shows this kind of failure?

Thanks.
 
Are there any tests I can run to be sure a drive is faulty? Something to do before buying a replacement and before going into the datacenter to physically inspect the server?

I ran smartctl, with these results:

root@pve2:~# smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST9300605SS
Revision: CS08
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Rotation Rate: 10000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c5005a4355e3
Serial number: 6XP4HBH1
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Feb 14 23:55:19 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 30 C
Drive Trip Temperature: 68 C

Manufactured in week 43 of year 2012
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 95
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 192
Elements in grown defect list: 28

Vendor (Seagate Cache) information
Blocks sent to initiator = 2045421711
Blocks received from initiator = 2396576030
Blocks read from cache and sent to initiator = 19379193
Number of read and write commands whose size <= segment size = 1241405539
Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 22577.73
number of minutes until next internal SMART test = 43

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   519048549        0         0  519048549          1       5209.079           1
write:          0        0         1          1          4      19117.481           0
verify:      2924        0         0       2924          0          0.000           0

Non-medium error count: 23

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  32      13                 - [-   -    -]
# 2  Background long   Completed                  32       7                 - [-   -    -]
# 3  Background short  Completed                  32       6                 - [-   -    -]

Long (extended) Self-test duration: 2520 seconds [42.0 minutes]

root@pve2:~# smartctl -a /dev/sdb
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST9300605SS
Revision: CS08
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Rotation Rate: 10000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c5005459c80b
Serial number: 6XP3F168
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Feb 14 23:57:32 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 31 C
Drive Trip Temperature: 68 C

Manufactured in week 23 of year 2012
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 89
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 91
Elements in grown defect list: 1

Vendor (Seagate Cache) information
Blocks sent to initiator = 2390102766
Blocks received from initiator = 2520624312
Blocks read from cache and sent to initiator = 22967766
Number of read and write commands whose size <= segment size = 1244449842
Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 22576.33
number of minutes until next internal SMART test = 11

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:  3980392432        0         0  3980392432          0       5089.203           0
write:          0        0         0           0          0      19174.516           0
verify:      3054        0         0        3054          0          0.000           0

Non-medium error count: 14

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  32      12                 - [-   -    -]
# 2  Background long   Completed                  32       5                 - [-   -    -]
# 3  Background short  Completed                  32       5                 - [-   -    -]

Long (extended) Self-test duration: 2520 seconds [42.0 minutes]

root@pve2:~# smartctl -a /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST9300605SS
Revision: CS08
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Rotation Rate: 10000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c50054e48e43
Serial number: 6XP3P20R
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Feb 14 23:58:02 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 31 C
Drive Trip Temperature: 68 C

Manufactured in week 28 of year 2012
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 73
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 94
Elements in grown defect list: 32

Vendor (Seagate Cache) information
Blocks sent to initiator = 2388699838
Blocks received from initiator = 2507310492
Blocks read from cache and sent to initiator = 21681451
Number of read and write commands whose size <= segment size = 1244475991
Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 22576.38
number of minutes until next internal SMART test = 2

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:  1287129487        0         0  1287129487          0       5041.240           0
write:          0        0         0           0          0      19178.313           0
verify:      3092        0         0        3092          0          0.000           0

Non-medium error count: 164

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  32      12                 - [-   -    -]
# 2  Background long   Completed                  32       6                 - [-   -    -]
# 3  Background short  Completed                  32       5                 - [-   -    -]

Long (extended) Self-test duration: 2520 seconds [42.0 minutes]

root@pve2:~# smartctl -a /dev/sdd
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST9300605SS
Revision: CS08
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Rotation Rate: 10000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c5005a752c7f
Serial number: 6XP4LAXM
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Feb 14 23:58:10 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 32 C
Drive Trip Temperature: 68 C

Manufactured in week 44 of year 2012
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 70
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 71
Elements in grown defect list: 695

Vendor (Seagate Cache) information
Blocks sent to initiator = 1254591886
Blocks received from initiator = 2575572870
Blocks read from cache and sent to initiator = 21872024
Number of read and write commands whose size <= segment size = 1259301431
Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 22573.28
number of minutes until next internal SMART test = 23

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:  4221511630        1         0  4221511631          1       4980.268           0
write:          0        0         0           0          0      19640.732           0
verify:      3024        0         0        3024          0          0.000           0

Non-medium error count: 805

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  32       9                 - [-   -    -]
# 2  Background long   Completed                  32       3                 - [-   -    -]
# 3  Background short  Completed                  32       2                 - [-   -    -]

Long (extended) Self-test duration: 2520 seconds [42.0 minutes]
 
Check dmesg and smartctl --all for CRC errors, which are quite often related to cables or other issues, not the drive itself. If your long SMART test succeeds without error, that would be another hint in that direction. I had quite a number of UDMA CRC errors due to faulty (and too long) SATA cable connections. After testing all connections, either restart (ZFS will then do a resilver) or use zpool clear to tell ZFS that the drive can be trusted again. (The first step before any of this should be having reliable backups.)
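
In commands, those checks could look roughly like this (assuming the suspect disk is /dev/sda and the pool is rpool, as in your output):

Code:
# Start a long SMART self-test (runs inside the drive, ~42 min on these disks)
smartctl -t long /dev/sda
# Check the result once it has finished
smartctl -l selftest /dev/sda
# Look for medium/CRC/transport errors in the kernel log
dmesg | grep -iE 'crc|i/o error|medium error'
# Full SMART dump, including the error counter log
smartctl --all /dev/sda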
 
This is the result of a long self-test:

Code:
root@pve2:~# smartctl -l selftest /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                  32   22601                 - [-   -    -]
# 2  Background short  Completed                  32      13                 - [-   -    -]
# 3  Background long   Completed                  32       7                 - [-   -    -]
# 4  Background short  Completed                  32       6                 - [-   -    -]

Long (extended) Self-test duration: 2520 seconds [42.0 minutes]
 
And with dmesg I get this:
Code:
root@pve2:~# dmesg | grep error
[   13.876451] ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
[   13.876521] ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
[5406317.728335] sd 2:0:0:0: [sda] tag#710 Add. Sense: Unrecovered read error
[5406317.728346] blk_update_request: critical medium error, dev sda, sector 78579461 op 0x0:(READ) flags 0x700 phys_seg 7 prio class 0
[5406317.728423] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5005a4355e3-part3 error=61 type=1 offset=39694110720 size=1048576 flags=40080cb0
 
You won't be able to identify this issue with software alone. Try replacing parts and narrow down the cause:

1. Switch drives to different slots, if you have a backplane. Same errors? Then it is likely not a drive and not the backplane. Error stays with the same drive(s)? Then it is probably the drive (see the command sketch below for matching a vdev id to a physical disk).
2. Replace SATA cables and SFF-8088/8087 PCI brackets, if you have any.
3. Switch connections between HBA ports.
4. Add a fan to cool your HBA.
5. Try different power supplies.

All of this can cost money. I just added a comment on HN saying that I bought $400 worth of SATA cables to narrow down the cause of CRC errors in my case.
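
To tell whether an error follows a drive after swapping slots, it helps to note which physical disk sits behind each vdev name. A quick way to map them (using the ids from your zpool status as an example):

Code:
# The vdev names in zpool status are stable by-id names; see which sdX the faulted one maps to right now
ls -l /dev/disk/by-id/ | grep 35000c5005a4355e3
# Compare the serial number reported by smartctl with the label printed on the drive
smartctl -i /dev/sda | grep -i serial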
 
Thanks.

But would it be possible to confirm the error, to be sure it was not a one-time error caused by some temporary issue? Maybe with the scrub you mention in your comment?

Or do the current details already indicate a faulty drive for sure?

The machine is not running anything. It only stores replicas of the virtual machines from the production machine, so it can take over if there is a hardware error on that machine. I would like to confirm the error before taking further action, or do whatever is needed to keep working with the current drives.

I'm not totally sure what I would need to do if I want to continue using the current drives. Would just restarting be fine?
"After testing all connections, either restart (ZFS will then do a resilver) or use zpool clear to tell ZFS that the drive can be trusted again."

If I just restart for now without testing the cables, would that be too risky? I mean, assuming there is some check that can confirm the error and it then shows no error.
 
If you restart, ZFS will attempt an automatic resilver. If (by chance) no errors are found, your zpool will be labeled as fine again. This does not necessarily mean that everything is OK - only if you run a scrub, and it finishes without errors, can you be (pretty) sure that you do not have a hardware issue. You can do the same by clearing the error and then running a scrub, without restarting. It is impossible to say whether this is risky or not without knowing the cause.
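
The clear-then-scrub route would look roughly like this (assuming the pool name rpool from this thread):

Code:
zpool clear rpool       # reset the error counters and the FAULTED state
zpool scrub rpool       # re-read every block and verify it against its checksum
zpool status rpool      # check progress, and check again once the scrub has finished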
 
Like Helmut already said, I would at least shuffle the drives around. ZFS doesn't care which port a disk is attached to, so it won't cost you anything, and if you get a similar error with the same drive on another port, then it is more probable that the drive is the problem and not the cable/HBA/backplane.
 
I restarted the machine and this is the result after the resilver finished:

Code:
root@pve2:~# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 26.6G in 0 days 01:24:22 with 0 errors on Sat Feb 19 23:41:12 2022
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            scsi-35000c5005a4355e3-part3  ONLINE       0     0     7
            scsi-35000c5005459c80b-part3  ONLINE       0     0     0
            scsi-35000c50054e48e43-part3  ONLINE       0     0     0
            scsi-35000c5005a752c7f-part3  ONLINE       0     0     0

errors: No known data errors

Then I ran a scrub, with this result:

Code:
root@pve2:~# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 0 days 00:30:34 with 0 errors on Sun Feb 20 02:34:55 2022
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            scsi-35000c5005a4355e3-part3  ONLINE       0     0     7
            scsi-35000c5005459c80b-part3  ONLINE       0     0     0
            scsi-35000c50054e48e43-part3  ONLINE       0     0     0
            scsi-35000c5005a752c7f-part3  ONLINE       0     0     0

errors: No known data errors

What would you deduce from these results?

Thanks.
 
If you still get checksum errors after a scrub, your pool has unrepairable errors, so the corrupted data is lost (which in theory should mean that the data is corrupted on 3 of the 4 drives, because otherwise a raidz2 should be able to repair it). Did you maybe encounter a power outage or kernel crash?
 
Dunuin is correct, your checksum errors are troublesome. It is possible that the drive or file system may have automatically stopped using the bad sectors, but you should not rely on that disk if it is the problem.

Did you move the drive from drive bay/cable 3 to another one? I've also witnessed bad cables, inconsistent power (if individually powered), and bad backplanes. In ZFS the drive will probably still be listed as the third disk, but I'm not certain. Try changing the physical interface and running a scrub; either way, the scrub will show a different drive with errors if the problem is cabling or backplane related.
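
If the errors do end up following this drive and you decide to replace it, the rough outline for a ZFS root pool like yours would be something like the sketch below (the names /dev/sdX and scsi-NEWDISK are just placeholders for the new disk; the exact bootloader step depends on your PVE version, so check the Proxmox admin guide):

Code:
# Copy the partition layout from a healthy member (sdb here) to the new disk, then randomize its GUIDs
sgdisk /dev/sdb -R /dev/sdX
sgdisk -G /dev/sdX
# Replace the faulted partition-3 vdev with the matching partition on the new disk
zpool replace rpool scsi-35000c5005a4355e3-part3 /dev/disk/by-id/scsi-NEWDISK-part3
zpool status rpool    # wait for the resilver to finish
# Finally, make the new disk bootable again (grub-install / proxmox-boot-tool, per the PVE docs)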

These types of issues are very important to resolve to avoid data loss and to avoid wasting money on drives if the issue is a different component. Good luck.


Tmanok
 
