Pool constantly getting degraded for unknown reasons

justjosh

Member
Nov 4, 2019
Hi all,

I have a couple of ZFS pools in my Proxmox cluster, but on one of them every drive constantly throws I/O errors and gets marked degraded and then faulted.

I have run full long SMART tests on all the drives multiple times and they all complete with 0 errors. Endurance used is also only in the 2-3% range, so it's odd that they would all fail at the same time with so little wear.

The drives are connected to the same HBA as the other pools, and those pools do not show these errors, so an HBA issue seems unlikely too.

# smartctl -a /dev/sdg
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.3.10-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: IBM-SSG
Product: HSRX400
Revision: B1F0
Compliance: SPC-4
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: <SSD1>
Serial number: <Serial>
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue Jul 14 18:24:29 2020 +08
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 2%
Current Drive Temperature: 44 C
Drive Trip Temperature: 65 C

Manufactured in week 14 of year 2014
Specified cycle count over device lifetime: 0
Accumulated start-stop cycles: 0
Specified load-unload count over device lifetime: 0
Accumulated load-unload cycles: 0
defect list format 6 unknown
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     169693.608           0
write:         0        0         0         0          0     316006.958           0
verify:        0        0         0         0          0     205347.081           0

Non-medium error count: 0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -     38273               - [-   -    -]
# 2  Background long   Completed                   -     35567               - [-   -    -]
# 3  Background short  Completed                   -     35567               - [-   -    -]
# 4  Background short  Completed                   -     33678               - [-   -    -]

Long (extended) Self-test duration: 2774 seconds [46.2 minutes]

# cat /var/log/syslog.1 | grep <SSD1>
Jul 13 10:20:18 proxmox kernel: [50261.572894] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959925248 size=12288 flags=40080c80
Jul 13 10:20:18 proxmox kernel: [50261.586896] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=188881
Jul 13 10:20:18 proxmox zed: eid=16 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:18 proxmox zed: eid=17 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:18 proxmox zed: eid=18 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:20 proxmox zed: eid=28 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:28 proxmox kernel: [50271.837919] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=1 offset=386959925248 size=40960 flags=40080ca8
Jul 13 10:20:28 proxmox kernel: [50271.849934] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=1808aa
Jul 13 10:20:28 proxmox kernel: [50271.873657] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=1808ae
Jul 13 10:20:29 proxmox zed: eid=32 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=33 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=34 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=35 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=36 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=37 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:33 proxmox zed: eid=59 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:33 proxmox kernel: [50276.957280] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=1 offset=386959925248 size=40960 flags=40080ca8
Jul 13 10:20:33 proxmox kernel: [50276.971565] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=1808aa
Jul 13 10:20:33 proxmox kernel: [50276.993940] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=1808ae
Jul 13 10:20:35 proxmox zed: eid=70 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:35 proxmox zed: eid=71 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:35 proxmox zed: eid=72 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:35 proxmox zed: eid=73 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:35 proxmox zed: eid=74 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:37 proxmox zed: eid=84 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:39 proxmox zed: eid=97 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:44 proxmox zed: eid=106 class=statechange pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1 vdev_state=FAULTED
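For reference, the error=121 in those zio lines is a plain Linux errno value, which can be decoded with a quick lookup (a generic errno check, nothing ZFS-specific):

```python
# Decode the errno reported in the ZFS zio log lines (error=121).
import errno
import os

code = 121
print(errno.errorcode[code], "-", os.strerror(code))  # on Linux: EREMOTEIO - Remote I/O error
```

EREMOTEIO generally means the command was rejected somewhere on the transport path (target, expander, or HBA link) rather than failing on the media itself, which would be consistent with SMART showing zero medium errors on the drives.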
 

justjosh

> What about the errors reported by zpool status?
The errors from zpool status show that the container file is corrupted but the CTs start fine.
errors: Permanent errors have been detected in the following files:

SSD/subvol-102-disk-1:<0x0>
Also, earlier the affected subvol was 104 and now it's 102. I'm unable to reboot the CT; I have to reboot the entire node to clear the errors. The Proxmox host is also flooded with:

blk_update_request: critical target error, dev sdg, sector 755783152 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
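As a sanity check, the sector in the blk_update_request line points at the same region as the zio offsets in the earlier post, if one assumes 512-byte logical sectors and the ZFS partition starting at the usual sector 2048 (an assumption; verify the actual start with fdisk -l):

```python
# Relate the block-layer sector to the partition-relative zio offset.
# Sketch only: assumes 512-byte logical sectors and a ZFS partition
# starting at sector 2048 (the common default; confirm with fdisk -l).
SECTOR_SIZE = 512
PART_START_SECTOR = 2048

failing_sector = 755783152   # from the blk_update_request line
zio_offset = 386959925248    # from the zio error lines (partition-relative)

disk_offset = failing_sector * SECTOR_SIZE
part_offset = disk_offset - PART_START_SECTOR * SECTOR_SIZE
print(part_offset == zio_offset)  # True under these assumptions
```

Under those assumptions the block-layer error and the ZFS errors are the same failing I/O seen at two layers, not two separate problems.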
 
