Pool constantly getting degraded for unknown reasons

justjosh

Member
Nov 4, 2019
Hi all,

I have a couple of ZFS pools in my Proxmox cluster, but on one of the pools all of the drives constantly throw errors, get degraded, and eventually fault.

I have run full long SMART tests on all the drives multiple times and they all come back with 0 errors. Endurance used is also only in the 2-3% range, so it's really odd that they would all fail at the same time with so little wear.
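(For reference, this is roughly how I kick off and read back the tests; the /dev/sd[g-j] glob is just a placeholder for the drives in that pool:)

# for d in /dev/sd[g-j]; do smartctl -t long "$d"; done

and once the long test window has passed (about 46 minutes per drive according to smartctl):

# for d in /dev/sd[g-j]; do smartctl -l selftest "$d"; done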

The drives are also connected to the same HBA as other pools, and those pools do not show the same errors, so it's unlikely to be an HBA issue either.
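In case it matters, this is roughly how I verified the HBA mapping (sdg is just the example drive from the output below, and lsscsi may need to be installed first with apt install lsscsi):

# ls -l /dev/disk/by-path/ | grep sdg
# lsscsi -v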

# smartctl -a /dev/sdg
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.3.10-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: IBM-SSG
Product: HSRX400
Revision: B1F0
Compliance: SPC-4
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: <SSD1>
Serial number: <Serial>
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue Jul 14 18:24:29 2020 +08
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 2%
Current Drive Temperature: 44 C
Drive Trip Temperature: 65 C

Manufactured in week 14 of year 2014
Specified cycle count over device lifetime: 0
Accumulated start-stop cycles: 0
Specified load-unload count over device lifetime: 0
Accumulated load-unload cycles: 0
defect list format 6 unknown
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     169693.608           0
write:         0        0         0         0          0     316006.958           0
verify:        0        0         0         0          0     205347.081           0

Non-medium error count: 0

SMART Self-test log
Num  Test              Status      segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                   number   (hours)
# 1  Background long   Completed        -      38273              - [-   -    -]
# 2  Background long   Completed        -      35567              - [-   -    -]
# 3  Background short  Completed        -      35567              - [-   -    -]
# 4  Background short  Completed        -      33678              - [-   -    -]

Long (extended) Self-test duration: 2774 seconds [46.2 minutes]

# cat /var/log/syslog.1 | grep <SSD1>
Jul 13 10:20:18 proxmox kernel: [50261.572894] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959925248 size=12288 flags=40080c80
Jul 13 10:20:18 proxmox kernel: [50261.586896] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=188881
Jul 13 10:20:18 proxmox zed: eid=16 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:18 proxmox zed: eid=17 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:18 proxmox zed: eid=18 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:20 proxmox zed: eid=28 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:28 proxmox kernel: [50271.837919] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=1 offset=386959925248 size=40960 flags=40080ca8
Jul 13 10:20:28 proxmox kernel: [50271.849934] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=1808aa
Jul 13 10:20:28 proxmox kernel: [50271.873657] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=1808ae
Jul 13 10:20:29 proxmox zed: eid=32 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=33 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=34 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=35 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=36 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:29 proxmox zed: eid=37 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:33 proxmox zed: eid=59 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:33 proxmox kernel: [50276.957280] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=1 offset=386959925248 size=40960 flags=40080ca8
Jul 13 10:20:33 proxmox kernel: [50276.971565] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=1808aa
Jul 13 10:20:33 proxmox kernel: [50276.993940] zio pool=SSD vdev=/dev/disk/by-id/<SSD1>-part1 error=121 type=2 offset=386959929344 size=8192 flags=1808ae
Jul 13 10:20:35 proxmox zed: eid=70 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:35 proxmox zed: eid=71 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:35 proxmox zed: eid=72 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:35 proxmox zed: eid=73 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:35 proxmox zed: eid=74 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:37 proxmox zed: eid=84 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:39 proxmox zed: eid=97 class=io pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1
Jul 13 10:20:44 proxmox zed: eid=106 class=statechange pool_guid=0xFA89EB75B38174A9 vdev_path=/dev/disk/by-id/<SSD1>-part1 vdev_state=FAULTED
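For what it's worth, the error=121 in those zio lines is a plain Linux errno. A quick way to decode it on the node (assuming python3 is available there):

# python3 -c 'import errno, os; print(errno.errorcode[121], os.strerror(121))'

It decodes to EREMOTEIO ("Remote I/O error"), which at least suggests the kernel is passing up a transport-level I/O failure for those requests rather than ZFS generating the error itself.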
 
What about the errors reported by zpool status?
The errors from zpool status show that a container disk file is corrupted, but the CTs still start fine:
errors: Permanent errors have been detected in the following files:

        SSD/subvol-102-disk-1:<0x0>
Also, earlier it was subvol-104 and now it's subvol-102. I am unable to reboot just the CT; I have to reboot the entire node to clear the errors. The Proxmox host's syslog is also flooded with:

blk_update_request: critical target error, dev sdg, sector 755783152 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
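In case anyone wonders how I clear the state in the meantime (pool name SSD as in the logs above), this is roughly what I run; note it only resets the error counters and triggers a re-check, it doesn't fix whatever is causing the errors:

# zpool clear SSD
# zpool scrub SSD
# zpool status -v SSD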
 
