Drives being removed from ZFS pool when under load

aforeum

New Member
Aug 1, 2025
Hi all.

I've been trying to solve this issue for a while now and haven't managed to resolve it.

Hopefully someone may be able to offer some insight.

HARDWARE & ENVIRONMENT:

- Dell PowerEdge T420 with 8 × 3.5″ SAS/SATA back-plane.

- Dell PERC H710 D1 flashed to IT mode (LSI SAS2308), firmware 20.00.07.00.

- LSI 9211-8i (SAS2008, IT mode).

- 4 × NetApp X438_TPM3V400AMD 400 GB SAS SSDs (firmware NA00) - Drives have all passed long smartctl tests with no errors.

- Proxmox VE 8.4 (latest stable) - Kernel 6.8.12-13-pve, mpt3sas driver 43.100.00.00.

I have set up the 4 SSDs in ZFS as two mirrored pairs (striped mirrors).

The pool shows no issues while idle with no virtual machines running. However, as soon as I put any load on the SSDs (i.e., start virtual machines), drives are removed from the pool at random after approximately 3 minutes.

Here is the dmesg log (courtesy of ChatGPT):

(I kept all device_block, task abort, log_info, transport_port_remove, Synchronizing Cache, ZFS zio … error=5, and pool-suspension lines. Network / VM-bridge chatter is omitted so you can focus on the SAS events.)


| Boot time (s) | Kernel message(s) – verbatim order | Notes |
| --- | --- | --- |
| 243.092 | mpt3sas 0000:41:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM | Cosmetic, happens once after driver loads |
| 438.755 | sd 6:0:0:0: device_block, handle(0x0009) | Drive slot 7 stalls |
| 439.754 | mpt2sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a) × 3 | OPEN_REJECT / Queue Full |
| 440.254 | sd 6:0:0:0: device_unblock and setting to running, handle(0x0009) | HBA retries succeeded |
| 440.260 | zio pool=virtual-machines vdev=/dev/disk/by-id/scsi-350000396ec883258-part1 error=5 … × 5; zio … error=5 type=5 offset=0 size=0 flags=1049728 | ZFS sees I/O failures (error 5 = EIO) |
| 440.284 | sd 6:0:0:0: [sda] Synchronizing SCSI cache; [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT | HBA lost the link |
| 440.285 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec883259); enclosure logical id … slot(7) | Disk logically removed |
| 474.279 | sd 6:0:2:0: [sdc] Synchronizing SCSI cache; DID_NO_CONNECT | Slot 5 now |
| 474.280 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59) | |
| 474.784 – 480.004 | sd 6:0:3:0: attempting task abort! … three times; device_block, handle(0x000c); log_info(0x3112011a) × 4 | Slot 4 queue overflows |
| 481.261 | zio pool=virtual-machines vdev=/dev/disk/by-id/scsi-350000396ec882ba0-part1 error=5 … × 5; WARNING: Pool 'virtual-machines' has encountered an uncorrectable I/O failure and has been suspended. | ZFS suspends pool |
| 481.276 | sd 6:0:3:0: [sdd] Synchronizing SCSI cache; DID_NO_CONNECT | |
| 481.285 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1) | Slot 4 finally yanked |
| 949.633 | sd 6:0:1:0: device_block, handle(0x000a) | Fresh boot – slot 6 |
| 950.632 | log_info(0x3112011a) × 4 | Queue overflow |
| 951.913 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec8832c9) … slot(6) | Disk gone |
| 1107.884 | sd 6:0:3:0: device_block, handle(0x000c) | Slot 4 again |
| 1108.633 | log_info(0x3112011a) × 4 | |
| 1109.917 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1) … slot(4) | Disk removed |
| 1535.636 | sd 6:0:2:0: device_block, handle(0x000b) | Slot 5 |
| 1536.385 | log_info(0x3112011a) × 6 | |
| 1537.936 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59) … slot(5) | Finally removed |
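
For anyone wanting to sanity-check the log_info value rather than rely on the notes column, here is a minimal Python sketch that splits the 32-bit word into the fields the driver prints. The bit layout (bus type in the top nibble, then originator, code, sub_code) is my assumption based on how mpt3sas reports these values, and the names decode_log_info and ORIGINATORS are made up for illustration. It only reproduces the fields; it does not tell you what code 0x12 / sub_code 0x011a actually means.

```python
# Assumed layout of an mpt3sas SAS log_info word:
#   bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 sub_code.
# The originator names below are the ones the driver prints (IOP, PL, IR).
ORIGINATORS = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}


def decode_log_info(value: int) -> dict:
    """Split a 32-bit log_info value into its component fields."""
    originator = (value >> 24) & 0xF
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get(originator, f"unknown({originator:#x})"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    fields = decode_log_info(0x3112011A)
    print(
        f"originator({fields['originator']}), "
        f"code({fields['code']:#04x}), sub_code({fields['sub_code']:#06x})"
    )
```

Run against 0x3112011a, this gives the same originator, code and sub_code as the kernel line at 439.754, so at least the value itself is being read consistently across all the events.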

This issue occurs with both HBAs.

I've tried the following to troubleshoot the issue:

- Swapped PERC H710 for LSI 9211-8i (same behaviour).

- Moved card(s) to a different slot.

- Re-cabled both back-plane SAS ports.

- Tried running a dual-port configuration (both SAS cables from the backplane connected, one to each port on the HBA).


At this stage I am not sure what else to do.

For context, prior to installing Proxmox, I ran ESXi 7.0.3 for over a year with the same SSDs and the same HBA without any issues.

I'm wondering if this is a firmware issue where the HBA(s) are either too old, or perhaps the firmware on the SSDs is too old.

Does anybody have any suggestions? I'm pulling my hair out trying to make sense of this!
 
root@pve:~# zpool status -v
  pool: virtual-machines
 state: ONLINE
  scan: resilvered 46.7M in 00:00:00 with 0 errors on Fri Aug 1 20:46:38 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        virtual-machines            ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-350000396ec882ba0  ONLINE       0     0     0
            scsi-350000396ec882f58  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-350000396ec883258  ONLINE       0     0     0
            scsi-350000396ec8832c8  ONLINE       0     0     0

errors: No known data errors

root@pve:~# zpool list -v
NAME                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
virtual-machines             744G  9.21G   735G        -         -     0%     1%  1.00x    ONLINE  -
  mirror-0                   372G  4.61G   367G        -         -     0%  1.24%      -    ONLINE
    scsi-350000396ec882ba0   373G      -      -        -         -      -      -      -    ONLINE
    scsi-350000396ec882f58   373G      -      -        -         -      -      -      -    ONLINE
  mirror-1                   372G  4.60G   367G        -         -     0%  1.23%      -    ONLINE
    scsi-350000396ec883258   373G      -      -        -         -      -      -      -    ONLINE
    scsi-350000396ec8832c8   373G      -      -        -         -      -      -      -    ONLINE

root@pve:~# lsscsi
[4:0:0:0] disk ATA INTEL SSDSC2CW12 400i /dev/sde
[6:0:0:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sda
[6:0:1:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sdb
[6:0:2:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sdc
[6:0:3:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sdd

root@pve:~# smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec883258
Serial number: 3620A0F2TWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:05 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 5%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:17
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     164044.239           0
write:         0        0         0         0          0      56024.095           0
verify:        0        0         0         0          0      19792.655           0

Non-medium error count: 12721

SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Aborted (by user command) - 56593 - [- - -]

Long (extended) Self-test duration: 1800 seconds [30.0 minutes]


root@pve:~# smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec8832c8
Serial number: 3620A0FWTWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:09 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 5%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:30
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     166019.017           0
write:         0        0         0         0          0      61563.116           0
verify:        0        0         0         0          0      19792.501           0

Non-medium error count: 12693

No Self-tests have been logged


root@pve:~# smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec882f58
Serial number: 3620A09ETWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:11 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 6%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:05
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     161812.747           0
write:         0        0         0         0          0     100356.700           0
verify:        0        0         0         0          0      19792.655           0

Non-medium error count: 12690

No Self-tests have been logged


root@pve:~# smartctl -a /dev/sdd
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec882ba0
Serial number: 3620A02ETWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:13 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 6%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:15
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     162046.623           0
write:         0        0         0         0          0     100890.216           0
verify:        0        0         0         0          0      19792.197           0

Non-medium error count: 12703

No Self-tests have been logged
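
To tie the removal events back to the pool layout, here is a rough Python sketch that maps each "removed sas_addr(...)" line to a vdev and its mirror. It leans on two things read off the output above, not on any general rule: each removed sas_addr is exactly one higher than the drive's Logical Unit id, and the vdev name is "scsi-3" followed by that id. The function name removed_vdevs and the port_offset parameter are made up for illustration.

```python
import re

# Pool layout copied from the zpool status output in this post.
POOL_VDEVS = {
    "scsi-350000396ec882ba0": "mirror-0",
    "scsi-350000396ec882f58": "mirror-0",
    "scsi-350000396ec883258": "mirror-1",
    "scsi-350000396ec8832c8": "mirror-1",
}

SAS_ADDR_RE = re.compile(r"removed sas_addr\(0x([0-9a-fA-F]+)\)")


def removed_vdevs(log_lines, port_offset=1):
    """Map each port-removal line to (vdev name, mirror), in log order.

    port_offset=1 is an assumption read off this post's logs, where each
    removed sas_addr is one above the drive's Logical Unit id.
    """
    results = []
    for line in log_lines:
        match = SAS_ADDR_RE.search(line)
        if not match:
            continue
        lu_id = int(match.group(1), 16) - port_offset
        vdev = f"scsi-3{lu_id:x}"
        results.append((vdev, POOL_VDEVS.get(vdev, "not in pool")))
    return results


if __name__ == "__main__":
    # The three removals from the first boot in the dmesg summary.
    sample = [
        "mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec883259)",
        "mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59)",
        "mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1)",
    ]
    for vdev, mirror in removed_vdevs(sample):
        print(vdev, mirror)
```

For the first boot this puts the removals in mirror-1, mirror-0, mirror-0, i.e. the third removal takes out the second member of mirror-0, which lines up with the pool being suspended around 481 s.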
 