Drives being removed from ZFS pool when under load

aforeum · Aug 1, 2025

Hi all.

I've been trying to solve this issue for a while now and haven't managed to resolve it.

Hopefully someone may be able to offer some insight.

HARDWARE & ENVIRONMENT:

- Dell PowerEdge T420 with 8 × 3.5″ SAS/SATA back-plane.

- Dell PERC H710 D1 flashed to IT mode (LSI SAS2308), firmware 20.00.07.00.

- LSI 9211-8i (SAS2008, IT mode).

- 4 × NetApp X438_TPM3V400AMD 400 GB SAS SSDs (firmware NA00) - Drives have all passed long smartctl tests with no errors.

- Proxmox VE 8.4 (latest stable) - Kernel 6.8.12-13-pve, mpt3sas driver 43.100.00.00.

I have set up the 4 SSDs in a ZFS pool of two mirrored pairs.

There are no issues with the pool while it is idle with no virtual machines running. However, as soon as I put any load on the SSDs (i.e., start virtual machines), drives are removed from the pool at random after approximately 3 minutes.

Here is the DMESG log (courtesy of ChatGPT):

(I kept all `device_block`, `task abort`, `log_info`, `transport_port_remove`, `Synchronizing Cache`, ZFS `zio … error=5`, and pool-suspension lines. Network / VM-bridge chatter is omitted so you can focus on the SAS events.)

Boot time (s)	Kernel message(s) – verbatim order	Notes
243.092	`mpt3sas 0000:41:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM`	Cosmetic, happens once after driver loads
438.755	`sd 6:0:0:0: device_block, handle(0x0009)`	Drive slot 7 stalls
439.754	`mpt2sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)` × 3	OPEN_REJECT / Queue Full
440.254	`sd 6:0:0:0: device_unblock and setting to running, handle(0x0009)`	HBA retries succeeded
440.260	`zio pool=virtual-machines vdev=/dev/disk/by-id/scsi-350000396ec883258-part1 error=5 …` × 5 `zio … error=5 type=5 offset=0 size=0 flags=1049728`	ZFS sees I/O failures (error 5 = EIO)
440.284	`sd 6:0:0:0: [sda] Synchronizing SCSI cache` `[sda] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT`	HBA lost the link
440.285	`mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec883259)` `enclosure logical id … slot(7)`	Disk logically removed
474.279	`sd 6:0:2:0: [sdc] Synchronizing SCSI cache` → `DID_NO_CONNECT`	Slot 5 now
474.280	`mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59)`	—
474.784 – 480.004	`sd 6:0:3:0: attempting task abort! …` three times `device_block, handle(0x000c)` `log_info(0x3112011a)` × 4	Slot 4 queue overflows
481.261	`zio pool=virtual-machines vdev=/dev/disk/by-id/scsi-350000396ec882ba0-part1 error=5 …` × 5 `WARNING: Pool 'virtual-machines' has encountered an uncorrectable I/O failure and has been suspended.`	ZFS suspends pool
481.276	`sd 6:0:3:0: [sdd] Synchronizing SCSI cache` → `DID_NO_CONNECT`	—
481.285	`mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1)`	Slot 4 finally yanked
949.633	`sd 6:0:1:0: device_block, handle(0x000a)`	Fresh boot – slot 6
950.632	`log_info(0x3112011a)` × 4	Queue over-flow
951.913	`mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec8832c9) … slot(6)`	Disk gone
1107.884	`sd 6:0:3:0: device_block, handle(0x000c)`	Slot 4 again
1108.633	`log_info(0x3112011a)` × 4	—
1109.917	`mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1) … slot(4)`	Disk removed
1535.636	`sd 6:0:2:0: device_block, handle(0x000b)`	Slot 5
1536.385	`log_info(0x3112011a)` × 6	—
1537.936	`mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59) … slot(5)`	Finally removed
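The `log_info(0x3112011a)` value that recurs in the table can be unpacked by hand. The sketch below assumes the bit layout used by the Linux mpt3sas driver's own decoder (bus type in the top 4 bits, then originator, code and sub-code); the function name and the originator table are illustrative only. It reproduces the `originator(PL), code(0x12), sub_code(0x011a)` text printed by the kernel, but it does not by itself confirm the "OPEN_REJECT / Queue Full" reading in the Notes column.

```python
# Assumed layout of the 32-bit log_info word (as decoded by mpt3sas):
#   bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 sub-code.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value: int) -> dict:
    """Split a log_info word into its four fields."""
    return {
        "bus_type": (value >> 28) & 0xF,  # 3 means SAS
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


fields = decode_log_info(0x3112011A)
print(
    f"originator({fields['originator']}), "
    f"code(0x{fields['code']:02x}), "
    f"sub_code(0x{fields['sub_code']:04x})"
)
```

Running it prints `originator(PL), code(0x12), sub_code(0x011a)`, matching the kernel line at 439.754.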

This issue occurs with both HBAs.

I've tried the following to troubleshoot the issue:

- Swapped PERC H710 for LSI 9211-8i (same behaviour).

- Moved card(s) to a different slot.

- Re-cabled both back-plane SAS ports.

- Tried a dual-port configuration (both SAS cables from the backplane connected, one to each port on the HBA).

At this stage I am not sure what else to do.

For context, prior to installing Proxmox, I was running ESXi 7.0.3 for over a year with the same SSDs and the same HBA, and there were no issues.

I'm wondering if this is a firmware issue where the HBA(s) are either too old, or perhaps the firmware on the SSDs is too old.

Does anybody have any suggestions? I'm pulling my hair out trying to make sense of this!

news · Aug 1, 2025

Please show us
zpool status -v
zpool list -v
smartctl -a <your-device-numbers>
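If it helps, the three requested commands can be collected in one pass. This is only a rough sketch: `build_commands` and `run_all` are made-up helper names, and the device list passed in is a placeholder to be replaced with the actual device nodes.

```python
import shutil
import subprocess


def build_commands(devices):
    """Return the requested diagnostics as argument lists."""
    cmds = [["zpool", "status", "-v"], ["zpool", "list", "-v"]]
    cmds += [["smartctl", "-a", dev] for dev in devices]
    return cmds


def run_all(devices):
    """Run each command and print its output, skipping missing tools."""
    for cmd in build_commands(devices):
        print("$ " + " ".join(cmd))
        if shutil.which(cmd[0]) is None:
            print(f"({cmd[0]} not found on this machine)")
            continue
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout or result.stderr)
```

Calling `run_all(["/dev/sda", "/dev/sdb"])` as root would print each command followed by its output, ready to paste into a code block.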

aforeum · Aug 1, 2025

root@pve:~# zpool status -v
  pool: virtual-machines
 state: ONLINE
  scan: resilvered 46.7M in 00:00:00 with 0 errors on Fri Aug 1 20:46:38 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        virtual-machines            ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-350000396ec882ba0  ONLINE       0     0     0
            scsi-350000396ec882f58  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-350000396ec883258  ONLINE       0     0     0
            scsi-350000396ec8832c8  ONLINE       0     0     0

errors: No known data errors

root@pve:~# zpool list -v
NAME                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
virtual-machines             744G  9.21G   735G        -         -     0%     1%  1.00x  ONLINE  -
  mirror-0                   372G  4.61G   367G        -         -     0%  1.24%      -  ONLINE
    scsi-350000396ec882ba0   373G      -      -        -         -      -      -      -  ONLINE
    scsi-350000396ec882f58   373G      -      -        -         -      -      -      -  ONLINE
  mirror-1                   372G  4.60G   367G        -         -     0%  1.23%      -  ONLINE
    scsi-350000396ec883258   373G      -      -        -         -      -      -      -  ONLINE
    scsi-350000396ec8832c8   373G      -      -        -         -      -      -      -  ONLINE

root@pve:~# lsscsi
[4:0:0:0] disk ATA INTEL SSDSC2CW12 400i /dev/sde
[6:0:0:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sda
[6:0:1:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sdb
[6:0:2:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sdc
[6:0:3:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sdd

root@pve:~# smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec883258
Serial number: 3620A0F2TWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:05 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 5%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:17
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     164044.239           0
write:         0        0         0         0          0      56024.095           0
verify:        0        0         0         0          0      19792.655           0

Non-medium error count: 12721

SMART Self-test log
Num  Test              Status                     segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                  number   (hours)
# 1  Background long   Aborted (by user command)        -     56593              - [-   -    -]

Long (extended) Self-test duration: 1800 seconds [30.0 minutes]

root@pve:~# smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec8832c8
Serial number: 3620A0FWTWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:09 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 5%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:30
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     166019.017           0
write:         0        0         0         0          0      61563.116           0
verify:        0        0         0         0          0      19792.501           0

Non-medium error count: 12693

No Self-tests have been logged

root@pve:~# smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec882f58
Serial number: 3620A09ETWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:11 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 6%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:05
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     161812.747           0
write:         0        0         0         0          0     100356.700           0
verify:        0        0         0         0          0      19792.655           0

Non-medium error count: 12690

No Self-tests have been logged

root@pve:~# smartctl -a /dev/sdd
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec882ba0
Serial number: 3620A02ETWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:13 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 6%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:15
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     162046.623           0
write:         0        0         0         0          0     100890.216           0
verify:        0        0         0         0          0      19792.197           0

Non-medium error count: 12703

No Self-tests have been logged
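To tie the removed `sas_addr` values from the dmesg table back to pool members, the Logical Unit ids in the smartctl output above can be compared with them. In this thread's data each removed port address is exactly the drive's Logical Unit id plus one, and the pool member name is `scsi-3` followed by the Logical Unit id. The sketch below relies on that observation only; the helper names and the `max_offset` window of 2 (to allow for a second port on a dual-port SAS drive) are assumptions, not a documented rule.

```python
# Logical Unit ids copied from the four smartctl outputs, keyed by device node.
LU_IDS = {
    "sda": 0x50000396EC883258,
    "sdb": 0x50000396EC8832C8,
    "sdc": 0x50000396EC882F58,
    "sdd": 0x50000396EC882BA0,
}

# sas_addr values from the mpt3sas_transport_port_remove lines in the dmesg table.
REMOVED_SAS_ADDRS = [
    0x50000396EC883259,
    0x50000396EC882F59,
    0x50000396EC882BA1,
    0x50000396EC8832C9,
]


def pool_member(lu_id: int) -> str:
    """Build the by-id name that zpool status shows for a Logical Unit id."""
    return f"scsi-3{lu_id:016x}"


def match_removed(sas_addr: int, lu_ids: dict, max_offset: int = 2):
    """Return the device whose LU id sits just below sas_addr, else None."""
    for dev, lu_id in lu_ids.items():
        if 0 < sas_addr - lu_id <= max_offset:
            return dev
    return None


for addr in REMOVED_SAS_ADDRS:
    dev = match_removed(addr, LU_IDS)
    member = pool_member(LU_IDS[dev]) if dev else "no match"
    print(f"0x{addr:016x} -> {dev} ({member})")
```

For this data every removed address maps to one of the four pool members, so all four drives (both sides of both mirrors) were dropped at some point.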

Impact · Aug 1, 2025

Please use code blocks.

sturek · Aug 7, 2025

I have exactly the same issue, but with the HBA passed through to a VM. Everything is fine at idle. When I start copying files onto the drives, everything is okay for a while, and then the drives are just disconnected with the same error messages you got (minus the ZFS errors, as that is handled in the VM).

I do get these lines, in addition to what you posted:

Code:

mpt2sas_cm0: sending message unit reset !!
mpt2sas_cm0: message unit reset: SUCCESS
vfio-pci 0000:02:00.0: enabling device (002 -> 003)

At least this shows me that the issue has nothing to do with the VM passthrough. My next step will be testing with a different OS instead of Proxmox, I guess. Or just not using an HBA and getting a regular SATA PCIe adapter.

Small Update: I just tried copying the same files on bare-metal Unraid and it worked. No disconnecting drives. I will try Open Media Vault on bare metal next, as that is also based on Debian. Unraid is based on Slackware. I still don't trust my system, but it copied files for several hours without issue, while the drives disconnected after a few minutes in Proxmox.

Second Update: It worked fine in Open Media Vault. I will now try to just do a new, clean Proxmox install.

Third Update: Getting the same error again. Lots of write errors in Open Media Vault (running in a VM). Reinstalled OMV on bare metal, mounted the same Btrfs pool, and now it works without a hitch again. So it seems there is some issue in Proxmox, but I cannot figure out what I may have done wrong. I think I will just have to ditch Proxmox and run the NAS OS on bare metal...
