Drives being removed from ZFS pool when under load

aforeum

New Member
Aug 1, 2025
Hi all.

I've been trying to solve this issue for a while now and haven't managed to resolve it.

Hopefully someone may be able to offer some insight.

HARDWARE & ENVIRONMENT:

- Dell PowerEdge T420 with 8 × 3.5″ SAS/SATA back-plane.

- Dell PERC H710 D1 flashed to IT mode (LSI SAS2308), firmware 20.00.07.00.

- LSI 9211-8i (also SAS2308, IT mode).

- 4 × NetApp X438_TPM3V400AMD 400 GB SAS SSDs (firmware NA00) - Drives have all passed long smartctl tests with no errors.

- Proxmox VE 8.4 (latest stable) - Kernel 6.8.12-13-pve, mpt3sas driver 43.100.00.00.

I have set up the 4 SSDs in ZFS as two mirrored pairs (mirror-0 and mirror-1 in the zpool status output further down).

There are no issues with the pool while it is idle with no virtual machines running. However, as soon as I put any load on the SSDs (e.g., start virtual machines), drives are removed from the pool at random after approximately 3 minutes.

Here is the dmesg log (courtesy of ChatGPT):

(I kept all device_block, task abort, log_info, transport_port_remove, Synchronizing Cache, ZFS zio … error=5, and pool-suspension lines. Network / VM-bridge chatter is omitted so you can focus on the SAS events.)


| Boot time (s) | Kernel message(s) – verbatim order | Notes |
|---|---|---|
| 243.092 | mpt3sas 0000:41:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM | Cosmetic, happens once after driver loads |
| 438.755 | sd 6:0:0:0: device_block, handle(0x0009) | Drive slot 7 stalls |
| 439.754 | mpt2sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a) × 3 | OPEN_REJECT / Queue Full |
| 440.254 | sd 6:0:0:0: device_unblock and setting to running, handle(0x0009) | HBA retries succeeded |
| 440.260 | zio pool=virtual-machines vdev=/dev/disk/by-id/scsi-350000396ec883258-part1 error=5 … × 5 | ZFS sees I/O failures (error 5 = EIO) |
| | zio … error=5 type=5 offset=0 size=0 flags=1049728 | |
| 440.284 | sd 6:0:0:0: [sda] Synchronizing SCSI cache | HBA lost the link |
| | [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT | |
| 440.285 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec883259) | Disk logically removed |
| | enclosure logical id … slot(7) | |
| 474.279 | sd 6:0:2:0: [sdc] Synchronizing SCSI cache | Slot 5 now |
| | DID_NO_CONNECT | |
| 474.280 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59) | |
| 474.784 – 480.004 | sd 6:0:3:0: attempting task abort! … three times | Slot 4 queue overflows |
| | device_block, handle(0x000c) | |
| | log_info(0x3112011a) × 4 | |
| 481.261 | zio pool=virtual-machines vdev=/dev/disk/by-id/scsi-350000396ec882ba0-part1 error=5 … × 5 | ZFS suspends pool |
| | WARNING: Pool 'virtual-machines' has encountered an uncorrectable I/O failure and has been suspended. | |
| 481.276 | sd 6:0:3:0: [sdd] Synchronizing SCSI cache | |
| | DID_NO_CONNECT | |
| 481.285 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1) | Slot 4 finally yanked |
| 949.633 | sd 6:0:1:0: device_block, handle(0x000a) | Fresh boot – slot 6 |
| 950.632 | log_info(0x3112011a) × 4 | Queue over-flow |
| 951.913 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec8832c9) … slot(6) | Disk gone |
| 1107.884 | sd 6:0:3:0: device_block, handle(0x000c) | Slot 4 again |
| 1108.633 | log_info(0x3112011a) × 4 | |
| 1109.917 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1) … slot(4) | Disk removed |
| 1535.636 | sd 6:0:2:0: device_block, handle(0x000b) | Slot 5 |
| 1536.385 | log_info(0x3112011a) × 6 | |
| 1537.936 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59) … slot(5) | Finally removed |
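
For anyone who wants to take the repeated log_info(0x3112011a) value apart themselves, here is a minimal Python sketch. It assumes the 32-bit word is laid out the way the Linux mpt3sas driver appears to print it (sub_code in bits 0-15, code in bits 16-23, originator in bits 24-27, bus type in bits 28-31, with originator 0/1/2 meaning IOP/PL/IR). The function name `decode_log_info` is made up for illustration. It only splits the fields and does not look up what a given sub_code means.

```python
# Assumed field layout (taken from how the mpt3sas driver prints log_info):
#   bits  0-15  sub_code
#   bits 16-23  code
#   bits 24-27  originator (0 = IOP, 1 = PL, 2 = IR)
#   bits 28-31  bus type
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value: int) -> dict:
    """Split a 32-bit log_info word into its four fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    info = decode_log_info(0x3112011A)
    # Prints the same breakdown the kernel shows next to the raw value:
    # originator(PL), code(0x12), sub_code(0x011a)
    print(
        f"originator({info['originator']}), "
        f"code(0x{info['code']:02x}), sub_code(0x{info['sub_code']:04x})"
    )
```

The decoded fields match what the kernel already prints next to the raw value, so this is mainly handy when a log shows only the hex word.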

This issue occurs with both HBAs.

I've tried the following to troubleshoot the issue:

- Swapped PERC H710 for LSI 9211-8i (same behaviour).

- Moved card(s) to a different slot.

- Re-cabled both back-plane SAS ports.

- Tried running a dual-port configuration (each of the two SAS cables from the backplane plugged into its own port on the HBA).


At this stage I am not sure what else to do.

For context, prior to installing Proxmox, I was running ESXi 7.0.3 for over a year with the same SSDs and the same HBA, and there were no issues.

I'm wondering if this is a firmware issue where the HBA(s) are either too old, or perhaps the firmware on the SSDs is too old.

Does anybody have any suggestions as I'm pulling my hair out trying to make sense of this!
 
root@pve:~# zpool status -v
  pool: virtual-machines
 state: ONLINE
  scan: resilvered 46.7M in 00:00:00 with 0 errors on Fri Aug 1 20:46:38 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        virtual-machines            ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-350000396ec882ba0  ONLINE       0     0     0
            scsi-350000396ec882f58  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-350000396ec883258  ONLINE       0     0     0
            scsi-350000396ec8832c8  ONLINE       0     0     0

errors: No known data errors

root@pve:~# zpool list -v
NAME                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
virtual-machines             744G  9.21G   735G        -         -     0%     1%  1.00x  ONLINE  -
  mirror-0                   372G  4.61G   367G        -         -     0%  1.24%      -  ONLINE
    scsi-350000396ec882ba0   373G      -      -        -         -      -      -      -  ONLINE
    scsi-350000396ec882f58   373G      -      -        -         -      -      -      -  ONLINE
  mirror-1                   372G  4.60G   367G        -         -     0%  1.23%      -  ONLINE
    scsi-350000396ec883258   373G      -      -        -         -      -      -      -  ONLINE
    scsi-350000396ec8832c8   373G      -      -        -         -      -      -      -  ONLINE

root@pve:~# lsscsi
[4:0:0:0] disk ATA INTEL SSDSC2CW12 400i /dev/sde
[6:0:0:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sda
[6:0:1:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sdb
[6:0:2:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sdc
[6:0:3:0] disk NETAPP X438_TPM3V400AMD NA00 /dev/sdd

root@pve:~# smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec883258
Serial number: 3620A0F2TWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:05 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 5%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:17
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     164044.239           0
write:         0        0         0         0          0      56024.095           0
verify:        0        0         0         0          0      19792.655           0

Non-medium error count: 12721

SMART Self-test log
Num  Test              Status                     segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                                  number   (hours)
# 1  Background long   Aborted (by user command)        -     56593              - [-   -    -]

Long (extended) Self-test duration: 1800 seconds [30.0 minutes]


root@pve:~# smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec8832c8
Serial number: 3620A0FWTWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:09 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 5%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:30
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     166019.017           0
write:         0        0         0         0          0      61563.116           0
verify:        0        0         0         0          0      19792.501           0

Non-medium error count: 12693

No Self-tests have been logged


root@pve:~# smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec882f58
Serial number: 3620A09ETWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:11 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 6%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:05
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     161812.747           0
write:         0        0         0         0          0     100356.700           0
verify:        0        0         0         0          0      19792.655           0

Non-medium error count: 12690

No Self-tests have been logged


root@pve:~# smartctl -a /dev/sdd
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-13-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X438_TPM3V400AMD
Revision: NA00
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x50000396ec882ba0
Serial number: 3620A02ETWLC
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Fri Aug 1 20:49:13 2025 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 6%
Current Drive Temperature: 28 C
Drive Trip Temperature: 64 C

Accumulated power on time, hours:minutes 56775:15
Manufactured in week 09 of year 2016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     162046.623           0
write:         0        0         0         0          0     100890.216           0
verify:        0        0         0         0          0      19792.197           0

Non-medium error count: 12703

No Self-tests have been logged
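
To see which physical drive each transport_port_remove line refers to, the removed sas_addr values can be matched against the Logical Unit ids from the smartctl output. The following is a rough Python sketch, not a general tool. It relies on one assumption that happens to hold in this log: each removed port address is the drive's Logical Unit id plus 1 (for example 0x…883259 against 0x…883258). Other drive models may number their ports differently. The names `DRIVES`, `SAMPLE_LOG` and `count_removals` are made up for illustration.

```python
import re
from collections import Counter

# Logical Unit ids copied from the smartctl output above.
DRIVES = {
    0x50000396EC883258: "/dev/sda",
    0x50000396EC8832C8: "/dev/sdb",
    0x50000396EC882F58: "/dev/sdc",
    0x50000396EC882BA0: "/dev/sdd",
}

REMOVED = re.compile(r"removed sas_addr\(0x([0-9a-fA-F]+)\)")

# The six removal events from the dmesg summary above.
SAMPLE_LOG = """
mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec883259)
mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59)
mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1)
mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec8832c9)
mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1)
mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59)
"""


def count_removals(log_text: str) -> Counter:
    """Count port-removal events per drive."""
    counts = Counter()
    for match in REMOVED.finditer(log_text):
        port_addr = int(match.group(1), 16)
        # Assumption: port address == Logical Unit id + 1 (true in this log).
        device = DRIVES.get(port_addr - 1, hex(port_addr))
        counts[device] += 1
    return counts


if __name__ == "__main__":
    for device, hits in sorted(count_removals(SAMPLE_LOG).items()):
        print(device, hits)
```

On the events listed above, every one of the four drives shows up at least once, which fits the "drives at random" description rather than a single bad disk or slot.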
 
I have exactly the same issue, but with the HBA passed through to a VM. Everything is fine at idle. When I start copying files onto the drives, everything is okay for a while, and then the drives are just disconnected with the same error messages you got (minus the ZFS errors, as ZFS is handled inside the VM).

I do get these lines, in addition to what you posted:
Code:
mpt2sas_cm0: sending message unit reset !!
mpt2sas_cm0: message unit reset: SUCCESS
vfio-pci 0000:02:00.0: enabling device (0002 -> 0003)

At least this shows me that the issue has nothing to do with the VM passthrough. My next step will be testing with a different OS instead of Proxmox, I guess. Or just not using an HBA and getting a regular SATA PCIe adapter.

Small Update: I just tried copying the same files on bare-metal Unraid and it worked. No disconnecting drives. I will try Open Media Vault bare metal next, as that is also based on Debian, whereas Unraid is based on Slackware. I still don't trust my system, but it copied files for several hours without issue, whereas in Proxmox the drives disconnected after a few minutes.

Second Update: It worked fine in Open Media Vault. I will now try a new, clean Proxmox install.

Third Update: Getting the same error again. Lots of write errors in Open Media Vault (running in a VM). I reinstalled OMV bare metal, mounted the same Btrfs pool, and now it works without a hitch again. So it seems there is some issue in Proxmox, but I cannot figure out what I may have done wrong. I think I will just have to ditch Proxmox and run the NAS OS bare metal...
 
Quote:
Everything is fine in idle, when I start copying files onto the drives everything is okay for a while and then the drives are just disconnected with the same error messages you got (minus the zfs errors, as that is handled in the vm).

Exactly the same for me. It does not happen immediately after the workload increases.

Quote:
I do get these lines, in addition to what you posted:

Also the case for me.

Quote:
So it seems there is some issue in proxmox, but I cannot figure out what I may have done wrong.

I use OMV bare metal and get all of these issues.

My kernel is 6.12.57+deb13-amd64. My card is a Broadcom 9500-16i. I am using two SlimSAS SFF-8654 8i to SATA adapter cables.

The end result is a kernel panic because a CPU gets stuck in txg_sync. The ZFS pool is degraded after the restart.

Is there any update, or even a solution, to this?
 
Update from my side.

If you still have issues, I would try chatting with Copilot. I told Copilot everything I knew and it may have found the root cause. Plot twist: alongside my HBA error logs I also had SATA link resets taking place. It turned out my SATA power cabling is fragile. I am using 10 SATA Y-splitters, while a single SATA power connector is only good for about 4.5 A per voltage rail (three pins at 1.5 A each). That is not great if 9 drives hang off a single SATA connector. At least that is Copilot's opinion.
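
To put rough numbers on the power concern, here is a small Python sketch of the arithmetic. It assumes the usual SATA power connector figure of 1.5 A per pin with three pins per voltage rail, which is where the roughly 4.5 A comes from. The per-drive currents used below are placeholder values, not measurements or datasheet figures for any drive in this thread, so substitute the numbers from your own drives' datasheets.

```python
PIN_RATING_A = 1.5   # assumed per-pin rating of a SATA power connector
PINS_PER_RAIL = 3    # assumed number of pins per voltage rail


def rail_limit_a() -> float:
    """Current one SATA power connector can carry on a single rail."""
    return PIN_RATING_A * PINS_PER_RAIL


def rail_load_a(drives: int, amps_per_drive: float) -> float:
    """Total current drawn through one connector by all drives behind it."""
    return drives * amps_per_drive


def is_overloaded(drives: int, amps_per_drive: float) -> bool:
    """True if the drives behind one connector exceed its rail limit."""
    return rail_load_a(drives, amps_per_drive) > rail_limit_a()


if __name__ == "__main__":
    # Hypothetical 12 V draw per drive: 0.6 A while running, 2.0 A at spin-up.
    for label, amps in (("running", 0.6), ("spin-up", 2.0)):
        load = rail_load_a(9, amps)
        verdict = "over" if is_overloaded(9, amps) else "within"
        print(f"9 drives {label}: {load:.1f} A, {verdict} the {rail_limit_a():.1f} A limit")
```

With those placeholder figures, 9 drives behind one connector exceed the limit even while running, and spin-up makes it far worse. Real drives may draw less, so treat this as a sanity check only.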

HBA error logs (the AI decoded them with the help of https://serverfault.com/questions/593015/i-o-errors-but-no-smart-or-zfs-errors):
Code:
server kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

I/O error logs:
Code:
server kernel: I/O error, dev sdg, sector 331879352 op 0x0:(READ) flags 0x0 phys_seg 88 prio class 2
server kernel: zio pool=mstoragepool vdev=/dev/disk/by-id/ata-ST8000AS0002-1NA17Z_Z840EK4H-part1 error=5 type=1 offset=169921179648 size=700416 flags=2148533424
server kernel: sd 0:0:8:0: [sdg] tag#1401 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=3s
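
If it helps anyone reading zio lines like the one above, here is a minimal Python sketch that splits the key=value fields and turns the error number into its errno name. It assumes the message keeps the space-separated key=value form shown above and that the error value is a standard Linux errno. `parse_zio` is a made-up helper name, not part of any ZFS tooling.

```python
import errno
import os


def parse_zio(line: str) -> dict:
    """Split the key=value fields of a 'zio ...' kernel message."""
    _, _, rest = line.partition("zio ")
    fields = dict(token.split("=", 1) for token in rest.split() if "=" in token)
    code = int(fields["error"])
    # Translate the numeric error into its symbolic name and description.
    fields["errno_name"] = errno.errorcode.get(code, "unknown")
    fields["errno_text"] = os.strerror(code)
    return fields


if __name__ == "__main__":
    sample = (
        "server kernel: zio pool=mstoragepool "
        "vdev=/dev/disk/by-id/ata-ST8000AS0002-1NA17Z_Z840EK4H-part1 "
        "error=5 type=1 offset=169921179648 size=700416 flags=2148533424"
    )
    zio = parse_zio(sample)
    print(zio["pool"], zio["vdev"], zio["errno_name"], zio["errno_text"])
```

For error=5 this gives EIO, a generic I/O error. In other words, ZFS is just passing on a failure reported by the layer below it.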

Well, in the end I am glad I am using ZFS all the way down. I think 90% of other filesystems would already have been a dead end for me.