HPE P816i-a SR Gen 10 - SCSI Resets and ZFS failures

jtheisen

Hi everyone,

I'm seeing some odd behaviour on a PBS installation running on an HPE DL380 Gen10.
Since the upgrade to PBS 4, the system sporadically "spits out" one of its 10 SAS disks.
Interestingly, it's a different disk every time, and usually just one at a time.
The affected disk gets marked as failed in its ZFS mirror vdev and is replaced by the hot spare.
After a cold reboot of the server, the drive comes back, gets resilvered, and continues to work fine.

I've had several of these events over the past few weeks, always with a different disk, so I can rule out a failure of a single drive.
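
For completeness, getting back to normal after each event is roughly the following (a sketch based on the pool layout shown further down, not my exact command history):
Code:
# after the reboot the disk is healthy again, so clear the error counters
zpool clear kg-pbs1-hdd
# once the resilver has finished, detach the hot spare so it returns to the spares list
# (wwn-0x5000c500f29485fb is the spare visible in the zpool status below)
zpool detach kg-pbs1-hdd wwn-0x5000c500f29485fb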

journalctl output from one of the reset events:
Code:
Jan 31 21:05:05 kg-pbs1 proxmox-backup-proxy[1963]: Upload backup log to datastore 'kg-pbs1-hdd', namespace 'KG' vm/115/2026-01-31T20:03:26Z/client.log.blob
Jan 31 21:05:33 kg-pbs1 kernel: sd 0:0:0:0: Power-on or device reset occurred
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 00000000458a83ba
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: resetting scsi 0:0:0:0 SCSI cmd at 00000000458a83ba due to cmd opcode 0x8a
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 00000000458a83ba: SUCCESS
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 000000002920b10f
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 000000002920b10f: SUCCESS
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 0000000054c3f49a
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: scsi 0:0:0:0 for SCSI cmd at 0000000054c3f49a already completed
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 00000000c18ee9f1
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: scsi 0:0:0:0 for SCSI cmd at 00000000c18ee9f1 already completed
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 0000000009dbe4e6
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: scsi 0:0:0:0 for SCSI cmd at 0000000009dbe4e6 already completed
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 00000000f64791f0
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: scsi 0:0:0:0 for SCSI cmd at 00000000f64791f0 already completed
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 00000000390261f2
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: scsi 0:0:0:0 for SCSI cmd at 00000000390261f2 already completed
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 00000000737cb003
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: scsi 0:0:0:0 for SCSI cmd at 00000000737cb003 already completed
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 000000009eb54a10
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: scsi 0:0:0:0 for SCSI cmd at 000000009eb54a10 already completed
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: attempting TASK ABORT on scsi 0:0:0:0 for SCSI cmd at 000000001b6475ef
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: scsi 0:0:0:0 for SCSI cmd at 000000001b6475ef already completed
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: reset of scsi 0:0:0:0: SUCCESS
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: resetting scsi 0:0:0:0 SCSI cmd at 000000002920b10f due to cmd opcode 0x8a
Jan 31 21:05:52 kg-pbs1 kernel: smartpqi 0000:5c:00.0: reset of scsi 0:0:0:0: SUCCESS
Jan 31 21:05:52 kg-pbs1 kernel: sd 0:0:0:0: Power-on or device reset occurred
Jan 31 21:05:52 kg-pbs1 zed[881611]: eid=1093 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=8192 offset=9528072564736 priority=3 err=0 flags=0x80100480 delay=30133ms
Jan 31 21:05:52 kg-pbs1 zed[881613]: eid=1091 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=655360 offset=11682020642816 priority=3 err=0 flags=0x80100480 delay=30200ms
Jan 31 21:05:52 kg-pbs1 zed[881610]: eid=1092 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=4096 offset=9528067448832 priority=3 err=0 flags=0x300080 delay=30133ms bookmark=54:391667:1:0
Jan 31 21:05:52 kg-pbs1 zed[881609]: eid=1089 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=4096 offset=9528113053696 priority=3 err=0 flags=0x300080 delay=30111ms bookmark=54:391804:1:0
Jan 31 21:05:52 kg-pbs1 zed[881612]: eid=1090 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=917504 offset=11682051182592 priority=3 err=0 flags=0x80100480 delay=30182ms
Jan 31 21:05:52 kg-pbs1 zed[881618]: eid=1094 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=786432 offset=11682050396160 priority=3 err=0 flags=0x80100480 delay=30201ms
Jan 31 21:05:52 kg-pbs1 zed[881620]: eid=1095 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=4096 offset=9528067436544 priority=3 err=0 flags=0x300080 delay=30135ms bookmark=54:391506:1:0
Jan 31 21:05:52 kg-pbs1 zed[881622]: eid=1096 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=196608 offset=9519235899392 priority=3 err=0 flags=0x80100480 delay=30154ms
Jan 31 21:05:52 kg-pbs1 zed[881624]: eid=1097 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=98304 offset=9528109023232 priority=3 err=0 flags=0x80100480 delay=30129ms
Jan 31 21:05:52 kg-pbs1 zed[881626]: eid=1098 class=delay pool='kg-pbs1-hdd' vdev=wwn-0x5000c500f2977fe3-part1 size=1048576 offset=11682019594240 priority=3 err=0 flags=0x80100480 delay=30204ms

The controller is configured in HBA mode with its caches disabled, so it shouldn't be interfering with the drives.
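
In case it helps, this is roughly how I verify the mode and cache settings (assuming HPE's ssacli tool is installed; the exact output depends on the controller and slot):
Code:
# assuming the HPE ssacli package is installed
ssacli ctrl all show config detail | grep -iE 'slot|mode|cache'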

I tried replacing the smartpqi driver with the current one from the repository, but it made no difference.
Code:
root@kg-pbs1:~# modinfo smartpqi | grep version
version:        2.1.38-022
description:    Driver for Microchip Smart Family Controller version 2.1.38-022 (d-6f8997e/s-e7f7d7c)
srcversion:     E6792B179A3DF1290C1B99B
vermagic:       6.17.4-2-pve SMP preempt mod_unload modversions
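
To rule out a mismatch between the installed package and the module that is actually loaded, the running driver version can also be read from sysfs (a quick sanity check, assuming the module exports its version there, which smartpqi does here):
Code:
# version of the smartpqi module currently loaded into the kernel
cat /sys/module/smartpqi/version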

The P816i-a firmware is up to date: 7.81

Code:
root@kg-pbs1:~# proxmox-backup-manager versions --verbose
proxmox-backup                      4.0.0         running kernel: 6.17.4-2-pve
proxmox-backup-server               4.1.1-1       running version: 4.1.1   
proxmox-kernel-helper               9.0.4                                   
proxmox-kernel-6.17.4-2-pve-signed  6.17.4-2                               
proxmox-kernel-6.17                 6.17.4-2                               
proxmox-kernel-6.17.4-1-pve-signed  6.17.4-1                               
proxmox-kernel-6.14.11-5-pve-signed 6.14.11-5                               
proxmox-kernel-6.14                 6.14.11-5                               
proxmox-kernel-6.14.11-4-pve-signed 6.14.11-4                               
proxmox-kernel-6.8                  6.8.12-17                               
proxmox-kernel-6.8.12-17-pve-signed 6.8.12-17                               
proxmox-kernel-6.8.4-2-pve-signed   6.8.4-2                                 
ifupdown2                           3.3.0-1+pmx11                           
libjs-extjs                         7.0.0-5                                 
proxmox-backup-docs                 4.1.1-1                                 
proxmox-backup-client               4.1.1-1                                 
proxmox-mail-forward                1.0.2                                   
proxmox-mini-journalreader          1.6                                     
proxmox-offline-mirror-helper       0.7.3                                   
proxmox-widget-toolkit              5.1.5                                   
pve-xtermjs                         5.5.0-3                                 
smartmontools                       7.4-pve1                               
zfsutils-linux                      2.3.4-pve1

Storage config for reference:
Code:
root@kg-pbs1:~# zpool status
  pool: kg-pbs1-hdd
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Feb  1 14:36:13 2026
        3.27T / 42.7T scanned, 35.9G / 40.5T issued at 57.4M/s
        36.2G resilvered, 0.09% done, 8 days 13:21:48 to go
config:

        NAME                          STATE     READ WRITE CKSUM
        kg-pbs1-hdd                   ONLINE       0     0     0
          mirror-0                    ONLINE       0     0     0
            wwn-0x5000c500f2975c23    ONLINE       0     0     0
            spare-1                   ONLINE       0     0     0
              wwn-0x5000c500f2977fe3  ONLINE       0     0     0
              wwn-0x5000c500f29485fb  ONLINE       0     0     0
          mirror-1                    ONLINE       0     0     0
            wwn-0x5000c500ec5aa9ef    ONLINE       0     0     1  (resilvering)
            wwn-0x5000c500f2961acb    ONLINE       0     0     0
          mirror-2                    ONLINE       0     0     0
            wwn-0x5000039b4851bc59    ONLINE       0     0     0
            wwn-0x5000c500f2962167    ONLINE       0     0     2  (resilvering)
          mirror-3                    ONLINE       0     0     0
            wwn-0x5000c500f2975bcb    ONLINE       0     0     0
            wwn-0x5000c500f296381b    ONLINE       0     0     0
          mirror-4                    ONLINE       0     0     0
            wwn-0x5000c500f294cb23    ONLINE       0     0     0
            wwn-0x5000c500f29650c3    ONLINE       0     0     1  (resilvering)
        spares
          wwn-0x5000c500f29485fb      INUSE     currently in use

errors: No known data errors

The problem is also independent of the kernel version; it showed up on both 6.14 and 6.17.

Does anyone have similar experiences, or know something I could try to troubleshoot this further?

Thanks in advance!

EDIT: I forgot to mention: the pool consists of 3.5" Seagate Exos X18 12TB SAS drives.
Backups were running at the time of the resets, so it might be related to a timeout somehow?
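
Regarding the timeout theory, the two thresholds I'd compare are the ZFS slow-I/O limit and the per-device SCSI command timeout (a sketch; the parameter name is from the OpenZFS version here, and sda is just an example device):
Code:
# ZED "delay" events are emitted once an I/O exceeds zio_slow_io_ms (default 30000 ms),
# which matches the ~30s delays in the journal above
cat /sys/module/zfs/parameters/zio_slow_io_ms
# per-device SCSI command timeout in seconds; sda is just an example
cat /sys/block/sda/device/timeout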
 