Hi all.
I've been trying to solve this issue for a while now and haven't managed to resolve it.
Hopefully someone may be able to offer some insight.
HARDWARE & ENVIRONMENT:
- Dell PowerEdge T420 with 8 × 3.5″ SAS/SATA back-plane.
- Dell PERC H710 D1 flashed to IT mode (LSI SAS2308), firmware 20.00.07.00.
- LSI 9211-8i (also SAS2308, IT mode).
- 4 × NetApp X438_TPM3V400AMD 400 GB SAS SSDs (firmware NA00); all drives have passed smartctl long self-tests with no errors.
- Proxmox VE 8.4 (latest stable) - Kernel 6.8.12-13-pve, mpt3sas driver 43.100.00.00.
I have set up the four SSDs in ZFS as a mirror.
There are no issues with the pool while it is idle with no virtual machines running. However, as soon as I put any load on the SSDs (i.e., start virtual machines), after approximately three minutes drives are removed from the pool at random.
Here is a condensed dmesg log (summarised with help from ChatGPT):
(I kept all device_block, task abort, log_info, transport_port_remove, Synchronizing Cache, ZFS zio … error=5, and pool-suspension lines. Network / VM-bridge chatter is omitted so you can focus on the SAS events.)

Boot time (s) | Kernel message(s), verbatim order | Notes
243.092 | mpt3sas 0000:41:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM | Cosmetic, happens once after driver loads |
438.755 | sd 6:0:0:0: device_block, handle(0x0009) | Drive slot 7 stalls |
439.754 | mpt2sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a) × 3 | OPEN_REJECT / Queue Full |
440.254 | sd 6:0:0:0: device_unblock and setting to running, handle(0x0009) | HBA retries succeeded |
440.260 | zio pool=virtual-machines vdev=/dev/disk/by-id/scsi-350000396ec883258-part1 error=5 type=5 offset=0 size=0 flags=1049728 × 5 | ZFS sees I/O failures (error 5 = EIO)
440.284 | sd 6:0:0:0: [sda] Synchronizing SCSI cache; Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT | HBA lost the link
440.285 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec883259) enclosure logical id … slot(7) | Disk logically removed |
474.279 | sd 6:0:2:0: [sdc] Synchronizing SCSI cache → DID_NO_CONNECT | Slot 5 now |
474.280 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59) | — |
474.784 – 480.004 | sd 6:0:3:0: attempting task abort! × 3; device_block, handle(0x000c); log_info(0x3112011a) × 4 | Slot 4 queue overflows
481.261 | zio pool=virtual-machines vdev=/dev/disk/by-id/scsi-350000396ec882ba0-part1 error=5 × 5; WARNING: Pool 'virtual-machines' has encountered an uncorrectable I/O failure and has been suspended. | ZFS suspends pool
481.276 | sd 6:0:3:0: [sdd] Synchronizing SCSI cache → DID_NO_CONNECT | — |
481.285 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1) | Slot 4 finally yanked |
949.633 | sd 6:0:1:0: device_block, handle(0x000a) | Fresh boot – slot 6 |
950.632 | log_info(0x3112011a) × 4 | Queue overflow
951.913 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec8832c9) … slot(6) | Disk gone |
1107.884 | sd 6:0:3:0: device_block, handle(0x000c) | Slot 4 again |
1108.633 | log_info(0x3112011a) × 4 | — |
1109.917 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882ba1) … slot(4) | Disk removed |
1535.636 | sd 6:0:2:0: device_block, handle(0x000b) | Slot 5 |
1536.385 | log_info(0x3112011a) × 6 | — |
1537.936 | mpt3sas_transport_port_remove: removed sas_addr(0x50000396ec882f59) … slot(5) | Finally removed |
This issue occurs with both HBAs.
I've tried the following to troubleshoot the issue:
- Swapped PERC H710 for LSI 9211-8i (same behaviour).
- Moved card(s) to a different slot.
- Re-cabled both back-plane SAS ports.
- Tried a dual-port configuration (both back-plane SAS cables connected, one to each port on the HBA).
At this stage I am not sure what else to do.
For context, prior to installing Proxmox I ran ESXi 7.0.3 for over a year with the same SSDs and the same HBA, with no issues.
I'm wondering if this is a firmware issue: either the HBA firmware is too old, or perhaps the firmware on the SSDs is.
Does anybody have any suggestions? I'm pulling my hair out trying to make sense of this!