Problem with MegaRAID SAS3508 controller

We got a test system with a Broadcom / LSI Fusion-MPT SAS38xx, on which we are currently trying to reproduce the issues described here as well as the other issues we've encountered.

Code:
Serial Attached SCSI controller [0107]: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx [1000:00e6]
Subsystem: Broadcom / LSI 9500-16i Tri-Mode HBA [1000:4050]

So we've tried to reproduce it on a test system provided to us, but couldn't so far.

Could you detail your setups, the workload and any steps necessary to trigger this?

Since we have confirmed that no issues occur with the LSI 9500-16i Tri-Mode HBA, I believe testing with the SAS3816 (LSI 9500-16i) or SAS3808 (LSI 9500-8i) operating in IT mode would not be meaningful.

* I use the LSI 9500-16i and LSI 9400-16i, and I have never had any problems with them.

Although they are the same SAS3808 and SAS3816 models, I believe testing will not be effective unless they are the iMR 9540-16i and 9540-8i versions.

The driver in use on the affected systems always appears to be `megaraid_sas`.

* Since I don't have these devices myself, this is based on what I observed in their logs.
* The LSI 9500-16i Tri-Mode HBA uses the `mpt3sas` driver.
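
To check which driver is actually bound to a controller rather than inferring it from logs, the output of `lspci -nnk` can be filtered. This is a minimal sketch; the function name `list_drivers` is made up for illustration, and it only assumes the usual `lspci -k` layout where the device line is followed by an indented "Kernel driver in use:" line.

```shell
# Read `lspci -nnk` output on stdin and print "<driver><TAB><device line>"
# for every device that has a kernel driver bound to it.
list_drivers() {
  awk '
    /^[0-9a-f]/ { dev = $0 }                      # device lines start with the PCI address
    /Kernel driver in use:/ {
      sub(/^.*Kernel driver in use:[ \t]*/, "")   # keep only the driver name
      print $0 "\t" dev
    }
  '
}

# Usage on the host (vendor ID 1000 is Broadcom / LSI):
#   lspci -nnk -d 1000: | list_drivers
```

An HBA in IT mode should show up with `mpt3sas`, while the controllers discussed as affected in this thread should show `megaraid_sas`.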
 

The test system in question has a Broadcom MegaRAID 9540-8i, which is one of the affected controllers.

As mentioned, we haven't been able to trigger the issues yet, so please provide details about your setups (including controller firmware and the connected disks with their firmware), the RAID configuration, the filesystems and how they are used, and the steps that usually trigger the problem.
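
One way to gather most of these details in a single pass is a small collection script. This is only a sketch: the function names and the output file name are made up, and the controller tool is assumed to be installed as `storcli` (on some systems the binary is named differently, for example `storcli64` or `storcli2`). Tools that are not installed are skipped rather than treated as errors.

```shell
# Print a header, then run the command if it exists; otherwise note that it was skipped.
run_if_present() {
  printf '=== %s ===\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1
  else
    printf 'skipped: %s not found in PATH\n' "$1"
  fi
}

# Collect kernel, Proxmox, driver, controller, disk and storage-layout details.
collect_report() {
  run_if_present uname -r
  run_if_present pveversion -v
  run_if_present modinfo -F version megaraid_sas
  run_if_present storcli /call show all          # assumed binary name, see above
  run_if_present lsblk -o NAME,MODEL,REV,SIZE,FSTYPE
  run_if_present pvs
  run_if_present zpool status
}

# Usage on the affected host, as root:
#   collect_report > megaraid-report.txt
```

The resulting text file can then be attached to a reply together with a description of the workload and the steps that trigger the problem.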
 
Hello,

I also have an affected system: Supermicro motherboard, model H11DSi rev 2.x
BIOS: v3.5
Firmware: v1.52.23

MegaRAID 9660-16i
- Firmware: v8.17.1 (but also had issues on an older version)
- I tried the driver shipped with Proxmox and then updated to the Broadcom driver v8.17.1 (I verified via "modinfo" that the new driver was in use).

The controller has four 3.8 TB Micron NVMe drives connected.
(This worked for the past year on ESXi v7 perfectly, so I know the controller and drives are working in this server.)

Proxmox v9.1.9, Enterprise repo. Fully patched.

I tried configuring the drives as "JBOD" with ZFS raidz on top. That fails after any data migration.
I have now configured it back to hardware RAID5 with LVM on Proxmox.

I can reproduce the error every time via a simple VM clone operation. I get about 40GB copied, and the controller basically shuts down.

I also tried the kernel parameters "iommu=pt" and "amd_iommu=on". THIS IS WEIRD. Before the parameters, the controller would die after copying 47GB. Now, with the parameters, it has a long 2-3 minute pause, then continues for another 40-50GB, rinse, repeat. Nothing shows up in dmesg this time.
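
When comparing behaviour with and without such parameters, it helps to confirm that the running kernel was really booted with them. A minimal sketch, with a made-up helper name `has_param`, that looks for an exact token in a kernel command-line string:

```shell
# Succeed if the exact token $2 appears in the space-separated command line $1.
has_param() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the host:
#   for p in iommu=pt amd_iommu=on; do
#     if has_param "$(cat /proc/cmdline)" "$p"; then echo "$p active"; else echo "$p MISSING"; fi
#   done
```

The match is on whole tokens, so `iommu=pt` is not mistaken for part of another parameter such as `amd_iommu=...`.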


This is a long-running issue that is hard to detect and fix. What are my other options? I read that downgrading the kernel to an older version helps, but I do not know the exact steps for that.


-Brian
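
On the question of booting an older kernel: a rough sketch, assuming an older kernel is still installed on the host and that the boot entries are managed by `proxmox-boot-tool`. The helper name `pick_older_kernel` is made up; it only selects the newest installed version below the running one and relies on GNU `sort -V`. Which kernel version, if any, avoids the problem is not established in this thread.

```shell
# Read installed kernel versions on stdin (one per line) and print the
# newest one that sorts below the running version given as $1.
pick_older_kernel() {
  running=$1
  { cat; printf '%s\n' "$running"; } |
    sort -V -u |
    awk -v r="$running" '$0 == r { print prev; exit } { prev = $0 }'
}

# Usage on the host, as root (pinning keeps the chosen kernel across reboots):
#   older=$(ls /lib/modules | pick_older_kernel "$(uname -r)")
#   proxmox-boot-tool kernel pin "$older"
#   reboot
# To go back to the default selection later:
#   proxmox-boot-tool kernel unpin
```

If no older kernel is installed, the helper prints an empty line and nothing should be pinned.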
 
Have you installed storcli already?
storcli /call show    # shows the number and model of the controllers; the first is 0, the second is 1
In the commands below, replace "x" with your controller number and try the available profiles. A profile change requires a controller restart!
storcli /cx show profile
storcli /cx set profile profileid=<value> ; storcli /cx restart
 