Problem with MegaRAID SAS3508 controller

So we've tried to reproduce it on a test system provided to us, but couldn't so far.

Could you detail your setups, the workload and any steps necessary to trigger this?
 
We got a test system with a Broadcom / LSI Fusion-MPT SAS38xx, on which we are currently trying to reproduce the issues reported here as well as the other issues we've encountered.

Code:
Serial Attached SCSI controller [0107]: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx [1000:00e6]
Subsystem: Broadcom / LSI 9500-16i Tri-Mode HBA [1000:4050]

Since we have confirmed that no issues occur with the LSI 9500-16i Tri-Mode HBA, I believe testing with the SAS3808 (LSI 9500-16i) or SAS3816 (LSI 9500-8i) operating in IT mode would not be meaningful.

* I use the LSI 9500-16i and LSI 9400-16i, and I have never had any problems with them.

Although they are the same SAS3808 and SAS3816 models, I believe testing will not be effective unless they are the iMR 9540-16i and 9540-8i versions.

The driver they use that causes the problem always appears to be `megaraid_sas`.

* Since I don't have these devices myself, this is based on what I observed in their logs.
* The LSI 9500-16i Tri-Mode HBA uses the mpt3sas driver.
 

The testsystem in question has a Broadcom MegaRAID 9540-8i. It is one of the affected controllers.

As mentioned, we weren't able to trigger the issues yet, so please provide details about the setups (including controller firmware and connected disks + firmware), the RAID configuration, filesystems/usage and the steps that usually trigger it.
 
Hello,

I have an affected system also: Supermicro MB, model #: H11DSi rev 2.x
BIOS: v3.5
Firmware: v1.52.23

MegaRAID 9660-16i
- Firmware: v8.17.1 (but also had issues on an older version)
- I tried the Proxmox-delivered driver and then updated to the Broadcom driver v8.17.1 (verified via "modinfo" that the new driver was in use).

The controller has four 3.8TB Micron NVMe drives connected.
(This worked for the past year on ESXi v7 perfectly, so I know the controller and drives are working in this server.)

Proxmox v9.1.9, Enterprise repo. Fully patched.

I have tried configuring the drives as "JBOD", using ZFS raidz. Fails after any data migration.
Now, I have it configured back to hardware RAID5 and LVM on Proxmox.

I can reproduce the error every time via a simple VM clone operation. I get about 40GB copied, and the controller basically shuts down.

I also tried the kernel parameters "iommu=pt" and "amd_iommu=on". THIS IS WEIRD. Before the parameters, the controller would die after copying 47GB. Now, with the parameters, it has a long 2-3 minute pause, then continues for another 40-50GB, rinse, repeat. Nothing in dmesg this time.
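
For anyone comparing setups, it may help to confirm that the parameters actually reached the running kernel. The following is only a minimal sketch: the helper names `parse_cmdline` and `iommu_settings` are made up for illustration, and it assumes a plain space-separated command line without quoted values.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def iommu_settings(cmdline: str) -> dict:
    """Return the values of the two IOMMU parameters mentioned above (None if absent)."""
    params = parse_cmdline(cmdline)
    return {key: params.get(key) for key in ("iommu", "amd_iommu")}


if __name__ == "__main__":
    proc = Path("/proc/cmdline")  # command line of the currently running kernel
    if proc.exists():
        print(iommu_settings(proc.read_text()))
```

If either value comes back as None after a reboot, the bootloader configuration was probably not refreshed, so the before/after comparison would not be meaningful.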


This is a long-running and hard to detect/fix issue. What are my other options? I read that downgrading the kernel to an older version helps, but I do not know the exact steps for that.


-Brian
 
Did you install storcli already?
storcli /call show # shows the number and model of the controllers; the first is 0, the second is 1
In the commands below, replace "x" with your controller number and try the available profiles. A profile change needs a controller restart!
storcli /cx show profile
storcli /cx set profile profileid=<value> ; storcli /cx restart
 
Hello, since I have a 9600 series Tri-mode adapter, I am using the storcli2 utility.
There is no "profile" command (it was removed for some reason).

But, thanks for turning me to digging into the options of storcli2!

I ran './storcli2 /c0 show events', and saw many 'consistency' errors that were being corrected during an initialization of the single VG.

Then, a few hours later, the controller was logging hi-temp errors. Hmmm. The server case is huge, many fans, no load on the controller or host. Seems like I need to dig into what is happening there.

-Brian
 
In my case :
- 2 x Supermicro SuperServer 2U 2014S-TR

For each system :
- EPYC 7313 DP/UP
- 8 x 32 GB of Registered DDR4 3200 ECC (brand supermicro)
- 2 x 400 GB of Micron 7450 MAX on Broadcom MR 9540-2M2 JBOD mode as plain disks for RAID 1 ZFS
- 8 x Samsung PM897 3.84TB on Broadcom HBA 9500-16i
- Supermicro AOC-STGF-I2S-O dual port 10 GbE SFP+

As for motherboard firmware :
- Firmware Version: 01.08.01
- Firmware Build Time: 12/17/2025
- Redfish Version: 1.21.1
- BIOS Firmware Version: BIOS Date: 12/17/2025 Ver 3.6
- CPLD Version: F0.A6.44

Controller specs and firmware : see attached screenshot (1779176880396.png)

I use "2 x 400 GB of Micron 7450 MAX on Broadcom MR 9540-2M2 JBOD mode as plain disks for RAID 1 ZFS" for my boot filesystem, which runs Proxmox Backup Server, currently version 4.2.0 with pinned kernel 6.14.11-7-pve (2026-04-30T09:27Z).

There is not much "workload" on these disks as they are boot drives and "8 x Samsung PM897 3.84TB on Broadcom HBA 9500-16i" combo seems to be working fine (even when I was using kernel 6.17 or 7.0.0-3-pve).

Steps to reproduce :
> Case 1 : fresh installation
- Download latest proxmox ISO with kernel 6.17 or 7 ;
- Write ISO to USB Stick ;
- Boot from USB Stick ;
- Go until the end of installation process ;
=> 50% of the time, process freezes on writing EFI ;

> Case 2 : OS upgrade from kernel >= 6.17
- Install system with kernel 6.17 ;
- Try upgrading system to PBS 4.2.X with kernel 7 ;
=> System crashes 100% of the time with similar errors to : https://forum.proxmox.com/threads/problem-with-megaraid-sas3508-controller.179378/post-842596

What's strange is that sometimes writing works for one disk but not the other, and sometimes it crashes on the first one (but mostly when writing the second one).
 
We have a setup with JBOD and Ceph OSDs on top where we can reliably crash it.
At the moment @dherzig is in the process of bisecting since 7.0 doesn't seem affected in our tests.

In our tests we have KIOXIA disks attached to it, the same ones we saw the mpt3sas issues with.
 
I would like to confirm that I am seeing what appears to be the same issue on my system.

Hardware:

Controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury]
PCI ID: 1000:005f
Subsystem: 1000:9341
The controller is currently configured in JBOD mode
Two disks are connected directly to onboard SATA and are visible
Two disks are connected through the MegaRAID controller and are not visible in Proxmox

System:

proxmox-ve: 9.2.0
pve-manager: 9.2.2
kernel: 7.0.2-6-pve

lspci output:

05:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] [1000:005f] (rev 02)
Subsystem: Broadcom / LSI Device [1000:9341]
Kernel modules: megaraid_sas

No "Kernel driver in use: megaraid_sas" line is shown for the controller, so the module is available but not bound to the device.
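
To make that check repeatable across reboots and kernels, here is a rough sketch that reads the text of a single device stanza (for example from `lspci -nnk -s 05:00.0`) and reports whether a driver is bound. The function name `driver_status` is just an illustration, and it assumes the input contains only one device.

```python
def driver_status(lspci_text: str) -> dict:
    """Extract the bound driver and the candidate modules from one lspci -k stanza."""
    status = {"driver_in_use": None, "modules": []}
    for raw in lspci_text.splitlines():
        line = raw.strip()
        if line.startswith("Kernel driver in use:"):
            status["driver_in_use"] = line.split(":", 1)[1].strip()
        elif line.startswith("Kernel modules:"):
            names = line.split(":", 1)[1].split(",")
            status["modules"] = [name.strip() for name in names if name.strip()]
    return status


# The stanza quoted above: a module is listed, but no driver is bound.
SAMPLE = """\
05:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] [1000:005f] (rev 02)
    Subsystem: Broadcom / LSI Device [1000:9341]
    Kernel modules: megaraid_sas
"""

print(driver_status(SAMPLE))  # {'driver_in_use': None, 'modules': ['megaraid_sas']}
```

A result with a module listed but `driver_in_use` set to None matches what I see here: the module loads, but the probe fails and the device stays unbound.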

Relevant dmesg messages:

megaraid_sas 0000:05:00.0: FW now in Ready state
megaraid_sas 0000:05:00.0: controller type : iMR(0MB)
megaraid_sas 0000:05:00.0: Secure JBOD support : Yes
megaraid_sas 0000:05:00.0: JBOD sequence map support : Yes
megaraid_sas 0000:05:00.0: megasas_get_ld_map_info DCMD timed out, RAID map is disabled
megaraid_sas 0000:05:00.0: DCMD(opcode: 0x200e102) is timed out, func:megasas_issue_blocked_cmd
megaraid_sas 0000:05:00.0: megasas_sync_pd_seq_num DCMD timed out, continue without JBOD sequence map
megaraid_sas 0000:05:00.0: DCMD(opcode: 0x2010100) is timed out, func:megasas_issue_blocked_cmd
megaraid_sas 0000:05:00.0: Ignore DCMD timeout: megasas_get_pd_list
megaraid_sas 0000:05:00.0: DCMD(opcode: 0x3010100) is timed out, func:megasas_issue_blocked_cmd
megaraid_sas 0000:05:00.0: Ignore DCMD timeout: megasas_ld_list_query
megaraid_sas 0000:05:00.0: failed to get LD list
megaraid_sas 0000:05:00.0: megasas_init_fw: megasas_get_device_list failed
megaraid_sas 0000:05:00.0: Failed from megasas_init_fw
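
In case others want to compare logs, the timed-out DCMD opcodes can be pulled out of saved dmesg text with a small script. This is only a sketch: the regular expression is modelled on the messages quoted above, and the names `dcmd_timeouts` and `init_failed` are my own.

```python
import re

# Matches lines such as:
#   megaraid_sas 0000:05:00.0: DCMD(opcode: 0x200e102) is timed out, func:...
DCMD_TIMEOUT = re.compile(
    r"megaraid_sas (?P<dev>[0-9a-f:.]+): "
    r"DCMD\(opcode: (?P<opcode>0x[0-9a-f]+)\) is timed out"
)


def dcmd_timeouts(log_text: str) -> list:
    """Return (pci_address, opcode) pairs for every DCMD timeout in the log text."""
    return [(m["dev"], m["opcode"]) for m in DCMD_TIMEOUT.finditer(log_text)]


def init_failed(log_text: str) -> bool:
    """True if the driver reported that firmware initialisation failed."""
    return "Failed from megasas_init_fw" in log_text


if __name__ == "__main__":
    sample = (
        "megaraid_sas 0000:05:00.0: DCMD(opcode: 0x200e102) is timed out, "
        "func:megasas_issue_blocked_cmd\n"
        "megaraid_sas 0000:05:00.0: Failed from megasas_init_fw\n"
    )
    print(dcmd_timeouts(sample), init_failed(sample))
```

Running it over the output of `dmesg` from different kernels would show whether the same opcodes time out each time.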

The controller itself is detected by PCI, but the disks behind it do not appear in lsblk.

I have not yet tested with an older Proxmox kernel, but this looks similar to the issue reported here with MegaRAID SAS-3 3008 / JBOD disks not being detected on newer kernels.