HPE MR408i Raid Controller crashes with Proxmox

Necaro

New Member
Jun 20, 2024
Dear Proxmox-Community,

I'm relatively new to Proxmox and was working with VMware primarily before.

I have a new Host with the following Specs:

HPE ProLiant DL360 Gen11 SFF
2x Intel Xeon-Gold 5515+ (3.2GHz/8-Core/165W)
8x HPE 32GB Dual Rank DDR5-4800 Registered Memory
1x HPE MR408i-o Gen11 SPDM Storage Controller
1x HPE 96W Smart Storage Battery with 145mm Cable
4x HPE 1.92TB SAS 12G Read Intensive SFF (2.5in) SSD
1x HPE NS204i-u Gen11 Hot Plug Boot Device
1x Broadcom BCM5719 1Gb 4-port BASE-T Adapter
1x HPE Ethernet 10Gb 2-port BASE-T BCM57416 Adapter

Proxmox is installed on the NS204i boot device. The SSDs are configured as a Raid 5 with the MR408i raid controller and with LVM-Thin which serves as the datastore.
VMs were migrated/imported from VMware which worked without problems.

My problem is that the raid controller seems to crash 1-2 times a day. During a crash all VMs and the host stop responding, and after a few minutes they resume normal operation. The firmware of all hardware devices is up to date, and so is the Proxmox installation. I suspect that this happens when there is a spike in I/O operations.

I attached a screenshot of the Proxmox information and of the CPU usage graph during such a crash. One can see the gap in the graph where the host hangs.
I also attached a log snippet from when the crash happens. iLO also logs the crash with the following event:
EVENT (31-Jul-2025 08:00): ControllerPreviousError (Slot=14, 0x7f833119) Redfish event from /redfish/v1/Systems/1/Storage/DE00B000/Controllers/0

Apart from the crashes, everything runs fast and normally.

Does anyone have an idea what could cause this?
I would be grateful for any tips as I am out of ideas at the moment.
If I can provide more useful information, please let me know.

P.S.: Ditching the raid controller or configuring it with pass through for ZFS Raid is sadly not an option atm.

Best regards,
Alex
 

Attachments

  • proxmox_infos.png
  • proxmox_cpu_usage.png
  • proxmox_crash_log.txt
Sadly, there is nothing anyone here can do about that:

Code:
Jul 31 10:00:13 prox1 kernel: megaraid_sas 0000:3b:00.0: 3259 (807271183s/0x0020/DEAD) - Fatal firmware error: Line 411 in fw\hw\debug\SnapDumpHelper.c
Jul 31 10:00:13 prox1 kernel: megaraid_sas 0000:3b:00.0: 3263 (807271191s/0x0020/CRIT) - Controller encountered an error and was reset

You may contact HPE support and send them the dmesg output for further diagnostics.
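If it helps when putting together the support case, here is a minimal sketch for pulling the relevant controller events out of a saved dmesg or journal dump. It only assumes the line layout seen in the two log lines above (sequence number, then timestamp, a hex field and a severity class in parentheses); the function name `controller_events` and the field names are made up for illustration, and the hex field is only believed to be the event locale.

```python
import re

# Matches megaraid_sas event lines in the layout seen in this thread, e.g.
#   megaraid_sas 0000:3b:00.0: 3259 (807271183s/0x0020/DEAD) - Fatal firmware error: ...
# Field names are illustrative; "locale" is an assumption about the hex field.
EVENT_RE = re.compile(
    r"megaraid_sas (?P<pci>[0-9a-fA-F:.]+): (?P<seq>\d+) "
    r"\((?P<time>\d+)s/(?P<locale>0x[0-9a-fA-F]+)/(?P<level>[A-Z]+)\) - (?P<msg>.*)"
)


def controller_events(lines, levels=("DEAD", "CRIT", "FATAL")):
    """Return the parsed events whose severity class is in `levels`."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match and match.group("level") in levels:
            events.append(match.groupdict())
    return events


if __name__ == "__main__":
    # The two lines quoted above, used as sample input.
    sample = [
        r"Jul 31 10:00:13 prox1 kernel: megaraid_sas 0000:3b:00.0: 3259 "
        r"(807271183s/0x0020/DEAD) - Fatal firmware error: Line 411 in fw\hw\debug\SnapDumpHelper.c",
        r"Jul 31 10:00:13 prox1 kernel: megaraid_sas 0000:3b:00.0: 3263 "
        r"(807271191s/0x0020/CRIT) - Controller encountered an error and was reset",
    ]
    for ev in controller_events(sample):
        print(ev["pci"], ev["seq"], ev["level"], ev["msg"].strip())
```

Feeding it the full log instead of the two sample lines should give a compact list of every fatal error and reset, which makes it easier to see how often the controller actually goes down.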
 
Thanks for your reply.
Yeah, I did see those messages. I was wondering whether some raid controller settings could cause these crashes in combination with Proxmox.
Perhaps someone has a similar setup running without problems and can share their raid (controller) settings.

I will also try contacting HPE Support, but I'm not sure they are willing to help, as afaik Proxmox is not officially supported by HPE.
 
Are you running the latest firmware for this controller and the latest SPP for this server?

I'm about to pull the trigger and purchase 9 servers with the mentioned boot controller, but now I'm a bit hesitant.

Please let us know if HPE support comes back to you with a solution.

I have searched for similar errors, and it might be worth trying the following:

1: Add pcie_aspm=off to the kernel command line (e.g. quiet pcie_aspm=off)
2: Disable legacy boot, use UEFI boot only.

Also, are there any errors in the AHS or IML logs?
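For the first suggestion: depending on the bootloader, the parameter usually goes into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub (followed by update-grub), or into /etc/kernel/cmdline (followed by proxmox-boot-tool refresh) on systemd-boot installs. After the reboot it is worth confirming that the kernel really picked it up. A minimal sketch of such a check, assuming only that /proc/cmdline is readable; the helper names `kernel_params` and `aspm_disabled` are hypothetical:

```python
from pathlib import Path


def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def aspm_disabled(cmdline: str) -> bool:
    """True if pcie_aspm=off is present on the given command line."""
    return kernel_params(cmdline).get("pcie_aspm") == "off"


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    if path.exists():
        print("pcie_aspm=off active:", aspm_disabled(path.read_text()))
```

If this reports False after the reboot, the change most likely went into the config of the bootloader that is not actually in use on that host.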
 
Hi!

Firmware problem, use HPE SPP DVD to upgrade.

Code:
https://downloadmirror.intel.com/776844/35xx_MR_iMR_FWPKG-51.23.0-4637_Release_notes.txt

DCSG01177464    iMR: Fatal firmware error: Line 379 in fw\hw\debug\SnapDumpHelper.c detected after converted SATA SSD PD to JBOD
 
Thanks for the tip. Raid controller firmware was my first guess as well, but I already have the newest version from HPE (52.32.3-6118) installed.
 
Hey,

Are you running the latest firmware for this controller and the latest SPP for this server?
>> Yes, I have all the latest available firmware from HPE installed.

I'm about to pull the trigger and purchase 9 servers with the mentioned boot controller, but now I'm a bit hesitant.
>> The boot controller (NS204i) is not the problem, I guess. The one that crashes is the MR408i raid controller, which is used for VM storage.

Also, are there any errors in the AHS or IML logs?
>> No, only the one error I posted in my first post gets logged in IML.

Please let us know if HPE support comes back to you with a solution.
>> I'm on holiday for the next two weeks and will open a case with HPE after that. If that doesn't help, I might also open a paid Proxmox ticket and see if they have any ideas.

I have searched for similar errors, and it might be worth trying the following:

1: quiet pcie_aspm=off
2: Disable legacy boot, Use UEFI boot only.
>> The first might be worth a try. I only use UEFI already.

I observed the behaviour a bit more. As soon as there is somewhat more load on the storage, reads and writes get slower and slower until the controller finally crashes. After that everything is smooth and fast again, until it slowly degrades towards the next crash.
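To put numbers on that slowdown, something like the following could be left running and its output compared with the crash times. It is only a sketch: it samples /proc/diskstats twice and computes the average time per completed request in between (roughly what iostat reports as await). The device name `sda` is an assumption and has to be replaced with whatever the RAID volume shows up as; the function names are made up for illustration.

```python
import os
import time


def parse_diskstats(text):
    """Parse /proc/diskstats content into {device: {"ios": n, "ms": n}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue
        stats[fields[2]] = {
            # reads completed + writes completed
            "ios": int(fields[3]) + int(fields[7]),
            # milliseconds spent reading + milliseconds spent writing
            "ms": int(fields[6]) + int(fields[10]),
        }
    return stats


def average_wait_ms(before, after, device):
    """Average milliseconds per completed request between two samples."""
    d_ios = after[device]["ios"] - before[device]["ios"]
    d_ms = after[device]["ms"] - before[device]["ms"]
    return d_ms / d_ios if d_ios else 0.0


def read_diskstats(path="/proc/diskstats"):
    with open(path) as fh:
        return parse_diskstats(fh.read())


if __name__ == "__main__":
    device = "sda"  # assumption: adjust to the RAID volume's block device
    if os.path.exists("/proc/diskstats"):
        first = read_diskstats()
        time.sleep(1)
        second = read_diskstats()
        if device in first and device in second:
            print(f"{device}: {average_wait_ms(first, second, device):.2f} ms per request")
```

Run in a loop with timestamps, a steadily climbing value in the minutes before each controller reset would back up the observation above and would be useful evidence for the HPE case.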

Surprisingly, Veeam can back up the VMs at night without problems at full speed (~120MB/s, only a Gigabit connection to the backup storage).
I'm all ears for more suggestions to try after my holiday :)

Best regards,
Alex
 
I'm interested too, as I am also looking at this server. Is there any update on the situation? Has it gotten any better?