[SOLVED] RAID card issues on Kernel 6.8.4 - boot fail

PatrLind

New Member
Jun 13, 2024
[TL;DR]: A RAID card firmware bug seems to prevent booting on kernel 6.8.4-N-pve; the same system boots fine on 6.5.13-5-pve. Any ideas on how to fix it?

Today I have done a lot of troubleshooting for one of my Proxmox servers.

It started when I was rebooting the server after some updates. Suddenly it wasn't able to boot anymore.
The RAID card (Fujitsu PRAID CP400i also called D3307-A13 GS 1 (Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02))) gave me an error screen saying "L2/L3 Cache error was detected on the RAID controller. Please contact technical support to resolve this issue. Press 'X' to continue or else power off the system, replace controller and reboot."

Naturally I ordered another RAID card, and installed it, but to my surprise it failed with the exact same message...
I then started searching and found this PDF: https://www.fujitsu.com/us/imagesgig5/PY-CIB060-00.pdf
It states that "In a rare case, the PRIMERGY server with PRAID CP400i/CM400i (Firmware earlier than 24.21.0-0076) may stop during POST [...] This problem is caused by a bug in the RAID adapter firmware. [...] Update the RAID adapter firmware to 24.21.0-0076 or later."

That led me on a multi-hour quest to update the firmware; no tools were available and nothing I tried worked. At last I figured out that I could boot off the RHEL 9.4 installer and mount a USB stick onto which I had copied the update file from the Fujitsu Update DVD. The file is called 'PRAID_CP400i_242100163.scexe'. This performed the update from 24.16.0-0105 to 24.21.0-0163.
I powered off the machine and rebooted. However it still failed with the same error.
I then tried to boot again but this time I tried to select an older Linux Kernel, and to my surprise it booted without issue!

So it seems to me that the newer kernel still triggers some bug in the firmware of this card, even with the updated firmware. Very strange...

I actually don't even use the RAID features of this card, I just use the JBOD mode with ZFS.

Has anyone else had this issue? Anyone have a better solution? Every time I reboot the Proxmox server I now have to select the older kernel. And I reboot once a month after updates as it is now.
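
As a stopgap for having to pick the older kernel by hand at every boot, a minimal sketch follows. It assumes a Proxmox release whose proxmox-boot-tool ships the "kernel pin" subcommand, and that installed kernels appear as /boot/vmlinuz-<version>. The helper name newest_kernel_matching is hypothetical, made up for illustration; it only picks a version string and changes nothing on the system.

```shell
#!/usr/bin/env bash
# newest_kernel_matching PREFIX
# Hypothetical helper: reads kernel version strings on stdin (one per line)
# and prints the highest one that starts with PREFIX. Prints nothing if no
# version matches. Relies on GNU sort's -V (version sort).
newest_kernel_matching() {
    local prefix="$1"
    awk -v p="$prefix" 'index($0, p) == 1' | sort -V | tail -n 1
}

# Possible use on the real host (as root), left commented out so that
# sourcing this sketch has no side effects:
#   ls /boot | sed -n 's/^vmlinuz-//p' | newest_kernel_matching 6.5.
#   proxmox-boot-tool kernel pin 6.5.13-5-pve
#   proxmox-boot-tool kernel unpin     # later, to return to the default
```

Pinning only works around the problem; the 6.8 kernel would still fail to boot until the underlying cause is addressed.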
 
"I actually don't even use the RAID features of this card, I just use the JBOD mode with ZFS."
Is switching the firmware to IT mode something you'd be up for?

That model of card looks like it'd be an LSI 9300-8i.

I have a similar card in my system, and it flashed to the newest available LSI 9300-8i firmware with no issues.

I can dig up the links for the firmware and stuff if that'd be useful? They're in another comment I made about it a few days ago.
 
@justinclift Yes, seems like that card would be compatible. Well, I have two cards now, I might just try with one of them. So if you have some firmware files I would be interested.
 
No worries. This is my post from a few days ago with the details of the firmware:

https://forum.proxmox.com/threads/n...r-hba-it-mode-crashing-vm.148518/#post-671954

My recommendation would be to install the "STORCLI" command line utility (third download link in the post) first, as that's what you use to backup the existing firmware and BIOS (ie save it to local disk just in case) + flash new firmware on the card.

Then, probably update the card to the newest firmware (second download link in the post), then optionally flash both the BIOS and UEFI er.. flash things (first download link in the post).

At least, that's what worked in mine and it's running fine for the last few days without any issues. :)
 
Oh. I've just realised that the system the card is running in is also on kernel 6.5.x, due to it having an Nvidia card in the system (they don't like kernel 6.8.x at present).

So, there's no good info on whether the newer firmware will act better with the 6.8.x kernels.

Then again, seeing as you have two cards and can experiment a bit... :cool:
 
Thanks @kshesq, that led me to a working solution.
What I did was:
I edited /etc/kernel/cmdline and set it to:
root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt
Then I ran sudo pve-efiboot-tool refresh
After a reboot, the 6.8 kernel boots!
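
The steps above can be sketched as a small script. This is only an illustration: ensure_cmdline_params is a hypothetical helper (not a Proxmox tool), and it assumes the cmdline file holds a single line of space-separated parameters. The post does not say which entries were already in the file, so the helper simply appends whatever is missing and leaves the rest alone.

```shell
#!/usr/bin/env bash
# ensure_cmdline_params FILE PARAM...
# Hypothetical helper: appends each PARAM to the one-line kernel command
# line stored in FILE, unless that exact parameter is already present.
# Running it twice yields the same file (idempotent).
ensure_cmdline_params() {
    local file="$1"; shift
    local line param
    line="$(head -n 1 "$file")"
    for param in "$@"; do
        case " $line " in
            *" $param "*) ;;                       # already there, skip
            *) line="${line:+$line }$param" ;;     # append missing parameter
        esac
    done
    printf '%s\n' "$line" > "$file"
}

# Possible use on the real host (as root), left commented out so that
# sourcing this sketch has no side effects:
#   cp /etc/kernel/cmdline /etc/kernel/cmdline.bak
#   ensure_cmdline_params /etc/kernel/cmdline intel_iommu=on iommu=pt
#   pve-efiboot-tool refresh
```

Keeping a backup copy of the file first makes it easy to roll back if the new settings cause trouble.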

Apparently /etc/default/grub and update-grub are not in use on this system anymore; editing them doesn't actually change the boot settings...
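
Since edits in the wrong place silently have no effect, it can help to check what the running kernel actually received. A minimal sketch, assuming the standard /proc/cmdline interface; has_kernel_param is a hypothetical helper name, and the optional second argument exists only so the check can be tried against a sample file.

```shell
#!/usr/bin/env bash
# has_kernel_param PARAM [FILE]
# Hypothetical helper: succeeds (exit 0) if PARAM appears as a whole,
# space-delimited entry in the kernel command line. FILE defaults to
# /proc/cmdline, which holds the command line of the running kernel.
has_kernel_param() {
    local param="$1" file="${2:-/proc/cmdline}"
    local line
    line="$(tr '\n' ' ' < "$file")"
    case " $line " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use after a reboot:
#   uname -r
#   has_kernel_param iommu=pt && echo "iommu=pt is active"
```

If the check fails after a reboot, the change most likely went into a file the bootloader does not read.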

So in summary: The new kernel 6.8 apparently has some new default settings that make the megaraid_sas driver not work with these cards anymore. In my case it caused some firmware issues, leading to the card prompting me that the L2/L3 cache had failed and the card should be replaced. A firmware update that was supposed to fix the issue didn't. The actual fix was to update the kernel settings to make the driver (somehow) work again with these cards.
 
