[TL;DR]: A RAID card firmware bug seems to prevent booting on kernel 6.8.4-N-pve, while 6.5.13-5-pve works. Any ideas on how to fix it?
Today I did a lot of troubleshooting on one of my Proxmox servers.
It started when I rebooted the server after some updates. Suddenly it could no longer boot.
The RAID card, a Fujitsu PRAID CP400i, also called D3307-A13 GS 1 (Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)), gave me an error screen saying: "L2/L3 Cache error was detected on the RAID controller. Please contact technical support to resolve this issue. Press 'X' to continue or else power off the system, replace controller and reboot."
Naturally I ordered another RAID card and installed it, but to my surprise it failed with the exact same message...
I then started searching and found this PDF: https://www.fujitsu.com/us/imagesgig5/PY-CIB060-00.pdf
It states that "In a rare case, the PRIMERGY server with PRAID CP400i/CM400i (Firmware earlier than 24.21.0-0076) may stop during POST [...] This problem is caused by a bug in the RAID adapter firmware. [...] Update the RAID adapter firmware to 24.21.0-0076 or later."
That led me on a multi-hour quest to update the firmware. No tools were available and nothing I tried worked. At last I figured out that I could boot off the RHEL 9.4 installer and mount a USB drive onto which I had copied the update file from the Fujitsu Update DVD. The file is called 'PRAID_CP400i_242100163.scexe'. Running it updated the firmware from 24.16.0-0105 to 24.21.0-0163.
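As a sanity check on the version numbers, here is a minimal Python sketch that compares the old and new firmware versions against the minimum from the Fujitsu advisory. It assumes the version strings can be compared field by field as integers, which is my guess at the numbering scheme and not something Fujitsu documents; the function names are just made up for illustration.

```python
import re

# Minimum fixed firmware version quoted in the Fujitsu advisory.
MINIMUM_FIXED = "24.21.0-0076"


def parse_fw_version(version: str) -> tuple[int, ...]:
    """Split a version such as '24.21.0-0076' into a tuple of integers.

    Assumption: every field separated by '.' or '-' is numeric and the
    fields are ordered from most to least significant.
    """
    return tuple(int(part) for part in re.split(r"[.-]", version))


def has_fix(installed: str, minimum: str = MINIMUM_FIXED) -> bool:
    """Return True if the installed version is at or above the minimum."""
    return parse_fw_version(installed) >= parse_fw_version(minimum)


if __name__ == "__main__":
    for version in ("24.16.0-0105", "24.21.0-0163"):
        print(version, "has fix:", has_fix(version))
```

Under that assumption, 24.16.0-0105 is below the advisory's minimum and 24.21.0-0163 is above it, so the card should now be on a firmware that includes the fix.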
I powered the machine off and on again. However, it still failed with the same error.
I then tried booting again, this time selecting an older Linux kernel, and to my surprise it booted without issue!
So it seems to me that the newer kernel still triggers some bug in this card's firmware, even with the updated firmware. Very strange...
I actually don't even use the RAID features of this card; I just use JBOD mode with ZFS.
Has anyone else had this issue? Does anyone have a better solution? Every time I reboot the Proxmox server I now have to select the older kernel, and as it is now I reboot once a month after updates.