[SOLVED] HW Raid Megaraid errors after deployment PVE 7.2 fresh install

bzb-rs

Member
Jun 8, 2022
Canada
I have a Supermicro server with an AMD EPYC CPU and a MegaRAID SAS 9341-8i running RAID5 across 3 SSDs. Deployment went smoothly, and I mounted the storage to a mount point using XFS. After some hours, however, the RAID controller started posting several errors, as shown in the attached screenshots.
Deploying an Ubuntu VM succeeds and it mostly runs fine, but the errors begin once some operation starts. Sometimes this leads to a total crash of the PVE server.

I am using the latest version of PVE.

I have already tried multiple combinations of fixes, but none helped.

Any ideas?
 

Attachments

  • Screenshot_35.png
  • Screenshot_36.png
  • Screenshot_37.png

Those issues can have quite a few causes, but the following steps usually help:
* Check that the RAID controller and all other components of the system (BIOS/UEFI, etc.) have the latest available firmware installed.
* PVE 7.2 ships with a new kernel series (5.15) that brings some config changes. Most of those changes affect older Intel systems, but for your system I'd try adding `iommu=pt` to the kernel command line; see the known issues for the 7.2 release:
https://pve.proxmox.com/wiki/Roadmap#7.2-known-issues

If none of the above helps, you can try installing the older kernel (apt install pve-kernel-5.13) and booting into it to see whether that changes the situation.
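
For anyone trying the kernel fallback, it is easy to reboot and end up on 5.15 again without noticing. A minimal sketch of a check follows; the helper names `kernel_series` and `running_series_is` are made up for illustration, and the release strings in the comments are only examples of the usual `uname -r` format:

```shell
#!/bin/sh
# Print the major.minor series of a kernel release string,
# e.g. "5.13.19-6-pve" gives "5.13". (Hypothetical helper, not a PVE tool.)
kernel_series() {
    release="$1"
    major="${release%%.*}"      # text before the first dot
    rest="${release#*.}"        # text after the first dot
    minor="${rest%%[!0-9]*}"    # leading digits of the remainder
    printf '%s.%s\n' "$major" "$minor"
}

# Succeed if the given release (default: the running kernel) is in the wanted series.
running_series_is() {
    wanted="$1"
    release="${2:-$(uname -r)}"
    [ "$(kernel_series "$release")" = "$wanted" ]
}
```

After `apt install pve-kernel-5.13` and a reboot into the older entry from the boot menu, something like `running_series_is 5.13 && echo "on the 5.13 kernel"` should confirm which kernel is actually running before you start testing the controller.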

I hope this helps!
 
Thanks for the reply @Stoiko Ivanov.
I attempted to downgrade the kernel, but it gave similar errors from the hardware RAID. However, I added iommu=pt to the kernel command line, ran the update and rebooted, and that seems to have fixed the issue, at least temporarily.
(I also added amd_iommu=on.)

/etc/default/grub

Screenshot_8.png
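
Since the screenshot content is not reproduced in text, here is a rough sketch of how the change could be scripted. It assumes a GRUB-booted system whose `/etc/default/grub` contains a single double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line, and GNU sed; `add_kernel_param` is a hypothetical helper, not something shipped with PVE:

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
# Usage: add_kernel_param PARAM [FILE]   (FILE defaults to /etc/default/grub)
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    # Current value between the double quotes (empty if the line is missing)
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) return 0 ;;   # already present, leave the file untouched
    esac
    new="${current:+$current }$param"
    # Rewrite only the GRUB_CMDLINE_LINUX_DEFAULT line (GNU sed -i)
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\".*\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"|" "$file"
}
```

Usage would be along the lines of `add_kernel_param iommu=pt` followed by `add_kernel_param amd_iommu=on` (the second only because it is mentioned in the post above), then `update-grub` and a reboot. `cat /proc/cmdline` afterwards shows whether the parameters were picked up. Back up the file first. Hosts that boot via systemd-boot rather than GRUB keep their command line in `/etc/kernel/cmdline` and need `proxmox-boot-tool refresh` instead, so this sketch does not apply there.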


I also switched from RAID5 to ZFS RAIDZ1 by passing the 3 disks through to the OS.
 
Glad that seems to have worked out :)

If the issue recurs, drop us a message here :)
 
Make sure you are aware of another issue present, which shows itself only during rebuild: https://forum.proxmox.com/threads/r...d-in-rebuild-consistiency-check-state.110470/


Be aware that ZFS will not function at its best when the disks are not passed to it directly (for example, if you created 3x single-disk RAID0 volumes and gave those to ZFS).
I have not encountered any further errors since patching. I have 2 different LSI cards, a 9341 and a 9361, in 2 of our servers, and after applying the patch and running several operational tests I have yet to see any of the previous errors. I am indeed running the latest version of PVE, including the kernel.

Do note that I have deleted the hardware RAID5 I had for my 3 disks. I currently run only one hardware RAID array, a RAID1 for the OS drives. The reason is that I tested the same setup using only the hardware RAID1 with the other drives removed, and it did not throw any errors during operation.
The remaining 3 disks are individual JBOD disks, passed to the OS as-is for ZFS creation.
I have successfully created an HA cluster with replication, for which ZFS is a requirement.
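
For readers wanting to reproduce the JBOD-plus-ZFS layout, a hedged sketch is below. It assumes a RAIDZ1 pool across the three passed-through disks, which the thread suggests but does not spell out. The helper `raidz1_create_cmd`, the pool name `tank`, and the device paths are all placeholders, and the function only prints the command so that nothing is touched by accident:

```shell
#!/bin/sh
# Print (do not run) a zpool create command for a RAIDZ1 pool.
# Usage: raidz1_create_cmd POOL DISK1 DISK2 DISK3 [...]
# This sketch expects a pool name plus three or more disks, matching the setup above.
raidz1_create_cmd() {
    if [ "$#" -lt 4 ]; then
        echo "usage: raidz1_create_cmd POOL DISK1 DISK2 DISK3 [...]" >&2
        return 1
    fi
    pool="$1"
    shift
    # The remaining arguments are the member disks, ideally /dev/disk/by-id/ paths
    echo "zpool create $pool raidz1 $*"
}
```

For example, `raidz1_create_cmd tank /dev/disk/by-id/DISK_A /dev/disk/by-id/DISK_B /dev/disk/by-id/DISK_C` prints the command for review; run the printed line as root only after checking the disk IDs with `ls -l /dev/disk/by-id/`. `zpool status` then shows the resulting pool layout.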
 
