proxmox-kernel-6.8.12-29 causing array controller errors

fes-it-admin

May 27, 2023
With the new kernel proxmox-kernel-6.8.12-29, we are getting a lot of errors from our HP array controller "Smart HBA H240ar", which shows up in the kernel log like this:

dmar_fault: 155 callbacks suppressed
DMAR: DRHD: handling fault status reg 202
DMAR: [DMA Read NO_PASID] Request device [03:00.0] fault addr 0x791dc00 [fault reason 0x06] PTE Read access is not set

This is on our HP ProLiant DL380 Gen9 servers. When we revert to proxmox-kernel-6.8.12-23, it works. What could fix this issue? Is it maybe a problem between the newer kernel and old firmware (our controller firmware is version 7.00)?
 
Hi @fes-it-admin

thanks for posting in the forum!

Disclaimer: I haven't done any extensive digging into the root cause of the problem, but as a "quick to try" hunch: try disabling the shared memory features of the H240ar in the device configuration.
In another thread [1] this caused problems with passing through to a VM, so maybe it also interferes with the regular operation in some way.

Yours sincerely
Jonas

[1] https://forum.proxmox.com/threads/can’t-passthrough-p840.155980/
 
Looks like something changed in the IOMMU/VT-d handling between those two kernel versions. The DMAR faults basically mean the IOMMU is blocking DMA requests from your H240ar.
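
To double-check that the faulting device really is the H240ar, you can pull the PCI address out of the fault line and look it up. This is just a rough sketch: dmar_fault_device is a helper name I made up, and it assumes the log line has the same "Request device [..]" format as in your post.

```shell
#!/bin/sh
# Hypothetical helper: print the PCI bus:device.function from a DMAR fault line.
dmar_fault_device() {
    # -n plus the trailing "p" prints only lines where the substitution matched.
    printf '%s\n' "$1" | sed -n 's/.*Request device \[\([0-9a-fA-F:.]*\)\].*/\1/p'
}

# Example usage on the host (commented out so the snippet has no side effects):
# line=$(dmesg | grep 'Request device' | tail -n 1)
# lspci -nn -s "$(dmar_fault_device "$line")"
```

If lspci shows the Smart HBA at that address, the faults are coming from the controller itself and not from some other device on the same bus.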

Try adding iommu=pt to your kernel command line: put it in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, run update-grub, and reboot. In passthrough mode the IOMMU doesn't translate DMA for devices that stay on the host (only for devices passed through to VMs), which usually fixes it.
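
If you'd rather script the change than edit the file by hand, something like the following might work. It is only a sketch under a few assumptions: add_kernel_param is a made-up helper name, the host boots via GRUB, the file has a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT="..." line, and sed is GNU sed (for -i). Back up the file first.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT
# in the given file, unless that exact parameter is already present.
add_kernel_param() {
    file=$1
    param=$2
    # Already there as a whole word inside the quoted value? Then do nothing.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    # Insert " <param>" just before the closing quote of the value.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 ${param}\"/" "$file"
}

# Example usage as root (commented out so the snippet has no side effects):
# cp /etc/default/grub /etc/default/grub.bak
# add_kernel_param /etc/default/grub iommu=pt
# update-grub
# reboot
```

The helper is idempotent, so running it twice doesn't add the parameter twice.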

If you don't need IOMMU/passthrough at all, intel_iommu=off would also work but is the bigger hammer.

And yeah, firmware 7.00 on the H240ar is pretty old. Might be worth updating that too when you have a maintenance window, HPE had a bunch of fixes for IOMMU-related stuff on Gen9.

What does your current kernel command line look like? Please post the output of cat /proc/cmdline.
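
If the command line is long, a small filter like this shows just the IOMMU-related bits. Again only a sketch: iommu_params is a hypothetical helper, and it only looks for the iommu=, intel_iommu= and amd_iommu= parameters.

```shell
#!/bin/sh
# Hypothetical helper: list IOMMU-related parameters in a kernel command line string.
iommu_params() {
    # Split on spaces (one parameter per line), then keep the IOMMU-related ones.
    # "|| true" keeps the exit status zero when nothing matches.
    printf '%s\n' "$1" | tr ' ' '\n' | grep -E '^(intel_iommu|amd_iommu|iommu)=' || true
}

# Example usage on the host (commented out so the snippet has no side effects):
# iommu_params "$(cat /proc/cmdline)"
```

No output means none of those parameters are set, so the kernel is using its built-in IOMMU defaults.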