Hello,
I have an affected system also: Supermicro MB, model #: H11DSi rev 2.x
BIOS: v3.5
Firmware: v1.52.23
MegaRAID 9660-16i
- Firmware: v8.17.1 (but also had issues on an older version)
- I tried the Proxmox-delivered driver, then updated to the Broadcom driver v8.17.1 (verified via "modinfo" that the new driver was in use).
The controller has four 3.8TB Micron NVMe drives connected.
(This worked for the past year on ESXi v7 perfectly, so I know the controller and drives are working in this server.)
Proxmox v9.1.9, Enterprise repo. Fully patched.
I tried configuring the drives as "JBOD" with ZFS raidz on top. That fails after any data migration.
Now, I have it configured back to hardware RAID5 and LVM on Proxmox.
I can reproduce the error every time via a simple VM clone operation. I get about 40GB copied, and the controller basically shuts down.
I also tried the kernel parameters "iommu=pt" and "amd_iommu=on". THIS IS WEIRD. Before the parameters, the controller would die after copying 47GB. Now, with the parameters, it has a long pause of 2-3 minutes, then continues for another 40-50GB, rinse, repeat. Nothing in dmesg this time.
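For anyone comparing notes, here is a minimal Python sketch for confirming that both parameters really appear on the running kernel's command line. It only reads /proc/cmdline, so it shows that the parameters were passed at boot, not that the kernel acted on them. The function names are made up for illustration.

```python
from pathlib import Path
from typing import Dict, List, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None.

    Naive whitespace split: quoted values containing spaces are not handled.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: Dict[str, str]) -> List[str]:
    """Return the wanted name=value pairs that are absent from cmdline."""
    active = parse_cmdline(cmdline)
    return [f"{k}={v}" for k, v in wanted.items() if active.get(k) != v]


if __name__ == "__main__":
    # The two parameters mentioned in this post.
    wanted = {"iommu": "pt", "amd_iommu": "on"}
    path = Path("/proc/cmdline")
    cmdline = path.read_text() if path.exists() else ""
    missing = missing_params(cmdline, wanted)
    print("both present" if not missing else f"not on command line: {missing}")
```

If something is reported as missing, the boot loader configuration was probably not refreshed after the edit.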
This is a long-running issue that is hard to diagnose and fix. What are my other options? I read that downgrading the kernel to an older version helps, but I do not know the exact steps for that.
-Brian