HBA PCI passthrough works on ~20% of boot attempts, locks up Proxmox otherwise

greyltc

Jul 20, 2023
Hi. I've been banging my head against this problem for a few days now.
I've got this HBA:

Code:
# lspci -nnkv -d 1000:005d
61:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] [1000:005d] (rev 02)
    Subsystem: Dell PERC H730P Adapter [1028:1f42]
    Flags: bus master, fast devsel, latency 0, IRQ 255, NUMA node 0, IOMMU group 4
    I/O ports at 8000 [size=256]
    Memory at b8d00000 (64-bit, non-prefetchable) [size=64K]
    Memory at b8c00000 (64-bit, non-prefetchable) [size=1M]
    Expansion ROM at <ignored> [disabled]
    Capabilities: [50] Power Management version 3
    Capabilities: [68] Express Endpoint, MSI 00
    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
    Capabilities: [c0] MSI-X: Enable- Count=97 Masked-
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [1e0] Secondary PCI Express
    Capabilities: [1c0] Power Budgeting <?>
    Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
    Kernel driver in use: vfio-pci
    Kernel modules: megaraid_sas

and I'd like to pass it through to my Linux guest. I'm doing all my testing with no media connected to the HBA. I've read https://pve.proxmox.com/wiki/PCI_Passthrough
My IOMMU groups are all split; no group numbers are shared between devices. I've got the megaraid_sas module blacklisted and, as you can see above, vfio-pci is bound to the device in its place.
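
For anyone wanting to double-check the group split on their own host, here is a minimal sketch that walks sysfs and prints each PCI device with its IOMMU group. The function name `list_iommu_groups` and its optional directory argument are made up for illustration; `/sys/kernel/iommu_groups` is the standard kernel location.

```shell
#!/bin/sh
# Print one line per PCI device, prefixed with its IOMMU group number.
# $1 is the directory to scan. It defaults to the real sysfs location and
# is only a parameter so the function can be tried on a test directory.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob matched nothing
        group="${dev%/devices/*}"      # strip "/devices/<address>"
        group="${group##*/}"           # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}
```

Calling `list_iommu_groups | grep 'group 4:'` should then print a single line for 0000:61:00.0 if the HBA really is alone in group 4, as the lspci output above suggests.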

After playing with settings (PCI-Express checkbox, ROM-Bar checkbox, CPU type, vfio_iommu_type1.allow_unsafe_interrupts, and many other things), I finally got my VM to boot with the HBA passed through to the (Arch Linux) guest OS, as verified by lspci showing the HBA in the guest. But it turns out this only works maybe 1 out of 5 times when I shut down and restart the VM. The other 4 out of 5 times, the whole Proxmox server crashes instantly. When it crashes, there are no errors, just a hard lockup of the Proxmox OS. Nothing in the journal, nothing in dmesg, nothing in kvm stderr. Nothing.

The "working" (maybe 1 in 5 VM boots) qemu config looks like this:
Code:
balloon: 0
bios: ovmf
boot: order=ide2
cores: 1
cpu: IvyBridge
efidisk0: local-zfs:vm-103-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:61:00
ide2: local:iso/archlinux-2023.07.01-x86_64.iso,media=cdrom,size=832844K
machine: q35
memory: 2048
meta: creation-qemu=8.0.2,ctime=1690173855
name: archbase2
numa: 0
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=714c8cd5-e224-4562-846d-1737a1944c64
sockets: 1
vmgenid: 7a0bc34d-13c9-4918-a9d9-75190cd58f2f

When I remove the hostpci0: 0000:61:00 passthrough line from the qemu config, the VM, as well as the rest of PVE, works as expected.

What can make a PCI passthrough setup work as expected on some VM boots, but cause the host OS to lock up hard on others with no configuration changes in between?
 
If a device does not reset properly (even though it advertises FLR), it won't work a second time, and it can take down the PCIe bus and the whole Proxmox host. Such devices only work once, and only if you make sure nothing (like the actual driver) touches them before the VM does.
It's not uncommon for PCI(e) devices to not reset properly (or not work at all inside a VM), as manufacturers do not design and test them for passthrough. For several AMD GPU generations, people even wrote a special driver to work around very similar reset problems.
Maybe someone here knows a specific work-around for your device?
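
To see which reset mechanisms the host kernel thinks the device supports, something like the sketch below may help. It assumes a kernel new enough (5.15 or later) to expose the `reset_method` attribute in sysfs; the function name `show_reset_info` and its second argument are made up for illustration. Note that this only shows what the device advertises, not whether the reset actually works.

```shell
#!/bin/sh
# Report what the host kernel exposes about resetting a PCI device.
# $1: PCI address, e.g. 0000:61:00.0
# $2: sysfs PCI device directory (a parameter only so it can be tested).
show_reset_info() {
    dev="${2:-/sys/bus/pci/devices}/$1"
    if [ ! -d "$dev" ]; then
        echo "no such device: $1"
        return 1
    fi
    if [ -r "$dev/reset_method" ]; then
        # Space-separated list such as "flr bus", in the order tried.
        echo "reset methods: $(cat "$dev/reset_method")"
    elif [ -e "$dev/reset" ]; then
        echo "reset supported, method list not exposed by this kernel"
    else
        echo "no reset interface exposed"
    fi
}
```

For the card in this thread the call would be `show_reset_info 0000:61:00.0`. Separately, `lspci -vv` prints FLReset+ or FLReset- in the DevCap line, which the shorter `-v` output above does not include.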
 
@greyltc did you ever find a solution?

I stumbled across your post, and I get pretty much the same results using the same card. I also blacklisted the driver and set up the vfio-pci options, so the controller shows up as below. It's the only thing in its IOMMU group, and it's not used by the host or other VMs.
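
For readers following along, the blacklist and vfio-pci setup mentioned here usually comes down to two small modprobe files. The sketch below is an assumption about what that looks like, not a copy of either poster's files: the file names and the `write_vfio_conf` helper are made up, and the ID 1000:005d is taken from the first post's lspci output.

```shell
#!/bin/sh
# Write modprobe config that keeps megaraid_sas off the HBA and lets
# vfio-pci claim it by vendor:device ID.
# $1: vendor:device ID, e.g. 1000:005d
# $2: target directory (defaults to /etc/modprobe.d; a parameter only
#     so the function can be tried on a scratch directory first).
write_vfio_conf() {
    ids="$1"
    dir="${2:-/etc/modprobe.d}"
    # File names below are arbitrary; any *.conf name works.
    printf 'blacklist megaraid_sas\n' > "$dir/blacklist-megaraid.conf"
    printf 'options vfio-pci ids=%s\n' "$ids" > "$dir/vfio.conf"
}
```

After `write_vfio_conf 1000:005d` on the host, the Proxmox wiki's usual follow-up is `update-initramfs -u -k all` and a reboot, then checking that lspci reports "Kernel driver in use: vfio-pci" as in the output below.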

What @leesteken described is pretty much what I'm observing. Add the HBA to the VM with PCI device passthrough and boot: works fine. Reboot the VM: works fine. But shut down the VM and start it up again, and the entire server locks up. From then on, it locks up the server each time it boots, unless I remove the PCI device passthrough and re-add it to the list; then the whole process starts over.


Usually the console just locks up with no error, but one time after the system crashed I did get a nasty error on the console and snapped a screenshot, added below.


Code:
1b:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
        Subsystem: Super Micro Computer Inc MegaRAID SAS-3 3108 [Invader]
        Flags: bus master, fast devsel, latency 0, IRQ 11, NUMA node 0, IOMMU group 10
        I/O ports at 5000 [size=256]
        Memory at aab00000 (64-bit, non-prefetchable) [size=64K]
        Memory at aaa00000 (64-bit, non-prefetchable) [size=1M]
        Expansion ROM at aa900000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 3
        Capabilities: [68] Express Endpoint, MSI 00
        Capabilities: [d0] Vital Product Data
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=97 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [1e0] Secondary PCI Express
        Capabilities: [1c0] Power Budgeting <?>
        Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: vfio-pci
        Kernel modules: megaraid_sas

Screenshot 2023-08-26 at 9.25.24 PM.png
 
@Farhover I found no solution. I gave up. Such a bummer. I replaced the Dell PERC H730P with a Dell PERC H330, which works just fine.
 
OK, thanks. I went ahead and purchased the Supermicro AOC-S3008L-L8E, which has the SAS 3008 chip like your Dell PERC H330.
 
