PCIe Passthrough crashes host

SunBlack

Jun 22, 2017

As requested [here](https://forum.proxmox.com/threads/opt-in-linux-7-0-kernel-for-proxmox-ve-9-available.182328/post-855630), PCIe Passthrough issues should be posted in a new thread, so here it is:

I can also confirm that kernel 7.0 contains a regression related to PCIe Passthrough.

On a fully patched Proxmox 9.2.3 (no-subscription repository), PCIe Passthrough of our Nvidia RTX 5090 works fine with kernel 6.17.13-13-pve. When we boot Proxmox with 7.0.6-2-pve instead, the host starts normally, but as soon as the VM with the PCIe Passthrough device is started, the host crashes: I can't even reach the server via SSH anymore.
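
Since the host becomes unreachable, the kernel messages from the crashed boot can probably only be read after a reboot. A minimal sketch, assuming the systemd journal is persistent on this host (otherwise the previous boot is not kept): the hypothetical helper filter_passthrough_log only filters kernel log text for lines that often matter for passthrough problems. The keyword list is my own guess, not an official one.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that commonly matter for
# PCIe passthrough problems. The keyword list is an assumption; extend it.
filter_passthrough_log() {
    # -i: case-insensitive, -E: extended regex for the alternation;
    # "41:00\." matches the GPU's PCI address from this thread
    grep -iE 'vfio|iommu|dmar|amd-vi|aer|nova|nouveau|41:00\.'
}

# Usage after rebooting the crashed host (needs a persistent journal):
#   journalctl -k -b -1 --no-pager | filter_passthrough_log
```
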

Output of lspci -v -nn with 6.17.13-13:
Code:
41:00.0 VGA compatible controller [0300]: NVIDIA Corporation GB202 [GeForce RTX 5090] [10de:2b85] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: InnoVISION Multimedia Ltd. Device [1771:2059]
        Flags: bus master, fast devsel, latency 0, IRQ 327, NUMA node 0, IOMMU group 63
        Memory at b4000000 (32-bit, non-prefetchable) [size=64M]
        Memory at 80120000000 (64-bit, prefetchable) [size=256M]
        Memory at 80130000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 6000 [size=128]
        Expansion ROM at b8000000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
        Capabilities: [60] Express Legacy Endpoint, IntMsgNum 0
        Capabilities: [9c] Vendor Specific Information: Len=14 <?>
        Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
        Capabilities: [100] Secondary PCI Express
        Capabilities: [12c] Latency Tolerance Reporting
        Capabilities: [134] Physical Resizable BAR
        Capabilities: [140] Virtual Resizable BAR
        Capabilities: [14c] Data Link Feature <?>
        Capabilities: [158] Physical Layer 16.0 GT/s <?>
        Capabilities: [188] Physical Layer 32.0 GT/s <?>
        Capabilities: [1b8] Advanced Error Reporting
        Capabilities: [200] Lane Margining at the Receiver
        Capabilities: [248] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [250] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [290] L1 PM Substates
        Capabilities: [2a4] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
        Capabilities: [2bc] Power Budgeting <?>
        Capabilities: [2f4] Device Serial Number 7d-12-38-17-f5-2d-b0-48
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau

Output with 7.0.6-2-pve and 7.0.2-4-pve:
Code:
41:00.0 VGA compatible controller [0300]: NVIDIA Corporation GB202 [GeForce RTX 5090] [10de:2b85] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: InnoVISION Multimedia Ltd. Device [1771:2059]
        Flags: bus master, fast devsel, latency 0, IRQ 11, NUMA node 0, IOMMU group 63
        Memory at b4000000 (32-bit, non-prefetchable) [size=64M]
        Memory at 80120000000 (64-bit, prefetchable) [size=256M]
        Memory at 80130000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 6000 [size=128]
        Expansion ROM at b8000000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
        Capabilities: [60] Express Legacy Endpoint, IntMsgNum 0
        Capabilities: [9c] Vendor Specific Information: Len=14 <?>
        Capabilities: [b0] MSI-X: Enable- Count=9 Masked-
        Capabilities: [100] Secondary PCI Express
        Capabilities: [12c] Latency Tolerance Reporting
        Capabilities: [134] Physical Resizable BAR
        Capabilities: [140] Virtual Resizable BAR
        Capabilities: [14c] Data Link Feature <?>
        Capabilities: [158] Physical Layer 16.0 GT/s <?>
        Capabilities: [188] Physical Layer 32.0 GT/s <?>
        Capabilities: [1b8] Advanced Error Reporting
        Capabilities: [200] Lane Margining at the Receiver
        Capabilities: [248] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [250] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [290] L1 PM Substates
        Capabilities: [2a4] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
        Capabilities: [2bc] Power Budgeting <?>
        Capabilities: [2f4] Device Serial Number 7d-12-38-17-f5-2d-b0-48
        Kernel modules: nvidiafb, nouveau, nova_core
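
Comparing the two outputs, the differences seem limited to the IRQ (327 vs. 11), the MSI-X enable bit (Enable+ vs. Enable-), the missing "Kernel driver in use" line, and the additional nova_core module under 7.0. The IRQ and MSI-X differences may only mean that no driver had claimed the card at that point, so they are not necessarily the cause. A small sketch (the helper name summarize_lspci is hypothetical) that extracts just those lines, so the output of two kernels can be compared with diff; run lspci as root, otherwise the capabilities show up as "access denied":

```shell
#!/bin/sh
# Hypothetical helper: reduce `lspci -v` output on stdin to the lines that
# differ between the 6.17 and 7.0 outputs above (IRQ/IOMMU flags, MSI-X
# state, bound driver, candidate modules), without leading whitespace.
summarize_lspci() {
    grep -E 'Flags:|MSI-X:|Kernel driver in use:|Kernel modules:' |
        sed 's/^[[:space:]]*//'
}

# Usage, once per kernel (file names are only an example):
#   lspci -v -nn -s 41:00.0 | summarize_lspci > "lspci-$(uname -r).txt"
#   diff lspci-6.17.13-13-pve.txt lspci-7.0.6-2-pve.txt
```
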

Content of /etc/modprobe.d/vfio.conf:
Code:
options vfio-pci ids=10de:2b85,10de:XXXX disable_vga=1
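
The second ID is elided above (10de:XXXX). A hedged sketch for collecting the vendor:device IDs of all functions in slot 41:00 (the GPU and, typically, its audio function) in the comma-separated form that ids= expects. The helper name vfio_ids_from_lspci is hypothetical; it only parses lspci -nn output from stdin:

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -nn` output on stdin and print the
# vendor:device IDs (the bracketed "[vvvv:dddd]" fields) as a comma-separated,
# de-duplicated list in order of appearance.
vfio_ids_from_lspci() {
    grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' |  # class codes like [0300] have no colon
        cut -c2-10 |                            # "[10de:2b85]" -> "10de:2b85"
        awk '!seen[$0]++' |                     # drop duplicates, keep order
        paste -sd, -                            # join the lines with commas
}

# Usage (without -v, so the subsystem ID is not picked up as well):
#   echo "options vfio-pci ids=$(lspci -nn -s 41:00 | vfio_ids_from_lspci) disable_vga=1"
```
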

Content of /etc/modprobe.d/blacklist.conf:
Code:
blacklist nova
blacklist nova_core
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nouveau
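
As far as I understand modprobe.d(5), a blacklist entry only stops a module from being loaded automatically through its aliases; it can still be loaded explicitly or as a dependency of another module. A small sketch to check which of the blacklisted modules are actually loaded after boot, by reading /proc/modules (the helper name and its optional file argument, which only exists for testing, are hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: print every module from blacklist.conf above that is
# currently loaded. /proc/modules lists one loaded module per line, with the
# module name (using underscores) as the first field.
loaded_blacklisted() {
    modules_file=${1:-/proc/modules}
    for mod in nova nova_core nvidia nvidia_drm nvidia_modeset nouveau; do
        # The trailing space prevents "nova" from matching "nova_core"
        if grep -q "^$mod " "$modules_file"; then
            echo "$mod"
        fi
    done
}

# Usage: loaded_blacklisted    # no output means none of them is loaded
```
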

Since dmesg | grep -i vfio didn't show anything, I adjusted /etc/modules-load.d/vfio.conf:
Code:
vfio
vfio_pci
vfio_iommu_type1
vfio_virqfd
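
As far as I know, vfio_virqfd was merged into the vfio module in Linux 6.2, so on the 6.17 and 7.0 kernels that entry most likely refers to a module that no longer exists and simply fails to load. A minimal sketch for checking which entries of a modules-load.d file are not loaded after boot; it assumes the drivers are modules rather than built-ins, and the helper name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: list entries of a modules-load.d(5) file that do not
# appear in /proc/modules. Built-in drivers never appear there either, so a
# listed name can also mean "built in" rather than "failed to load".
missing_from_modules_load() {
    conf=$1
    modules_file=${2:-/proc/modules}
    # Skip empty lines and comment lines starting with '#' or ';'
    grep -vE '^[[:space:]]*([#;]|$)' "$conf" | while read -r mod; do
        # /proc/modules always uses underscores; config files may use dashes
        mod=$(printf '%s' "$mod" | tr '-' '_')
        if ! grep -q "^$mod " "$modules_file"; then
            echo "$mod"
        fi
    done
}

# Usage: missing_from_modules_load /etc/modules-load.d/vfio.conf
```
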

After update-initramfs -u and a reboot, I finally got the line "Kernel driver in use: vfio-pci" with kernel 7.0.x as well, but starting the VM still crashes the host:
Code:
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau, nova_core
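
To confirm the binding without parsing lspci, the driver currently bound to a PCI function can also be read from sysfs, where /sys/bus/pci/devices/<address>/driver is a symlink to the driver's directory. A small sketch (the helper name and its optional sysfs-root argument, which only exists for testing, are hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: print the driver bound to a PCI function, or "none".
# $1: full PCI address, e.g. 0000:41:00.0
# $2: optional sysfs root (defaults to /sys; only useful for testing)
driver_of() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The symlink points at .../drivers/<name>; keep only <name>
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# Usage: driver_of 0000:41:00.0   # prints vfio-pci once vfio-pci has claimed the GPU
```
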
 