Pass-through LSI 9201-8i Crash/Lock-up

Devious

New Member
Nov 1, 2022
Hi there,

I've got an issue with PCI passthrough on Proxmox VE 7.4, regardless of which VM I try it with, and I've used every kernel variation I could get installed (5.15.x–6.x) to try to make it work.

Whenever I try to pass the HBA (LSI 9201-8i, flashed to IT mode) through to any VM I create (Rockstor, TrueNAS Core/Scale, generic Linux...), the server crashes/locks up with very little output in the logs.
The only apparent kernel crash I was able to capture was when I blacklisted `mpt2sas` via the kernel cmdline or via `/etc/modprobe.d/pve-blacklist.conf`:
as soon as it boots up, it proceeds with loading the system and then there's a `kernel panic` (attached).
I've tried, to the best of my knowledge, to enable a serial tty, netconsole, and kdump too, but that got me nowhere, as I can't get solid output from any of them either.

[attachment: kernel-crash.jpg]
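
For reference, this is roughly how I set up the capture attempts (the IPs, MAC, ports and interface below are placeholders, not my real values):

Code:
# serial console: append to the kernel cmdline (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub),
# then run update-grub:
#   console=ttyS0,115200n8 console=tty0

# netconsole: ship kernel messages over UDP to a second machine
modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff

# on the receiving machine, just listen:
nc -u -l 6666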


I'm all out of clues...

BTW: passing through any other device works properly; a Quadro P400 was tested to work out whether the problem is the motherboard or the HBA.
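
For reference, the passthrough itself was configured the standard way in all these attempts (VM ID 100 is a placeholder):

Code:
# pcie=1 assumes the q35 machine type
qm set 100 -hostpci0 0000:01:00.0,pcie=1
# which lands in /etc/pve/qemu-server/100.conf as:
#   hostpci0: 0000:01:00.0,pcie=1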

Configuration:

Code:
    Proxmox VE (proxmox-ve: 7.4-1 (running kernel: 6.2.9-1-pve))
    Intel 11th Gen @ 2.20GHz
    32GB DDR4 (NON ECC)
    Motherboard has 1x (PCIe x1) 1x (PCIe x16)
    2x NVMe slots
    1x Wifi slot (empty)
    RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (Built-In)
    RTL8125 2.5GbE Controller (PCIe x1)
    Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 (PCIe x16) "Genuine" - Thank you Art Of Server :cool:
    
      Boot/System Drive
      nvme0n1 238.5G SK hynix BC501 HFM256GDJTNG-8310A
    
    pveversion -v
    proxmox-ve: 7.4-1 (running kernel: 6.2.9-1-pve)
    pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
    pve-kernel-5.15: 7.4-2
    pve-kernel-6.2.9-1-pve: 6.2.9-1
    pve-kernel-5.15.107-1-pve: 5.15.107-1
    pve-kernel-5.15.102-1-pve: 5.15.102-1
    ceph-fuse: 15.2.17-pve1
    corosync: 3.1.7-pve1
    criu: 3.15-1+pve-1
    glusterfs-client: 9.2-1
    ifupdown2: 3.1.0-1+pmx3
    ksm-control-daemon: 1.4-1
    libjs-extjs: 7.0.0-1
    libknet1: 1.24-pve2
    libproxmox-acme-perl: 1.4.4
    libproxmox-backup-qemu0: 1.3.1-1
    libproxmox-rs-perl: 0.2.1
    libpve-access-control: 7.4-2
    libpve-apiclient-perl: 3.2-1
    libpve-common-perl: 7.3-4
    libpve-guest-common-perl: 4.2-4
    libpve-http-server-perl: 4.2-3
    libpve-rs-perl: 0.7.5
    libpve-storage-perl: 7.4-2
    libspice-server1: 0.14.3-2.1
    lvm2: 2.03.11-2.1
    lxc-pve: 5.0.2-2
    lxcfs: 5.0.3-pve1
    novnc-pve: 1.4.0-1
    proxmox-backup-client: 2.4.1-1
    proxmox-backup-file-restore: 2.4.1-1
    proxmox-kernel-helper: 7.4-1
    proxmox-mail-forward: 0.1.1-1
    proxmox-mini-journalreader: 1.3-1
    proxmox-widget-toolkit: 3.6.5
    pve-cluster: 7.3-3
    pve-container: 4.4-3
    pve-docs: 7.4-2
    pve-edk2-firmware: 3.20230228-2
    pve-firewall: 4.3-1
    pve-firmware: 3.6-5
    pve-ha-manager: 3.6.1
    pve-i18n: 2.12-1
    pve-qemu-kvm: 7.2.0-8
    pve-xtermjs: 4.16.0-1
    qemu-server: 7.4-3
    smartmontools: 7.2-pve3
    spiceterm: 3.2-2
    swtpm: 0.8.0~bpo11+3
    vncterm: 1.7-1
    zfsutils-linux: 2.1.11-pve1

Code:
VFIO checklist

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.2.6-1-pve root=/dev/mapper/pve-root ro intel_iommu=on iommu=pt modprobe.blacklist=mpt3sas net.ifnames=0 biosdevname=0

# dmesg | grep -e DMAR -e IOMMU
[    0.036408] ACPI: DMAR 0x000000003639B000 000088 (v02 INTEL  EDK2     00000002      01000013)
[    0.036432] ACPI: Reserving DMAR table memory at [mem 0x3639b000-0x3639b087]
[    0.073674] DMAR: IOMMU enabled
[    0.139545] DMAR: Host address width 39
[    0.139546] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[    0.139551] DMAR: dmar0: reg_base_addr fed90000 ver 4:0 cap 1c0000c40660462 ecap 29a00f0505e
[    0.139554] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[    0.139559] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
[    0.139562] DMAR: RMRR base: 0x0000003f000000 end: 0x0000004f7fffff
[    0.139565] DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 1
[    0.139567] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[    0.139568] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.141131] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    0.317923] pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics
[    0.393003] DMAR: No ATSR found
[    0.393004] DMAR: No SATC found
[    0.393005] DMAR: IOMMU feature fl1gp_support inconsistent
[    0.393006] DMAR: IOMMU feature pgsel_inv inconsistent
[    0.393007] DMAR: IOMMU feature nwfs inconsistent
[    0.393008] DMAR: IOMMU feature dit inconsistent
[    0.393009] DMAR: IOMMU feature sc_support inconsistent
[    0.393010] DMAR: IOMMU feature dev_iotlb_support inconsistent
[    0.393011] DMAR: dmar0: Using Queued invalidation
[    0.393014] DMAR: dmar1: Using Queued invalidation
[    0.393338] DMAR: Intel(R) Virtualization Technology for Directed I/O

cat /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

cat /etc/modprobe.d/pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE
# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
#blacklist mpt2sas
blacklist mpt3sas


# bash iommu.sh
IOMMU Group 0:
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:9a60]
IOMMU Group 1:
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:9a36] (rev 04)
IOMMU Group 2:
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:9a01] (rev 04)
IOMMU Group 3:
00:04.0 Signal processing controller [1180]: Intel Corporation Device [8086:9a03] (rev 04)
IOMMU Group 4:
00:08.0 System peripheral [0880]: Intel Corporation Device [8086:9a11] (rev 04)
IOMMU Group 5:
00:0a.0 Signal processing controller [1180]: Intel Corporation Device [8086:9a0d] (rev 01)
IOMMU Group 6:
00:0d.0 USB controller [0c03]: Intel Corporation Tiger Lake-H Thunderbolt 4 USB Controller [8086:9a17] (rev 04)
IOMMU Group 7:
00:14.0 USB controller [0c03]: Intel Corporation Device [8086:43ed] (rev 11)
00:14.2 RAM memory [0500]: Intel Corporation Device [8086:43ef] (rev 11)
IOMMU Group 8:
00:15.0 Serial bus controller [0c80]: Intel Corporation Device [8086:43e8] (rev 11)
00:15.1 Serial bus controller [0c80]: Intel Corporation Device [8086:43e9] (rev 11)
00:15.2 Serial bus controller [0c80]: Intel Corporation Device [8086:43ea] (rev 11)
00:15.3 Serial bus controller [0c80]: Intel Corporation Device [8086:43eb] (rev 11)
IOMMU Group 9:
00:16.0 Communication controller [0780]: Intel Corporation Device [8086:43e0] (rev 11)
IOMMU Group 10:
00:17.0 SATA controller [0106]: Intel Corporation Device [8086:43d3] (rev 11)
IOMMU Group 11:
00:19.0 Serial bus controller [0c80]: Intel Corporation Device [8086:43ad] (rev 11)
00:19.1 Serial bus controller [0c80]: Intel Corporation Device [8086:43ae] (rev 11)
IOMMU Group 12:
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:43bc] (rev 11)
IOMMU Group 13:
00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:43b2] (rev 11)
IOMMU Group 14:
00:1d.3 PCI bridge [0604]: Intel Corporation Device [8086:43b3] (rev 11)
IOMMU Group 15:
00:1e.0 Communication controller [0780]: Intel Corporation Device [8086:43a8] (rev 11)
00:1e.3 Serial bus controller [0c80]: Intel Corporation Device [8086:43ab] (rev 11)
IOMMU Group 16:
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:438b] (rev 11)
00:1f.3 Audio device [0403]: Intel Corporation Device [8086:43c8] (rev 11)
00:1f.4 SMBus [0c05]: Intel Corporation Device [8086:43a3] (rev 11)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device [8086:43a4] (rev 11)
IOMMU Group 17:
01:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
IOMMU Group 18:
02:00.0 Non-Volatile memory controller [0108]: SK hynix BC501 NVMe Solid State Drive 512GB [1c5c:1327]
IOMMU Group 19:
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
IOMMU Group 20:
04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
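
For reference, iommu.sh is essentially the usual IOMMU-group listing snippet, something like:

Code:
#!/bin/bash
# list every PCI device, grouped by IOMMU group
for g in $(find /sys/kernel/iommu_groups/* -maxdepth 0 -type d | sort -V); do
    echo "IOMMU Group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done

Note the HBA sits alone in group 17, so IOMMU grouping shouldn't be the problem here.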

Going the manual route gives the same output:

Code:
root@nas:~# cat /sys/bus/pci/devices/0000:01:00.0/driver_override
(null)
root@nas:~# echo "vfio-pci" > /sys/bus/pci/devices/0000:01:00.0/driver_override
root@nas:~# cat /sys/bus/pci/devices/0000:01:00.0/driver_override
vfio-pci
root@nas:~# modprobe -r vfio-pci
root@nas:~# cat /sys/bus/pci/devices/0000:01:00.0/driver_override
vfio-pci
root@nas:~# lspci -nnk

01:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
Subsystem: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072]
Kernel modules: mpt3sas

root@nas:~# modprobe -i --first-time vfio-pci
root@nas:~# modprobe -i --first-time vfio-pci
modprobe: ERROR: could not insert 'vfio_pci': Module already in kernel
root@nas:~# client_loop: send disconnect: Broken pipe


This output was taken without blacklisting `mpt2sas` or `mpt3sas`, as I understood that this part is automated and "should work unless there is a need for manual work": PVE tries to detach the device from the loaded driver in order to pass it to the VM... which results in a lock-up!!
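
In other words, roughly this sequence (a sketch; the detach from the host driver is where the box seems to wedge):

Code:
# tell the next probe to pick vfio-pci for the HBA
echo vfio-pci > /sys/bus/pci/devices/0000:01:00.0/driver_override
# detach it from mpt3sas -- this triggers the SCSI teardown in the dmesg below
echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
# re-probe so it binds to vfio-pci instead
echo 0000:01:00.0 > /sys/bus/pci/drivers_probe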

Code:
# dmesg -w (ssh)
[   47.981207] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   47.981226] sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[   48.109253] sd 0:0:1:0: [sdb] Synchronizing SCSI cache
[   48.109272] sd 0:0:1:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[   48.177267] sd 0:0:2:0: [sdc] Synchronizing SCSI cache
[   48.177288] sd 0:0:2:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[   48.245242] sd 0:0:3:0: [sdd] Synchronizing SCSI cache
[   48.245261] sd 0:0:3:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[   48.245910] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221100000000)
[   48.245914] mpt2sas_cm0: removing handle(0x0009), sas_addr(0x4433221100000000)
[   48.245916] mpt2sas_cm0: enclosure logical id(0x500605b00973f5f0), slot(3)
[   48.245917] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221103000000)
[   48.245918] mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221103000000)
[   48.245920] mpt2sas_cm0: enclosure logical id(0x500605b00973f5f0), slot(0)
[   48.245921] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221101000000)
[   48.245922] mpt2sas_cm0: removing handle(0x000b), sas_addr(0x4433221101000000)
[   48.245923] mpt2sas_cm0: enclosure logical id(0x500605b00973f5f0), slot(2)
[   48.245924] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
[   48.245925] mpt2sas_cm0: removing handle(0x000c), sas_addr(0x4433221102000000)
[   48.245926] mpt2sas_cm0: enclosure logical id(0x500605b00973f5f0), slot(1)
[   48.246025] mpt2sas_cm0: sending message unit reset !!
[   48.247671] mpt2sas_cm0: message unit reset: SUCCESS
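
For reference, the usual alternative to blacklisting the whole driver is to pin just this card to vfio-pci at boot by its vendor:device ID (1000:0072 per the lspci output above), so the host driver never touches it and there is nothing to unbind at VM start; something like:

Code:
# /etc/modprobe.d/vfio.conf
options vfio-pci ids=1000:0072
# make sure vfio-pci loads before mpt3sas can grab the card
softdep mpt3sas pre: vfio-pci

# then rebuild the initramfs and reboot:
# update-initramfs -u -k all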
 
