JBOD HBA PCI passthrough crashes host

trimbljk

New Member
May 3, 2024
I have an older server with a Supermicro X9DRX+-F motherboard that I'm running Proxmox on. A similarly old JBOD is connected to it via an Adaptec 8405 RAID controller. I'm attempting to pass the JBOD through to a TrueNAS VM on the server using PCI passthrough, but when I start the VM it crashes the whole system.

I have added the requisite kernel parameters to my GRUB config (GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt") and switched the RAID controller to HBA mode. I can pass an individual disk through to TrueNAS, but I can't pass the entire PCI device to the VM without it crashing the system.
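For reference, a minimal sketch of that GRUB change and how to verify the IOMMU actually came up after a reboot (stock Debian/Proxmox paths assumed):

Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# apply the change and reboot
update-grub
reboot

# after the reboot, confirm the IOMMU is enabled
dmesg | grep -e DMAR -e IOMMU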

The PCI controller is by itself in its own IOMMU group (24). I'm at a loss and could use some help.
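A quick sketch of how to list the members of each IOMMU group (plain sysfs, nothing Proxmox-specific), to confirm the controller really is alone in group 24:

Code:
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -n "  "
        lspci -nns "${d##*/}"
    done
done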
 
Just to make sure: you don't have your boot disk on that controller as well? (Because that can't work, of course.)

Can you post the journal, or the output of 'dmesg'?

What's your 'pveversion -v'?
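Since the host hard-locks, the kernel log from the previous (crashed) boot is usually the interesting one; a sketch, assuming persistent journaling is enabled:

Code:
# kernel messages from the previous boot
journalctl -k -b -1

# current boot, filtered for passthrough-related messages
dmesg | grep -iE 'vfio|dmar|iommu|aacraid'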
 
Hey dcsapak, thanks for the response. Unless I'm crazy, I don't believe so. Attached is a screenshot of the drives I have mounted: sda–sdm are from the JBOD, and the boot partition, sdw, is on the server itself. I'm running pve 7.4.1. Here is the output of dmesg; I'm not sure how far back you want to go:


Code:
[   13.305742] audit: type=1400 audit(1716294521.921:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="swtpm" pid=1531 comm="apparmor_parser"
[   13.306036] audit: type=1400 audit(1716294521.921:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=1532 comm="apparmor_parser"
[   13.306397] audit: type=1400 audit(1716294521.921:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=1530 comm="apparmor_parser"
[   13.307351] audit: type=1400 audit(1716294521.921:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1526 comm="apparmor_parser"
[   13.307357] audit: type=1400 audit(1716294521.921:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1526 comm="apparmor_parser"
[   13.307761] audit: type=1400 audit(1716294521.921:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=1533 comm="apparmor_parser"
[   13.307766] audit: type=1400 audit(1716294521.921:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=1533 comm="apparmor_parser"
[   13.307769] audit: type=1400 audit(1716294521.921:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=1533 comm="apparmor_parser"
[   13.309652] audit: type=1400 audit(1716294521.925:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" pid=1528 comm="apparmor_parser"
[   13.312493] audit: type=1400 audit(1716294521.925:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=1529 comm="apparmor_parser"
[   13.338553] softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
[   13.338560] softdog:              soft_reboot_cmd=<not set> soft_active_on_boot=0
[   13.714295] vmbr0: port 1(eno1) entered blocking state
[   13.714301] vmbr0: port 1(eno1) entered disabled state
[   13.714367] device eno1 entered promiscuous mode
[   15.401009] igb 0000:07:00.0 eno1: igb: eno1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[   15.401168] vmbr0: port 1(eno1) entered blocking state
[   15.401172] vmbr0: port 1(eno1) entered forwarding state
[   15.401359] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
[   16.081315] bpfilter: Loaded bpfilter_umh pid 1877
[   16.081582] Started bpfilter
[   26.304466] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
[   65.318382] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   65.318416] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[   65.318417] {1}[Hardware Error]: event severity: corrected
[   65.318419] {1}[Hardware Error]:  Error 0, type: corrected
[   65.318420] {1}[Hardware Error]:  fru_text: CorrectedErr
[   65.318421] {1}[Hardware Error]:   section_type: PCIe error
[   65.318422] {1}[Hardware Error]:   port_type: 0, PCIe end point
[   65.318423] {1}[Hardware Error]:   version: 0.0
[   65.318424] {1}[Hardware Error]:   command: 0xffff, status: 0xffff
[   65.318425] {1}[Hardware Error]:   device_id: 0000:80:02.3
[   65.318426] {1}[Hardware Error]:   slot: 0
[   65.318427] {1}[Hardware Error]:   secondary_bus: 0x00
[   65.318428] {1}[Hardware Error]:   vendor_id: 0xffff, device_id: 0xffff
[   65.318429] {1}[Hardware Error]:   class_code: ffffff
[  109.345339] sd 0:1:31:0: [sdl] tag#791 Sense Key : Recovered Error [current]
[  109.345365] sd 0:1:31:0: [sdl] tag#791 Add. Sense: Defect list not found
root@pve:~#

I should also specify that I have another RAID controller of a similar make and model that connects to the hard drives in the server itself. I have verified the device IDs, though, and can confirm I chose the right PCI controller to pass through, because when I passed the other one through it broke Proxmox functionality as well. Curiously, it doesn't crash the system nearly as badly as the JBOD controller does.
 

Attachments

  • Screenshot 2024-05-21 at 8.31.11 AM.png
What do you mean by that? What exactly crashes / does not work?
With the other controller, I can still navigate around in Proxmox, but the CLI in the shell is broken and some items can't load properly. However, when I pass the JBOD PCI device through, everything freezes and sits in a loading state. If I refresh the browser, it can't find Proxmox anymore.
 
OK, do they both use the same driver? (You can see that in lspci -nnk.) Maybe you'd have to blacklist the driver, or preload the vfio-pci driver for that card so the host driver never gets loaded and the host never accesses the drives.
 
Here is the output from lspci -nnk:


Code:
01:00.0 RAID bus controller [0104]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
        Subsystem: Adaptec Series 8 - ASR-8885 - 8 internal 8 external 12G SAS Port/PCIe 3.0 [9005:0554]
        Kernel driver in use: aacraid
        Kernel modules: aacraid

86:00.0 RAID bus controller [0104]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
        Subsystem: Adaptec Series 8 12G SAS/PCIe 3 [9005:0557]
        Kernel driver in use: aacraid
        Kernel modules: aacraid

It would appear that they're indeed using the same driver.
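Worth noting: both cards report the same vendor:device ID [9005:028d] (only the subsystem IDs differ), so binding vfio-pci by ID would claim both of them, including the one serving the host's own disks. A minimal sketch of the ID-based early binding, only safe if the host can live without the second card:

Code:
# /etc/modprobe.d/vfio.conf
# let vfio-pci claim the controller before aacraid can
options vfio-pci ids=9005:028d
softdep aacraid pre: vfio-pci

# rebuild the initramfs and reboot
update-initramfs -u -k all

To grab only one of the two cards, the device would instead have to be bound by its PCI address (for example via driver_override in an early-boot script), since the IDs collide.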
 
I was able to change the driver for both RAID cards, and the system still crashes, forcing me to do a hard reset on the server. Maybe there's something at the BIOS level?
 
Yes, well, that can be the case... You could try switching the PCIe slots of the two cards? Maybe that makes a difference.
 
