JBOD HBA PCI passthrough crashes host

trimbljk

New Member
May 3, 2024
5
0
1
I have an older server, a Supermicro X9DRX+-F motherboard, that I'm running Proxmox on. A similarly old JBOD is connected to it via an Adaptec 8405 RAID controller. I'm attempting to pass the JBOD through to a TrueNAS VM on the server using PCI passthrough, but when I start the VM it crashes the whole system.

I have added the requisite options to my GRUB config: GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt", and changed my RAID controller to HBA mode. I can add a hard disk directly to the TrueNAS VM, but I can't pass the entire PCI device through without it crashing the system.
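(For reference, the usual Proxmox passthrough setup involves a bit more than the GRUB line. A minimal sketch, assuming a GRUB-booted PVE 7.x host as in this thread; systemd-boot installs use /etc/kernel/cmdline instead:)

```shell
# /etc/default/grub already carries (as quoted above):
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# The change only takes effect after regenerating the grub config
# and rebooting:
update-grub

# The VFIO modules also need to be loaded at boot; append them to
# /etc/modules if they are not there yet (vfio_virqfd is only a
# separate module on older kernels, such as PVE 7's 5.15):
cat >> /etc/modules <<'EOF'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
EOF
```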

The PCI controller is by itself in its own IOMMU group (24). I'm at a loss and could use some help.
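(One way to double-check that isolation claim is to walk sysfs and list every device per IOMMU group. A small sketch; the helper function and its name are my own, and on a live host you would call it with /sys/kernel:)

```shell
# list_iommu_groups ROOT prints "group N: ADDR" for every PCI device
# under ROOT/iommu_groups; the HBA should be the only entry in its group.
list_iommu_groups() {
    local root=$1 dev group
    for dev in "$root"/iommu_groups/*/devices/*; do
        [ -e "$dev" ] || continue
        group=${dev#"$root"/iommu_groups/}   # e.g. "24/devices/0000:86:00.0"
        group=${group%%/*}
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done | sort -V
}
```

Usage on the host would be something like `list_iommu_groups /sys/kernel | grep 86:00`.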
 
Just to make sure: you don't have your boot disk on that controller too? (Because that can't work, of course.)

Can you post the journal, or the output of 'dmesg'?

What's your 'pveversion -v'?
 
Hey dcsapak, thanks for the response. Unless I'm crazy, I don't believe so. Attached is a screenshot of the drives I have mounted: sda–sdm are from the JBOD, and the boot partition is on sdw, on the server itself. I'm running PVE 7.4-1. Here is the output of dmesg; not sure how far back you want to go:


Code:
[   13.305742] audit: type=1400 audit(1716294521.921:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="swtpm" pid=1531 comm="apparmor_parser"
[   13.306036] audit: type=1400 audit(1716294521.921:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=1532 comm="apparmor_parser"
[   13.306397] audit: type=1400 audit(1716294521.921:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=1530 comm="apparmor_parser"
[   13.307351] audit: type=1400 audit(1716294521.921:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1526 comm="apparmor_parser"
[   13.307357] audit: type=1400 audit(1716294521.921:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1526 comm="apparmor_parser"
[   13.307761] audit: type=1400 audit(1716294521.921:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=1533 comm="apparmor_parser"
[   13.307766] audit: type=1400 audit(1716294521.921:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=1533 comm="apparmor_parser"
[   13.307769] audit: type=1400 audit(1716294521.921:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=1533 comm="apparmor_parser"
[   13.309652] audit: type=1400 audit(1716294521.925:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" pid=1528 comm="apparmor_parser"
[   13.312493] audit: type=1400 audit(1716294521.925:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=1529 comm="apparmor_parser"
[   13.338553] softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
[   13.338560] softdog:              soft_reboot_cmd=<not set> soft_active_on_boot=0
[   13.714295] vmbr0: port 1(eno1) entered blocking state
[   13.714301] vmbr0: port 1(eno1) entered disabled state
[   13.714367] device eno1 entered promiscuous mode
[   15.401009] igb 0000:07:00.0 eno1: igb: eno1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[   15.401168] vmbr0: port 1(eno1) entered blocking state
[   15.401172] vmbr0: port 1(eno1) entered forwarding state
[   15.401359] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
[   16.081315] bpfilter: Loaded bpfilter_umh pid 1877
[   16.081582] Started bpfilter
[   26.304466] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
[   65.318382] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   65.318416] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[   65.318417] {1}[Hardware Error]: event severity: corrected
[   65.318419] {1}[Hardware Error]:  Error 0, type: corrected
[   65.318420] {1}[Hardware Error]:  fru_text: CorrectedErr
[   65.318421] {1}[Hardware Error]:   section_type: PCIe error
[   65.318422] {1}[Hardware Error]:   port_type: 0, PCIe end point
[   65.318423] {1}[Hardware Error]:   version: 0.0
[   65.318424] {1}[Hardware Error]:   command: 0xffff, status: 0xffff
[   65.318425] {1}[Hardware Error]:   device_id: 0000:80:02.3
[   65.318426] {1}[Hardware Error]:   slot: 0
[   65.318427] {1}[Hardware Error]:   secondary_bus: 0x00
[   65.318428] {1}[Hardware Error]:   vendor_id: 0xffff, device_id: 0xffff
[   65.318429] {1}[Hardware Error]:   class_code: ffffff
[  109.345339] sd 0:1:31:0: [sdl] tag#791 Sense Key : Recovered Error [current]
[  109.345365] sd 0:1:31:0: [sdl] tag#791 Add. Sense: Defect list not found

I should also specify that I have another RAID controller of a similar make and model that connects to the hard drives in the server itself. I have verified the device IDs, though, and can confirm I chose the right PCI controller to pass through, because when I passed the other one through it broke Proxmox functionality as well. Curiously, it doesn't crash the system nearly as badly as the JBOD controller does.
 

Attachments

  • Screenshot 2024-05-21 at 8.31.11 AM.png (215 KB)
What do you mean by that? What exactly crashes / does not work?
I can still navigate around in Proxmox, but the CLI in the shell is broken, and some items can't load properly. However, when I pass the JBOD's PCI controller through, everything freezes in a loading state, and if I refresh the browser it can't find Proxmox anymore.
 
OK, do they both use the same driver? (You can see that in lspci -nnk.) Maybe you'd have to blacklist the driver, or preload the vfio-pci driver for that card so the host driver never gets loaded and the host never accesses the drives.
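(A sketch of the preload approach. Note the ids= match is by vendor:device, and the two Adaptec cards in this thread share the same ID, so this would claim BOTH controllers; it is only safe when the host boots from neither of them:)

```shell
# Bind the Adaptec controller(s) to vfio-pci before aacraid can claim them.
# 9005:028d is the vendor:device ID from the lspci output in this thread;
# both cards share it, so this grabs both.
cat > /etc/modprobe.d/vfio.conf <<'EOF'
options vfio-pci ids=9005:028d
softdep aacraid pre: vfio-pci
EOF
update-initramfs -u -k all
# reboot, then verify the driver in use with: lspci -nnk -s 86:00.0
```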
 
Here is the output from lspci -nnk:


Code:
01:00.0 RAID bus controller [0104]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
        Subsystem: Adaptec Series 8 - ASR-8885 - 8 internal 8 external 12G SAS Port/PCIe 3.0 [9005:0554]
        Kernel driver in use: aacraid
        Kernel modules: aacraid

86:00.0 RAID bus controller [0104]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
        Subsystem: Adaptec Series 8 12G SAS/PCIe 3 [9005:0557]
        Kernel driver in use: aacraid
        Kernel modules: aacraid

It would appear that they're indeed using the same driver
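(For scripting a vfio-pci ids= entry, the vendor:device ID can be pulled out of that output. A small sketch, using a line pasted above in place of a live `lspci -nnk` call:)

```shell
# Extract the vendor:device ID -- the bracketed pair containing a colon --
# from an lspci -nnk line; the class code [0104] has no colon and is skipped.
line='86:00.0 RAID bus controller [0104]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)'
id=$(printf '%s\n' "$line" | grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}\]' | tr -d '[]')
echo "$id"   # -> 9005:028d
```

Since both cards share 9005:028d, an ids= match can't separate them; per-device binding (e.g. with the driverctl tool: `driverctl set-override 0000:86:00.0 vfio-pci`) is the usual alternative when only one card should go to the VM.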
 
I was able to change the driver for both RAID cards, and the system still crashes, forcing me to do a hard reset on the server. Maybe there's something at the BIOS level?
 
Yes, well, that can be the case... You could try switching the PCIe slots of the two cards? Maybe that makes a difference.
 
