JBOD HBA pci passthrough crashes host

trimbljk

New Member
May 3, 2024
I have an older server, a Supermicro X9DRX+-F motherboard, that I'm running Proxmox on. I have a similarly old JBOD that is connected via an Adaptec 8405 RAID controller. I'm attempting to pass the JBOD through to a TrueNAS VM on the server using PCI passthrough, but when I start the VM it crashes the whole system.

I have added the requisite options to my GRUB file, GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt", and changed my RAID controller to HBA mode. I can add a hard disk directly to TrueNAS, but I can't pass the entire PCI device to the VM without it crashing the system.
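For reference, the full sequence for that GRUB change usually looks something like the sketch below (assuming a GRUB-booted Intel host like this one; run as root and reboot afterwards):

Code:
# /etc/default/grub -- enable the IOMMU for PCI passthrough
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# apply the change to the boot config, then reboot
update-grub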

The PCI controller is by itself in its own IOMMU group (24). I'm at a loss and could use some help.
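A quick, generic way to double-check that the IOMMU is active and that the controller really is alone in group 24 (a sketch using standard sysfs paths, nothing board-specific):

Code:
# confirm the IOMMU is enabled
dmesg | grep -i -e DMAR -e IOMMU

# list every PCI device per IOMMU group
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done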
 
Just to make sure: you don't have your boot disk also on that controller? (Because that can't work, of course.)

Can you post the journal, or the output of 'dmesg'?

What's your 'pveversion -v'?
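The commands below are a rough sketch for collecting the requested info and for checking which controller the boot disk actually hangs off (sdw is just an example device name here; substitute the actual boot disk):

Code:
pveversion -v
journalctl -b > journal.txt
dmesg > dmesg.txt

# the PCI address embedded in this path identifies the boot disk's controller
readlink -f /sys/block/sdw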
 
Hey dcsapak, thanks for the response. Unless I'm crazy, I don't believe so. Attached is a photo of the drives I have mounted: sda through sdm are from the JBOD, and the boot partition is sdw, on the server itself. I'm running PVE 7.4.1. Here is the output of dmesg; not sure how far back you want to go:


Code:
[   13.305742] audit: type=1400 audit(1716294521.921:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="swtpm" pid=1531 comm="apparmor_parser"
[   13.306036] audit: type=1400 audit(1716294521.921:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=1532 comm="apparmor_parser"
[   13.306397] audit: type=1400 audit(1716294521.921:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=1530 comm="apparmor_parser"
[   13.307351] audit: type=1400 audit(1716294521.921:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1526 comm="apparmor_parser"
[   13.307357] audit: type=1400 audit(1716294521.921:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1526 comm="apparmor_parser"
[   13.307761] audit: type=1400 audit(1716294521.921:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=1533 comm="apparmor_parser"
[   13.307766] audit: type=1400 audit(1716294521.921:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=1533 comm="apparmor_parser"
[   13.307769] audit: type=1400 audit(1716294521.921:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=1533 comm="apparmor_parser"
[   13.309652] audit: type=1400 audit(1716294521.925:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" pid=1528 comm="apparmor_parser"
[   13.312493] audit: type=1400 audit(1716294521.925:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=1529 comm="apparmor_parser"
[   13.338553] softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
[   13.338560] softdog:              soft_reboot_cmd=<not set> soft_active_on_boot=0
[   13.714295] vmbr0: port 1(eno1) entered blocking state
[   13.714301] vmbr0: port 1(eno1) entered disabled state
[   13.714367] device eno1 entered promiscuous mode
[   15.401009] igb 0000:07:00.0 eno1: igb: eno1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[   15.401168] vmbr0: port 1(eno1) entered blocking state
[   15.401172] vmbr0: port 1(eno1) entered forwarding state
[   15.401359] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
[   16.081315] bpfilter: Loaded bpfilter_umh pid 1877
[   16.081582] Started bpfilter
[   26.304466] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
[   65.318382] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   65.318416] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[   65.318417] {1}[Hardware Error]: event severity: corrected
[   65.318419] {1}[Hardware Error]:  Error 0, type: corrected
[   65.318420] {1}[Hardware Error]:  fru_text: CorrectedErr
[   65.318421] {1}[Hardware Error]:   section_type: PCIe error
[   65.318422] {1}[Hardware Error]:   port_type: 0, PCIe end point
[   65.318423] {1}[Hardware Error]:   version: 0.0
[   65.318424] {1}[Hardware Error]:   command: 0xffff, status: 0xffff
[   65.318425] {1}[Hardware Error]:   device_id: 0000:80:02.3
[   65.318426] {1}[Hardware Error]:   slot: 0
[   65.318427] {1}[Hardware Error]:   secondary_bus: 0x00
[   65.318428] {1}[Hardware Error]:   vendor_id: 0xffff, device_id: 0xffff
[   65.318429] {1}[Hardware Error]:   class_code: ffffff
[  109.345339] sd 0:1:31:0: [sdl] tag#791 Sense Key : Recovered Error [current]
[  109.345365] sd 0:1:31:0: [sdl] tag#791 Add. Sense: Defect list not found
root@pve:~#

I should also specify that I have another RAID controller, similar make and model, that connects to the hard drives in the server itself. I have verified the device IDs, though, and can confirm I chose the right PCI controller to pass through, because when I passed the other one through it broke Proxmox functionality as well. Curiously, it doesn't crash the system nearly as badly as the JBOD controller does.
 

Attachments

  • Screenshot 2024-05-21 at 8.31.11 AM.png (215 KB)
What do you mean by that? What exactly crashes or does not work?
I can still navigate around in Proxmox, but the CLI in the shell is broken and some items can't load properly. However, when I pass through the JBOD's PCI controller, everything freezes and sits in a loading state. If I refresh the browser, it can't find Proxmox anymore.
 
OK, do they both use the same driver (you can see that with lspci -nnk)? Maybe you'd have to blacklist the driver, or preload the vfio-pci driver for that card so it does not get loaded and the host never accesses the drives.
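For reference, those two options typically look something like the sketch below on Proxmox. The driver name (aacraid) and vendor:device ID (9005:028d) are the ones that show up later in this thread; treat them as placeholders until confirmed with lspci -nnk:

Code:
# option A: blacklist the RAID driver so the host never binds the card
#           (note: this affects every card using that driver)
echo "blacklist aacraid" > /etc/modprobe.d/blacklist-aacraid.conf

# option B: have vfio-pci claim the card before the normal driver loads
echo "options vfio-pci ids=9005:028d" > /etc/modprobe.d/vfio.conf
echo "softdep aacraid pre: vfio-pci" >> /etc/modprobe.d/vfio.conf

# rebuild the initramfs so the change applies at early boot, then reboot
update-initramfs -u -k all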
 
OK, do they both use the same driver (you can see that with lspci -nnk)? Maybe you'd have to blacklist the driver, or preload the vfio-pci driver for that card so it does not get loaded and the host never accesses the drives.
Here is the output from lspci -nnk:


Code:
01:00.0 RAID bus controller [0104]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
        Subsystem: Adaptec Series 8 - ASR-8885 - 8 internal 8 external 12G SAS Port/PCIe 3.0 [9005:0554]
        Kernel driver in use: aacraid
        Kernel modules: aacraid

86:00.0 RAID bus controller [0104]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
        Subsystem: Adaptec Series 8 12G SAS/PCIe 3 [9005:0557]
        Kernel driver in use: aacraid
        Kernel modules: aacraid

It would appear that they're indeed using the same driver.
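One wrinkle with the output above: both controllers report the same vendor:device ID (9005:028d), so an "options vfio-pci ids=9005:028d" line (or blacklisting aacraid outright) would detach both cards from the host, not just the JBOD one. A common workaround is to pin only the JBOD card by PCI address with driver_override; the sketch below assumes 86:00.0 is the JBOD controller and would need to run early in boot (e.g. from an initramfs hook or a systemd unit), before anything on the host uses those drives:

Code:
#!/bin/sh
# bind only the JBOD controller to vfio-pci, leaving the other
# identical card on aacraid for the host
DEV=0000:86:00.0
modprobe vfio-pci
echo vfio-pci > /sys/bus/pci/devices/$DEV/driver_override
# unbind from aacraid if it already grabbed the card
if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
    echo $DEV > /sys/bus/pci/devices/$DEV/driver/unbind
fi
echo $DEV > /sys/bus/pci/drivers_probe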
 
I was able to change the driver for both RAID cards, and the system still crashes, forcing me to do a hard reset on the server. Maybe there's something at the BIOS level?
 
I was able to change the driver for both RAID cards, and the system still crashes, forcing me to do a hard reset on the server. Maybe there's something at the BIOS level?
Yes, well, that can be the case... You could try switching the PCIe slots of the two cards? Maybe that makes a difference.
 
Just to make sure: you don't have your boot disk also on that controller? (Because that can't work, of course.)

Can you post the journal, or the output of 'dmesg'?

What's your 'pveversion -v'?
I'm very new here, but what would be the problem with having boot and storage on the same controller? I am planning on doing this... I've seen many posts where users have boot, VMs, and storage all on the same drive (let alone the same controller), just partitioned. I am using separate drives for boot, but everything will be connected via the same RAID card/HBA, and all drives connect to the same backplane. Is there something I'm missing?
 
I'm very new here, but what would be the problem with having boot and storage on the same controller? I am planning on doing this... I've seen many posts where users have boot, VMs, and storage all on the same drive (let alone the same controller), just partitioned. I am using separate drives for boot, but everything will be connected via the same RAID card/HBA, and all drives connect to the same backplane. Is there something I'm missing?
If you want to pass through a card like an HBA, you cannot use anything on it on the host while it's passed through.
 
If you want to pass through a card like an HBA, you cannot use anything on it on the host while it's passed through.

Great. So if I had a server where the only possible boot sources were the front drive bays, with that entire backplane connected to an HBA, I wouldn't be able to use two bays for mirrored boot devices and the others for storage?

How does this differ from an onboard SATA controller where both boot and storage are connected to the same controller?

This presents a huge problem for me. Perhaps I should make a new thread on this? My Cisco server has no boot options other than the drive bays, which all sit on one SAS backplane routed to a hardware RAID card/HBA.

Any thoughts on this? Thank you for responding BTW.

This is a response from an AI web search:

Proxmox VE supports booting from and storing data on the same HBA (Host Bus Adapter) controller.

EDIT: Let me be clear, I am NOT using or virtualizing TrueNAS. I have a separate NAS appliance. This server is just for Proxmox and playing with VMs. Maybe that's where I got confused, as this thread references TrueNAS in the OP. Sorry for hijacking the thread, OP and other readers.
 