PCI Passthrough of SAS card causes complete host crash

IFN

Nov 8, 2022
Hi all:

I have a Proxmox VE 8.2.4 server with everything needed for PCI passthrough enabled (IOMMU, etc.). I am already using vGPU successfully with several VMs, and all is good.

I have a SAS card:

5e:00.0 Serial Attached SCSI controller: ATTO Technology, Inc. ExpressSAS 6Gb/s SAS/SATA HBA

that I want to pass through to a Debian VM for use with a tape changer for backups. Please note that the ONLY thing attached to this SAS card is the tape changer; there are no disks, and definitely no system or VM storage disks in use by Proxmox.

I've added the PCI device to the VM, but when I power on the VM, the entire Proxmox host resets immediately. No warning, no crash messages... It's as if someone pressed the reset button. Fortunately the VM is NOT set to start on boot, so once the host finishes booting, all the original VMs are online again.

I have blacklisted the original driver, and the card now shows vfio-pci as the kernel driver in use:

Code:
5e:00.0 Serial Attached SCSI controller: ATTO Technology, Inc. ExpressSAS 6Gb/s SAS/SATA HBA
	Subsystem: ATTO Technology, Inc. ExpressSAS H644
	Flags: bus master, fast devsel, latency 0, IRQ 255, NUMA node 0, IOMMU group 1
	Memory at b8800000 (64-bit, non-prefetchable) [size=64K]
	Memory at b8810000 (64-bit, non-prefetchable) [size=64K]
	Memory at b8830000 (32-bit, non-prefetchable) [size=64K]
	Memory at b8820000 (32-bit, non-prefetchable) [size=64K]
	Expansion ROM at <ignored> [disabled]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [ac] MSI-X: Enable- Count=16 Masked-
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: vfio-pci
	Kernel modules: pm80xx

And it appears the blacklisting is working, as the /dev/sg3 that normally comes up from the card is not present.
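For anyone who wants to repeat this check without relying on /dev/sg3 disappearing, the driver binding can also be read straight from sysfs. This is only a minimal sketch: the function name `bound_driver` is made up for illustration, and it assumes the card sits in PCI domain 0000 (so the full address is 0000:5e:00.0) and that sysfs is mounted at the usual /sys location.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    sysfs exposes the binding as a 'driver' symlink inside the device
    directory; the last path component of the link target is the driver name.
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or the device does not exist
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Assumption: PCI domain 0000; adjust the address for other systems.
    print(bound_driver("0000:5e:00.0"))
```

On my host I would expect this to report vfio-pci, matching the "Kernel driver in use" line above. It only confirms the binding; it says nothing about why the host resets.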

I also verified that the card has its own IOMMU group (group 1, of all groups!) and that no other devices share that group. I'm dumbfounded that simply starting this VM with the PCI card attached can cause a total hardware reset like that. If it matters, this is a Dell EMC PowerEdge R740xd with BIOS 2.21.2 (which should be the latest version). It was purchased from one of those online server refurbishers...
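The IOMMU group check can be sketched as a small script that lists the members of a group from sysfs. The helper names `iommu_group_members` and `is_isolated` are hypothetical, and the address again assumes PCI domain 0000; the script simply reads the directory /sys/kernel/iommu_groups/<group>/devices.

```python
import os


def iommu_group_members(group, root="/sys/kernel/iommu_groups"):
    """Return the sorted PCI addresses of all devices in an IOMMU group."""
    devices_dir = os.path.join(root, str(group), "devices")
    if not os.path.isdir(devices_dir):
        return []  # group does not exist, or IOMMU is not enabled
    return sorted(os.listdir(devices_dir))


def is_isolated(pci_addr, group, root="/sys/kernel/iommu_groups"):
    """True if the given device is the only member of the group."""
    return iommu_group_members(group, root) == [pci_addr]


if __name__ == "__main__":
    # Assumption: the SAS card is 0000:5e:00.0 and lives in group 1.
    print(iommu_group_members(1))
    print(is_isolated("0000:5e:00.0", 1))
```

If the second line prints True, the card is alone in its group, which is what I saw.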

I also noticed in the iDRAC that the host and the iDRAC appear to know what is installed in each PCI slot. If a device provides temperature monitoring, the server reads its temperature and regulates the SYSTEM fans to keep the PCI cards cool enough. One of the cards in the system is an NVIDIA Tesla T4, which has no fan of its own, so the system fans speed up whenever it is under load.

I wonder if there's some interplay between the iDRAC/whatever monitoring the cards and the way Proxmox takes over a card for PCI Passthrough that causes the iDRAC to reset the system?

In any case, I'd greatly appreciate any assistance in getting this to work! Proxmox is great; this is the first true head-scratcher I've hit that I haven't been able to solve so far.
 
