Problems with GPU Passthrough since 8.2

Thanks, but there is no fix, since it is already broken in the vanilla Linux kernel 6.8. I just wanted to provide you with a work-around in case your problem was the same (which it still might be).
Oh my bad.... Not so much a fix, but help to get back up and running ;-)

Thank you for your help!! :)
 
I have contacted Supermicro support; I suggest everyone with a similar board do the same. They say they do not support Proxmox, but then again, looking at the reserved-regions output above, that bug has been there for earlier kernels too. It just never caused a problem because there was no enforcement around it, I guess.
No more than they support Nutanix (they do, but not directly). When reporting Proxmox VE issues, always refer to Debian in your support requests instead, and be prepared to do a base install of KVM to test everything. This is why I no longer use SMC in my enterprise networks; they are very choosy about what they support on any given Tuesday compared to a Thursday.
 
I signed up to report that I also updated to 8.2, AMD 6600XT passthrough broke, and I couldn't fix it!

I've rolled back to 8.1 (https://enterprise.proxmox.com/iso/), haven't done passthrough just yet, still reading up and researching. But my system was running fine until I updated.

Intel 14th Gen i5 CPU & MSI Z790 Edge TI MoBo.
I'll set up a testing rig (I'm on a 7002 Epyc and a Zen3 5800X3D with a 6600X) by next weekend. But since you seem to be able to test this, do you mind rolling back to kernels 6.6 and 6.7 and seeing if IOMMU/VFIO passthrough works? I think Linux started reworking IOMMU SVA with 6.8, which could be why it's failing on that kernel for AMD systems. There is a patch in 6.8.7 addressing SVA for some of the Qualcomm SoCs, so this could be entirely related, since IOMMU SVA is a fairly new feature.
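For anyone testing older kernels, a quick sanity check that the IOMMU actually came up on a given boot (generic commands, nothing Proxmox-specific assumed):

Code:
# confirm AMD-Vi / Intel VT-d initialised on this kernel
dmesg | grep -i -e DMAR -e IOMMU -e AMD-Vi

# list the IOMMU groups the devices ended up in
find /sys/kernel/iommu_groups/ -type l

If the iommu_groups directory is empty, passthrough will fail regardless of the kernel version.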

6.7 patch to address IOMMU SVA issues on 6.6 - https://lore.kernel.org/lkml/ZUkXojmVf2CmkXHh@8bytes.org/

Phoronix's coverage of this - https://www.phoronix.com/news/AMD-IOMMU-SVA-Nears

----

IOMMU Updates for Linux v6.7

Including:

- Core changes:
  - Make default-domains mandatory for all IOMMU drivers
  - Remove group refcounting
  - Add generic_single_device_group() helper and consolidate drivers
  - Cleanup map/unmap ops
  - Scaling improvements for the IOVA rcache depot
  - Convert dart & iommufd to the new domain_alloc_paging()

- ARM-SMMU:
  - Device-tree binding update:
    - Add qcom,sm7150-smmu-v2 for Adreno on SM7150 SoC
  - SMMUv2:
    - Support for Qualcomm SDM670 (MDSS) and SM7150 SoCs
  - SMMUv3:
    - Large refactoring of the context descriptor code to move the CD table
      into the master, paving the way for '->set_dev_pasid()' support on
      non-SVA domains
    - Minor cleanups to the SVA code

- Intel VT-d:
  - Enable debugfs to dump domain attached to a pasid
  - Remove an unnecessary inline function.

- AMD IOMMU:
  - Initial patches for SVA support (not complete yet)

- S390 IOMMU:
  - DMA-API conversion and optimized IOTLB flushing

- Some smaller fixes and improvements
 
I think this issue has been around for a while; I've encountered it since 8.1, and you can find a bunch of posts about the same issue. Unfortunately, I haven't found a solution that works.
 
The problem with PCI passthrough still exists in the 6.8.8.1 kernel.

Setup: AMD Ryzen 9 + RTX 1030 + NVMe passthrough to a Win11 VM

The problem is with the NVMe disk, not the GPU - solved by pinning the 6.8.4.2 kernel.

My logs are below (02:00.0 is the NVMe disk):

Code:
proxmox kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
proxmox systemd[1]: 300.scope: Deactivated successfully.

Code:
swtpm_setup: Not overwriting existing state file.
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
stopping swtpm instance (pid 16941) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1


Nothing helps - I've already checked the usual suggestions:

cmdline (applied as sketched below):
  • pcie_aspm=off
  • pcie_port_pm=off
  • pcie_acs_override=override,multifunction
  • ...
BIOS:
  • turn off Resizable BAR
  • enable D3 cold states
  • ...
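For completeness, this is roughly how those cmdline changes get applied on Proxmox VE, depending on whether the host boots via GRUB or systemd-boot (paths as in the Proxmox reference documentation; the parameters are just examples from the list above):

Code:
# GRUB-booted hosts: add the parameters to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off pcie_port_pm=off"
update-grub

# systemd-boot (UEFI + ZFS) hosts: append them to /etc/kernel/cmdline, then
proxmox-boot-tool refresh

# verify what the running kernel actually booted with
cat /proc/cmdline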
 
I have an AMD CPU. My work-around was to blacklist amdgpu, as the RX570 works fine in a VM if Proxmox does not touch it.
Unfortunately, I'm not trying to pass through a GPU but an HBA card. Still, using 6.5.13-5-pve temporarily fixes the problem.
 
Unfortunately, I'm not trying to pass through a GPU but an HBA card. Still, using 6.5.13-5-pve temporarily fixes the problem.
Maybe check whether the cause of your problem is also Proxmox loading the driver for the HBA card. If so, use the same work-around of blacklisting the driver (or early binding to vfio-pci with a softdep).
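A minimal sketch of that work-around, assuming for illustration that the HBA uses the mpt3sas driver and reports vendor:device ID 1000:0097 (both placeholders - substitute whatever lspci shows for your card):

Code:
# /etc/modprobe.d/vfio.conf  (IDs and driver name are placeholders)
# claim the card with vfio-pci before the host driver can bind it
options vfio-pci ids=1000:0097
softdep mpt3sas pre: vfio-pci

followed by update-initramfs -u -k all and a reboot. The plain blacklist route is just a "blacklist mpt3sas" line in /etc/modprobe.d/blacklist.conf instead.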
 
Maybe check whether the cause of your problem is also Proxmox loading the driver for the HBA card. If so, use the same work-around of blacklisting the driver (or early binding to vfio-pci with a softdep).
Thank you, I'll give that a try. Do you have any hints on how to find out which driver is used by the HBA card?
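For reference, lspci shows both the driver currently bound to a device and the modules that could drive it (the output below is only an example; IDs and names will differ per card):

Code:
lspci -nnk
# 03:00.0 Serial Attached SCSI controller [0107]: ...   <- your HBA
#         Kernel driver in use: mpt3sas
#         Kernel modules: mpt3sas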
 
I think I've run into this issue in the last few weeks, passing a Samsung 990 NVMe through. Strangely, it happens to one more than the other. I couldn't get anything to work apart from reverting back to kernel 6.5.13-5-pve.
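For reference, keeping the host on the known-good kernel across upgrades can be done with proxmox-boot-tool, assuming a PVE version recent enough to have the kernel pin subcommand (the version string below is just the one mentioned above):

Code:
# show the kernels that are still installed
proxmox-boot-tool kernel list

# pin the known-good one and refresh the boot entries
proxmox-boot-tool kernel pin 6.5.13-5-pve

# later, to return to booting the newest installed kernel
proxmox-boot-tool kernel unpin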

I tried all the same things as @KrisFromFuture tried.

I got the same errors (and sometimes no/misleading errors)
 
I think I've run into this issue in the last few weeks, passing a Samsung 990 NVMe through. Strangely, it happens to one more than the other. I couldn't get anything to work apart from reverting back to kernel 6.5.13-5-pve.

I pray that the issue gets resolved quickly by the Proxmox team because otherwise, we will be stuck on the old kernel forever :(
 
I also tried firmware updates & kernel params...

Code:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

As I'd seen quite a few forum posts about NVMe sleep states causing problems.

This took me down a rabbit hole of bad errors and kernel panics. It seems like it was all related to the kernel though, as I am having the same issue as you. I reverted and it seems to be running.
 
I also tried firmware updates & kernel params...

Code:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

As I'd seen quite a few forum posts about NVMe sleep states causing problems.

This took me down a rabbit hole of bad errors and kernel panics. It seems like it was all related to the kernel though, as I am having the same issue as you. I reverted and it seems to be running.

In my opinion, there's no need to complicate things because the cmdline is just a workaround, not a solution to the problem.


We are waiting for a stable kernel that will fix the passing of NVMe disks to VMs.
 
