Problems with GPU Passthrough since 8.2

RadRom · May 28, 2024

leesteken said:
Thanks but there is no fix since it is already broken in vanilla Linux kernel 6.8. I just wanted to provide you with a work-around in case your problem was the same (which it still might be).

Oh my bad.... Not so much a fix, but help to get back up and running ;-)

Thankyou for your help!!

_--James--_ · May 28, 2024

athurdent said:
I have contacted Supermicro support, I suggest everyone with a similar board does the same. They say they do not support Proxmox, but then again, looking at the above reserved regions output, that bug has been there for earlier kernels, too. I just never caused a problem because there was not enforcement around it, I guess.

No more then they support Nutanix (they do, but not directly). When citing PromoxVE issues, always call out Debian in your support requests instead, and prepare to do a base install of KVM to test all the things. This is why I no longer use SMC in my enterprise networks, they are very choosy on what they support on any given Tuesday when compared to a Thursday.

_--James--_ · May 28, 2024

RadRom said:
I signed up to also report I updated to 8.2 and AMD 6600XT passthrough broke and I couldn't fix!

I've rolled back to 8.1 (https://enterprise.proxmox.com/iso/), haven't done passthrough just yet, still reading up and researching. But my system was running fine until I updated.

Intel 14th Gen i5 CPU & MSI Z790 Edge TI MoBo.

Ill setup a testing rig (I'm on 7002 Epyc and a Zen3 5800X3d with a 6600X) by next weekend. But since you seem to be able to test this, do you mind rolling kernel 6.6 and 6.7 and seeing if the IOMMU/VFIO passthrough is working? I think Linux updated for IOMMU SVA starting with 6.8, which could be why its failing on that kernel for AMD systems. There is a patch on 6.8.7 addressing SVA for some of the Qualcomm SOCs, so this could be entirely related since IOMMU SVA is a faily new feature.

6.7 patch to address IOMMU SVA issues on 6.6 - https://lore.kernel.org/lkml/ZUkXojmVf2CmkXHh@8bytes.org/

Phoronix's coverage on this a bit - https://www.phoronix.com/news/AMD-IOMMU-SVA-Nears

----
----

IOMMU Updates for Linux v6.7

Including:

- Core changes:
- Make default-domains mandatory for all IOMMU drivers
- Remove group refcounting
- Add generic_single_device_group() helper and consolidate
drivers
- Cleanup map/unmap ops
- Scaling improvements for the IOVA rcache depot
- Convert dart & iommufd to the new domain_alloc_paging()

- ARM-SMMU:
- Device-tree binding update:
- Add qcom,sm7150-smmu-v2 for Adreno on SM7150 SoC
- SMMUv2:
- Support for Qualcomm SDM670 (MDSS) and SM7150 SoCs
- SMMUv3:
- Large refactoring of the context descriptor code to
move the CD table into the master, paving the way
for '->set_dev_pasid()' support on non-SVA domains
- Minor cleanups to the SVA code

- Intel VT-d:
- Enable debugfs to dump domain attached to a pasid
- Remove an unnecessary inline function.

- AMD IOMMU:
- Initial patches for SVA support (not complete yet)

- S390 IOMMU:
- DMA-API conversion and optimized IOTLB flushing

- Some smaller fixes and improvements

womanbeatmania · May 28, 2024

confirm the same issue
/hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed
in
6.8.4-3-pve

ness1602 · May 28, 2024

I also had the same error with 8700GE gpu passthrough, didn't find the solution.

Ryohka233 · Jun 6, 2024

I think this issue has been around for a while, I encountered it since 8.1 and you can find a bunch of posts about the same issue. Unfortunately I haven't found a solution works.

Ramalama · Jun 6, 2024

djkay2637 said:
I have now resolved the issue with my DL380p by following this guide.

I hope other people get a resolution too.

awesome, thanks, relaxed cmdline solved it here either

KrisFromFuture · Jun 23, 2024

problem with PCI passthrough still exist int 6.8.8.1 kernel

setup: AMD Ryzen 9 + RTX 1030 + NVME passthrough to Win11 VM

problem is about NVME disk, not GPU - solved by pinned 6.8.4.2 kernel

my logs bellow (2:00.0 is NVME disk)

Code:

proxmox kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
proxmox systemd[1]: 300.scope: Deactivated successfully.

Code:

swtpm_setup: Not overwriting existing state file.
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
stopping swtpm instance (pid 16941) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

nothing helps - already checked suggestions:

cmdline:

pcie_aspm=off
pcie_port_pm=off
pcie_acs_override=override,multifunction
...

bios:

turn off resize bar in BIOS
enable D3 cold states
....

mldy · Jun 24, 2024

Ramalama said:
awesome, thanks, relaxed cmdline solved it here either

Did anyone have success with an amd cpu? I think intel_iommu=on,relax_rmrr will be ignored.

leesteken · Jun 24, 2024

mldy said:
Did anyone have success with an amd cpu? I think intel_iommu=on,relax_rmrr will be ignored.

I have and AMD CPU. My work-around was to blacklist amdgpu, as the RX570 works fine in a VM if Proxmox does not touch it.

mldy · Jun 24, 2024

leesteken said:
I have and AMD CPU. My work-around was to blacklist amdgpu, as the RX570 works fine in a VM if Proxmox does not touch it.

Unfortunately, I'm not trying to passthrough a GPU but a HBA-Card.. Still, using 6.5.13-5-pve temporarily fixes the problem.

leesteken · Jun 24, 2024

mldy said:
Unfortunately, I'm not trying to passthrough a GPU but a HBA-Card.. Still, using 6.5.13-5-pve temporarily fixes the problem.

Maybe check if your cause of problem is also Proxmox loading the driver of the HBA-Card. If so, use the same work-around of blacklisting the driver (or early binding to vfio-pci with a softdep).

mldy · Jun 24, 2024

leesteken said:
Maybe check if your cause of problem is also Proxmox loading the driver of the HBA-Card. If so, use the same work-around of blacklisting the driver (or early binding to vfio-pci with a softdep).

Thank you, I'll give that a try. Do you have any hints on how to find out which driver is used by the hba-card?

leesteken · Jun 24, 2024

mldy said:
Thank you, I'll give that a try. Do you have any hints on how to find out which driver is used by the hba-card?

Look at the output of lspci -k (or if you know the PCI ID from the VM settings for example, use lspci -ks THE_PCI_ID).

Giggling3999 · Jul 15, 2024

I think I've run into this issue the last few weeks, passing a samsung 990 nvme thorough. Strangely it happens to one more than the other. Couldn't get anything to work apart from reverting back kernel > 6.5.13-5-pve.

I tried all the same things as @KrisFromFuture tried.

I got the same errors (and sometimes no/misleading errors)

KrisFromFuture · Jul 15, 2024

Giggling3999 said:
I think I've run into this issue the last few weeks, passing a samsung 990 nvme thorough. Strangely it happens to one more than the other. Couldn't get anything to work apart from reverting back kernel > 6.5.13-5-pve

I pray that the issue gets resolved quickly by the Proxmox team because otherwise, we will be stuck on the old kernel forever

Giggling3999 · Jul 15, 2024

I also tried firmware updates & kernel param...

Code:

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off.

As I'd seen quite a few forum about nvme sleep states causing problems

This took me down a rabbit hole of bad errors and kernel panics. Seems like it was all related to kernel though as I am having the same as you. Reverted and seems to be running

KrisFromFuture · Jul 15, 2024

Giggling3999 said:
I also tried firmware updates & kernel param...

Code:

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off.

As I'd seen quite a few forum about nvme sleep states causing problems

This took me down a rabbit hole of bad errors and kernel panics. Seems like it was all related to kernel though as I am having the same as you. Reverted and seems to be running

In my opinion, there's no need to complicate things because the cmdline is just a workaround, not a solution to the problem.

We are waiting for a stable kernel that will fix the passing of NVME disks to VMs.

Giggling3999 · Jul 15, 2024

It didn't work anyway - Made things much worse.

You are right, i never had issues before upgrading

KrisFromFuture · Jul 15, 2024

maybe 6.9 kernel solve our problems

Problems with GPU Passthrough since 8.2

New Member

Member

Member

Member

Famous Member

New Member

Renowned Member

New Member

New Member

Distinguished Member

New Member

Distinguished Member

New Member

Distinguished Member

Member

New Member

Member

New Member

Member

New Member

We value your privacy