[SOLVED] Nvidia PCIe passthrough strange problem

mailo95

Hello all,

I have a strange problem with my Win10 VM with Nvidia GPU passthrough. At first I tried to disable the framebuffer for my only GPU, a GTX1060, so that I could run the system with a single GPU, but without success. After that I installed a GT710 I had lying around to act as the primary GPU (I need one for debugging purposes anyway).

I then successfully assigned the GTX1060 to my Win10 VM (without dumping the vBIOS and pointing to it in the VM config, otherwise I get error 43). The problem is that under heavy GPU load (for example in FurMark; I will upload a photo of the graph) both of my passed-through USB devices start to randomly disconnect and reconnect, and there is some stuttering. This happens only when the GPU is loaded with a heavy task. Sometimes the driver probably crashes as well, because my screen goes black for 1-2 seconds.

This GTX1060 was tested in another PC with Win10, not in a virtualized environment, and showed no problems like this. I don't have any fancy RGB keyboard or mouse. Before the GTX1060 I was using an R7 260x in the same VM, but its drivers were purged with the DDU tool. For GPU passthrough I followed these tutorials:
1 --> https://pve.proxmox.com/wiki/PCI_Passthrough
2 --> https://www.reddit.com/r/homelab/comments/b5xpua/the_ultimate_beginners_guide_to_gpu_passthrough/

The system specs are:
CPU: Intel Xeon E5-2660v2
MB: Asus ROG RAMPAGE IV EXTREME
RAM: 40GB DDR3
Storage: a bunch of SSDs and HDDs on SATA
GPU1: GT710 (primary)
GPU2: GTX1060 3GB
Proxmox ver. 7.4-3 using kernel 6.1.15-1-pve

Here are the relevant configs and log output:
/etc/default/grub (only the edited line)
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

dmesg | grep -e DMAR -e IOMMU
Code:
[    0.017396] ACPI: DMAR 0x00000000AC4C1688 0000AC (v01 A M I  OEMDMAR  00000001 INTL 00000001)
[    0.017410] ACPI: Reserving DMAR table memory at [mem 0xac4c1688-0xac4c1733]
[    0.172052] DMAR: IOMMU enabled
[    0.422054] DMAR: Host address width 46
[    0.422056] DMAR: DRHD base: 0x000000fbffc000 flags: 0x1
[    0.422066] DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
[    0.422069] DMAR: RMRR base: 0x000000ac389000 end: 0x000000ac396fff
[    0.422071] DMAR: ATSR flags: 0x0
[    0.422073] DMAR: RHSA base: 0x000000fbffc000 proximity domain: 0x0
[    0.422076] DMAR-IR: IOAPIC id 0 under DRHD base  0xfbffc000 IOMMU 0
[    0.422078] DMAR-IR: IOAPIC id 2 under DRHD base  0xfbffc000 IOMMU 0
[    0.422080] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.422468] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    0.872004] DMAR: No SATC found
[    0.872007] DMAR: dmar0: Using Queued invalidation
[    0.874071] DMAR: Intel(R) Virtualization Technology for Directed I/O

/etc/modules
Code:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

dmesg | grep 'remapping'
Code:
[    0.422080] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.422468] DMAR-IR: Enabled IRQ remapping in x2apic mode
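
The two dmesg checks above can be automated. The following is a minimal sketch, not something from the tutorials: the function name `iommu_status` is made up for illustration, and it only looks for the two Intel VT-d messages that appear in my output, so other kernels or AMD systems may print different text.

```python
def iommu_status(dmesg_text):
    """Report whether the two Intel VT-d messages seen above are present.

    Assumption: the kernel prints exactly these substrings. They are taken
    from the dmesg output in this post, not from a documented interface.
    """
    lines = dmesg_text.splitlines()
    return {
        "iommu_enabled": any("DMAR: IOMMU enabled" in line for line in lines),
        "irq_remapping": any("Enabled IRQ remapping" in line for line in lines),
    }


if __name__ == "__main__":
    # Two lines copied from the dmesg output in this post.
    sample = (
        "[    0.172052] DMAR: IOMMU enabled\n"
        "[    0.422468] DMAR-IR: Enabled IRQ remapping in x2apic mode\n"
    )
    print(iommu_status(sample))
```

On the host, the text could come from something like `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` instead of the embedded sample.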

/etc/modprobe.d/iommu_unsafe_interrupts.conf
Code:
options vfio_iommu_type1 allow_unsafe_interrupts=1

/etc/modprobe.d/kvm.conf
Code:
options kvm ignore_msrs=1 report_ignored_msrs=0

/etc/modprobe.d/vfio.conf
Code:
options vfio-pci ids=1002:67b1,1002:aac8 disable_vga=1

/etc/modprobe.d/blacklist.conf
Code:
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
blacklist radeon
blacklist amdgpu

lspci -vvv | grep -i "nvidia"
Code:
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1) (prog-if 00 [VGA controller])
        Kernel modules: nvidiafb, nouveau
01:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1) (prog-if 00 [VGA controller])
        Kernel modules: nvidiafb, nouveau
03:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)

Win10 VM config:
Code:
affinity: 4-10
agent: 1
args: -cpu 'host,-hypervisor,kvm=off,hv_vendor_id=intel'
balloon: 0
bios: ovmf
boot: order=scsi0;net0
cores: 6
cpu: host,flags=+pdpe1gb;+aes
cpulimit: 6
cpuunits: 200
efidisk0: local-lvm:vm-105-disk-1,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:03:00,pcie=1,x-vga=1
hotplug: disk,network,usb
machine: pc-q35-7.2
memory: 16384
meta: creation-qemu=7.0.0,ctime=1664434421
name: Win10WS
net0: virtio=BE:22:DB:25:05:58,bridge=vmbr1,tag=100
numa: 0
ostype: win10
scsi0: KingstonSV300:vm-105-disk-0,cache=writeback,discard=on,size=220G,ssd=1
scsi1: KingstonA400:vm-105-disk-0,cache=writeback,discard=on,size=220G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=00e766f9-1fe4-43a2-a5f7-5a4569d58dd3
sockets: 1
usb0: host=258a:0090
usb1: host=1ea7:0011
vcpus: 6
vga: none
vmgenid: 671c5ae6-f564-4edb-84c0-c439b6feb747
vmstatestorage: kingston
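
To make the config above easier to inspect, here is a small hedged sketch that parses the `key: value` lines and the comma-separated options of an entry such as `hostpci0`. The helper names (`parse_vm_config`, `parse_options`, `expand_cpuset`) are hypothetical and are not part of any Proxmox tool. The sketch assumes the simple line format shown above and does not handle quoted values such as the `args` line.

```python
def parse_vm_config(text):
    """Split 'key: value' lines into a dict (split on the first colon only)."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            config[key.strip()] = value.strip()
    return config


def parse_options(value):
    """Split 'a,b=1,c=2' into positional parts and named key=value options."""
    positional, named = [], {}
    for part in value.split(","):
        key, sep, val = part.partition("=")
        if sep:
            named[key] = val
        else:
            positional.append(part)
    return positional, named


def expand_cpuset(spec):
    """Expand a CPU list such as '0,5,8-11' into individual CPU numbers."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


if __name__ == "__main__":
    # A few lines copied from the VM config in this post.
    sample = "affinity: 4-10\ncores: 6\nhostpci0: 0000:03:00,pcie=1,x-vga=1\n"
    cfg = parse_vm_config(sample)
    print(parse_options(cfg["hostpci0"]))
    print(len(expand_cpuset(cfg["affinity"])), "host CPUs for", cfg["cores"], "cores")
```

With the values above, `affinity: 4-10` expands to seven host CPUs while the VM has six cores. I am only noting that as an observation, not as a known cause of the problem.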

find /sys/kernel/iommu_groups/ -type l
Code:
/sys/kernel/iommu_groups/17/devices/0000:00:1c.5
/sys/kernel/iommu_groups/45/devices/0000:ff:10.4
/sys/kernel/iommu_groups/35/devices/0000:ff:0f.0
/sys/kernel/iommu_groups/7/devices/0000:00:11.0
/sys/kernel/iommu_groups/25/devices/0000:08:00.0
/sys/kernel/iommu_groups/15/devices/0000:00:1c.3
/sys/kernel/iommu_groups/43/devices/0000:ff:10.2
/sys/kernel/iommu_groups/33/devices/0000:ff:0d.3
/sys/kernel/iommu_groups/33/devices/0000:ff:0d.1
/sys/kernel/iommu_groups/33/devices/0000:ff:0d.4
/sys/kernel/iommu_groups/33/devices/0000:ff:0d.2
/sys/kernel/iommu_groups/33/devices/0000:ff:0d.0
/sys/kernel/iommu_groups/5/devices/0000:00:05.2
/sys/kernel/iommu_groups/23/devices/0000:06:00.0
/sys/kernel/iommu_groups/13/devices/0000:00:1c.1
/sys/kernel/iommu_groups/41/devices/0000:ff:10.0
/sys/kernel/iommu_groups/31/devices/0000:ff:0b.0
/sys/kernel/iommu_groups/31/devices/0000:ff:0b.3
/sys/kernel/iommu_groups/3/devices/0000:00:03.0
/sys/kernel/iommu_groups/21/devices/0000:01:00.0
/sys/kernel/iommu_groups/21/devices/0000:01:00.1
/sys/kernel/iommu_groups/11/devices/0000:00:1b.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/48/devices/0000:ff:10.7
/sys/kernel/iommu_groups/38/devices/0000:ff:0f.3
/sys/kernel/iommu_groups/28/devices/0000:ff:08.0
/sys/kernel/iommu_groups/18/devices/0000:00:1d.0
/sys/kernel/iommu_groups/46/devices/0000:ff:10.5
/sys/kernel/iommu_groups/36/devices/0000:ff:0f.1
/sys/kernel/iommu_groups/8/devices/0000:00:16.0
/sys/kernel/iommu_groups/26/devices/0000:09:00.0
/sys/kernel/iommu_groups/16/devices/0000:00:1c.4
/sys/kernel/iommu_groups/44/devices/0000:ff:10.3
/sys/kernel/iommu_groups/34/devices/0000:ff:0e.1
/sys/kernel/iommu_groups/34/devices/0000:ff:0e.0
/sys/kernel/iommu_groups/6/devices/0000:00:05.4
/sys/kernel/iommu_groups/24/devices/0000:07:00.0
/sys/kernel/iommu_groups/14/devices/0000:00:1c.2
/sys/kernel/iommu_groups/42/devices/0000:ff:10.1
/sys/kernel/iommu_groups/32/devices/0000:ff:0c.4
/sys/kernel/iommu_groups/32/devices/0000:ff:0c.2
/sys/kernel/iommu_groups/32/devices/0000:ff:0c.0
/sys/kernel/iommu_groups/32/devices/0000:ff:0c.3
/sys/kernel/iommu_groups/32/devices/0000:ff:0c.1
/sys/kernel/iommu_groups/4/devices/0000:00:05.0
/sys/kernel/iommu_groups/22/devices/0000:03:00.0
/sys/kernel/iommu_groups/22/devices/0000:03:00.1
/sys/kernel/iommu_groups/50/devices/0000:ff:16.2
/sys/kernel/iommu_groups/50/devices/0000:ff:16.0
/sys/kernel/iommu_groups/50/devices/0000:ff:16.1
/sys/kernel/iommu_groups/12/devices/0000:00:1c.0
/sys/kernel/iommu_groups/40/devices/0000:ff:0f.5
/sys/kernel/iommu_groups/30/devices/0000:ff:0a.3
/sys/kernel/iommu_groups/30/devices/0000:ff:0a.1
/sys/kernel/iommu_groups/30/devices/0000:ff:0a.2
/sys/kernel/iommu_groups/30/devices/0000:ff:0a.0
/sys/kernel/iommu_groups/2/devices/0000:00:02.0
/sys/kernel/iommu_groups/20/devices/0000:00:1f.2
/sys/kernel/iommu_groups/20/devices/0000:00:1f.0
/sys/kernel/iommu_groups/20/devices/0000:00:1f.3
/sys/kernel/iommu_groups/49/devices/0000:ff:13.0
/sys/kernel/iommu_groups/49/devices/0000:ff:13.5
/sys/kernel/iommu_groups/49/devices/0000:ff:13.1
/sys/kernel/iommu_groups/49/devices/0000:ff:13.4
/sys/kernel/iommu_groups/10/devices/0000:00:1a.0
/sys/kernel/iommu_groups/39/devices/0000:ff:0f.4
/sys/kernel/iommu_groups/29/devices/0000:ff:09.0
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/19/devices/0000:00:1e.0
/sys/kernel/iommu_groups/47/devices/0000:ff:10.6
/sys/kernel/iommu_groups/37/devices/0000:ff:0f.2
/sys/kernel/iommu_groups/9/devices/0000:00:19.0
/sys/kernel/iommu_groups/27/devices/0000:0a:00.0
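
The raw `find` output is hard to read because the groups come out unordered. Below is a minimal sketch that groups the paths by IOMMU group number. The names `group_devices` and `group_of` are made up for illustration, and the only assumption is the `/sys/kernel/iommu_groups/<group>/devices/<address>` path layout seen in the listing above.

```python
import re
from collections import defaultdict

# Matches paths such as /sys/kernel/iommu_groups/22/devices/0000:03:00.0
PATH_RE = re.compile(r"/sys/kernel/iommu_groups/(\d+)/devices/(\S+)")


def group_devices(find_output):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = defaultdict(list)
    for line in find_output.splitlines():
        match = PATH_RE.search(line)
        if match:
            groups[int(match.group(1))].append(match.group(2))
    return {num: sorted(groups[num]) for num in sorted(groups)}


def group_of(groups, pci_prefix):
    """Return (group, devices) for the first group holding an address that
    starts with pci_prefix, or None if no group matches."""
    for num, devices in groups.items():
        if any(dev.startswith(pci_prefix) for dev in devices):
            return num, devices
    return None


if __name__ == "__main__":
    # Four lines copied from the listing in this post.
    sample = (
        "/sys/kernel/iommu_groups/21/devices/0000:01:00.0\n"
        "/sys/kernel/iommu_groups/21/devices/0000:01:00.1\n"
        "/sys/kernel/iommu_groups/22/devices/0000:03:00.0\n"
        "/sys/kernel/iommu_groups/22/devices/0000:03:00.1\n"
    )
    for num, devices in group_devices(sample).items():
        print(f"group {num:>2}: {' '.join(devices)}")
```

In the listing above, the GTX1060 functions (`0000:03:00.0` and `0000:03:00.1`) are the only devices in group 22, and the GT710 functions are the only devices in group 21.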

FurMark graph photo: (the drops in the graph are the moments when every USB device randomly disconnects and reconnects, the driver probably crashes, and I get stutter)
343997911_780210476709944_5465437013566910676_n (1).jpg

In addition, I'm using the latest Nvidia driver, the MB has the latest UEFI version (everything related to OCing is disabled), and Win10 has the latest security updates (I used the Chris Titus debloat tool to debloat it).

I searched the internet for this kind of problem and found three suggestions. Some forums suggest that it is the PSU, but I was logging GPU-Z metrics and the GPU core and 12V rails were absolutely stable. I also ran an R9 390X on this PSU without any issues. Other forums suggest that it is a driver problem, because a lot of people with RTX3xxx series cards have the same issue. The third suggestion was that on some Ryzen platforms the PCIe slot should be locked to version 3.0, but my board supports only versions 2 and 3 and is currently running in version 3 mode.

I would be grateful if someone could help me with this weird problem.

Thanks in advance!
 
