[SOLVED] GPU Passthrough Issues After Upgrade to 7.2

celemine1gig · May 13, 2022

Just FYI, the workaround worked perfectly for an Nvidia GTX 980 Ti. However, on my other system, where the Coffee Lake iGPU gets passed through, it did NOT work. I would guess that is because the GPU is an internal function there, instead of an external card.
When the workaround is applied there, it seems to be OK until you try to pass through the device. Then you get a KVM error message and the VM cannot be started.
Without the workaround, this error is not present, but instead you get the BAR error.

Kodey · May 14, 2022

nick.kopas said:

Only thing that worked for me was "sketchy workaround" offered by @StephenM64

Here's my hookscript:

Bash:

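#!/bin/bash

# Proxmox calls the hookscript with the VM ID as $1 and the phase as $2
if [ "$2" == "pre-start" ]
then
    echo "gpu-hookscript: Resetting GPU for Virtual Machine $1"
    # Remove the GPU from the PCI bus, then rescan so the kernel re-adds it
    echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
    echo 1 > /sys/bus/pci/rescan
fi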

Here's how I deployed it:

Bash:

#create snippets folder
mkdir -p /var/lib/vz/snippets

#create script with content above
nano /var/lib/vz/snippets/gpu-hookscript.sh

#make it executable
chmod +x /var/lib/vz/snippets/gpu-hookscript.sh

#apply script to VM
qm set 100 --hookscript local:snippets/gpu-hookscript.sh
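
For anyone adapting the hookscript above, here is a slightly more defensive sketch of the same remove-and-rescan idea. It is not the script from the quoted post: the function names, the GPU_ADDR variable and the SYSFS override (there only so the logic can be dry-run against a scratch directory) are my own assumptions. It skips the reset and returns non-zero if the device is not present.

```shell
#!/bin/bash
# Sketch only: GPU_ADDR and SYSFS are assumed names, adjust GPU_ADDR to your device.
GPU_ADDR="${GPU_ADDR:-0000:01:00.0}"
SYSFS="${SYSFS:-/sys}"   # overridable so the script can be tried against a scratch directory

reset_gpu() {
    local dev="$SYSFS/bus/pci/devices/$GPU_ADDR"
    if [ ! -e "$dev/remove" ]; then
        echo "gpu-hookscript: $GPU_ADDR not found, skipping reset" >&2
        return 1
    fi
    # Remove the device from the bus, then rescan so the kernel re-adds it
    echo 1 > "$dev/remove"
    echo 1 > "$SYSFS/bus/pci/rescan"
}

hook() {
    local vmid="$1" phase="$2"
    if [ "$phase" = "pre-start" ]; then
        echo "gpu-hookscript: Resetting GPU for Virtual Machine $vmid"
        reset_gpu
    fi
}

# Proxmox invokes the hookscript as: <script> <vmid> <phase>
if [ "$#" -ge 2 ]; then
    hook "$1" "$2"
fi
```

Note that a non-zero exit in the pre-start phase aborts the VM start, so with this variant a missing device stops the start instead of letting QEMU fail later.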

If I start the VM at boot, I get this issue:

Code:

()
swtpm_setup: Not overwriting existing state file.
kvm: unable to map backing store for guest RAM: Cannot allocate memory
stopping swtpm instance (pid 8898) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

It doesn't start up, and if I try to start it manually while it is in that state, the monitor is blank.

Kodey · May 14, 2022

Last startup went like this for me (unrelated network lines removed)

Code:

May 12 20:42:23 pmhost pvesh[7767]: Starting VM 101
May 12 20:42:23 pmhost pve-guests[7768]: <root@pam> starting task UPID:pmhost:00001F46:00000925:627CE48F:qmstart:101:root@pam:
May 12 20:42:23 pmhost pve-guests[8006]: start VM 101: UPID:pmhost:00001F46:00000925:627CE48F:qmstart:101:root@pam:
May 12 20:42:23 pmhost kernel: vfio-pci 0000:0f:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: Removing from iommu group 33
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: [1002:683f] type 00 class 0x030000
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: reg 0x10: [mem 0x7fe0000000-0x7fefffffff 64bit pref]
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: reg 0x18: [mem 0xfce00000-0xfce3ffff 64bit]
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: reg 0x20: [io  0xf000-0xf0ff]
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: reg 0x30: [mem 0xfce40000-0xfce5ffff pref]
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: supports D1 D2
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: PME# supported from D1 D2 D3hot
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: Adding to iommu group 33
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: BAR 0: assigned [mem 0x7fe0000000-0x7fefffffff 64bit pref]
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: BAR 2: assigned [mem 0xfce00000-0xfce3ffff 64bit]
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: BAR 6: assigned [mem 0xfce40000-0xfce5ffff pref]
May 12 20:42:23 pmhost kernel: pci 0000:0f:00.0: BAR 4: assigned [io  0xf000-0xf0ff]
May 12 20:42:23 pmhost kernel: vfio-pci 0000:0f:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
May 12 20:42:23 pmhost systemd[1]: Created slice qemu.slice.
May 12 20:42:23 pmhost systemd[1]: Started 101.scope.
May 12 20:42:27 pmhost pve-guests[8006]: start failed: QEMU exited with code 1
May 12 20:42:27 pmhost systemd[1]: 101.scope: Succeeded.
May 12 20:42:27 pmhost pvesh[7767]: Starting VM 101 failed: start failed: QEMU exited with code 1

After that, it started manually OK, but there was still a guest-ping timeout error:

Code:

May 12 20:55:34 pmhost pvedaemon[44311]: start VM 101: UPID:pmhost:0000AD17:00013E26:627CE7A6:qmstart:101:root@pam:
May 12 20:55:34 pmhost kernel: vfio-pci 0000:0f:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: Removing from iommu group 33
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: [1002:683f] type 00 class 0x030000
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: reg 0x10: [mem 0x7fe0000000-0x7fefffffff 64bit pref]
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: reg 0x18: [mem 0xfce00000-0xfce3ffff 64bit]
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: reg 0x20: [io  0xf000-0xf0ff]
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: reg 0x30: [mem 0xfce40000-0xfce5ffff pref]
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: supports D1 D2
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: PME# supported from D1 D2 D3hot
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: Adding to iommu group 33
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: BAR 0: assigned [mem 0x7fe0000000-0x7fefffffff 64bit pref]
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: BAR 2: assigned [mem 0xfce00000-0xfce3ffff 64bit]
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: BAR 6: assigned [mem 0xfce40000-0xfce5ffff pref]
May 12 20:55:34 pmhost kernel: pci 0000:0f:00.0: BAR 4: assigned [io  0xf000-0xf0ff]
May 12 20:55:34 pmhost kernel: vfio-pci 0000:0f:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
May 12 20:55:34 pmhost systemd[1]: Started 101.scope.
May 12 20:55:39 pmhost kernel: vfio-pci 0000:0f:00.0: enabling device (0002 -> 0003)
May 12 20:55:39 pmhost kernel: vfio-pci 0000:0f:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
May 12 20:55:39 pmhost kernel: vfio-pci 0000:0f:00.1: enabling device (0000 -> 0002)
May 12 20:55:40 pmhost pvedaemon[7658]: <root@pam> end task UPID:pmhost:0000AD17:00013E26:627CE7A6:qmstart:101:root@pam: OK
May 12 20:55:44 pmhost pvedaemon[7661]: VM 101 qmp command failed - VM 101 qmp command 'guest-ping' failed - got timeout

naftu · May 14, 2022

Hello,

I also have this issue after upgrading Proxmox. Before applying the script from above (the one that removes and rescans the PCIe device), I was getting Code 43 on my Nvidia 1650 Super GPU.
After using that script the Windows VM started working again, although it's not stable for me.

From time to time (a couple of times a day) the VM becomes unresponsive for around 30 seconds, then comes back online. I checked the Proxmox log and found the following entries from when that happened:

Code:

May 14 20:14:06 server kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:07:00.1
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.1: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.1:   device [10de:1aeb] error status/mask=00008000/00000000
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.1:    [15] HeaderOF             
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.1: AER:   Error of this Agent is reported first
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.2: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.2:   device [10de:1aec] error status/mask=00008000/00000000
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.2:    [15] HeaderOF             
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.3: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.3:   device [10de:1aed] error status/mask=00008000/00000000
May 14 20:14:06 server kernel: vfio-pci 0000:07:00.3:    [15] HeaderOF             
May 14 20:14:06 server kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:07:00.1
May 14 20:14:28 server kernel: vfio-pci 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0015 address=0x77b5f600 flags=0x0020]
May 14 20:14:28 server kernel: vfio-pci 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0015 address=0x77b5f610 flags=0x0020]
May 14 20:14:29 server kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:07:00.0
May 14 20:14:29 server kernel: vfio-pci 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
May 14 20:14:29 server kernel: vfio-pci 0000:07:00.0:   device [10de:2187] error status/mask=00008000/00000000
May 14 20:14:29 server kernel: vfio-pci 0000:07:00.0:    [15] HeaderOF

0000:07:00 - is my graphics card's group
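
To see how often each function of the card logs these corrected errors, something like the following could tally them from the kernel log. This is a rough sketch: the function name is made up, and it assumes the "PCIe Bus Error: severity=Corrected" message format shown in the log above.

```shell
# Count corrected PCIe bus errors per device address.
# Reads kernel log lines on stdin and prints "<address> <count>" per device.
count_corrected_errors() {
    sed -n 's/.* \([0-9a-f:.]*\): PCIe Bus Error: severity=Corrected.*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

It could be fed from the journal, for example with `journalctl -k -b | count_corrected_errors`.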

proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.13: 7.1-9
pve-kernel-5.4: 6.4-11
pve-kernel-5.15.35-1-pve: 5.15.35-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-5-pve: 5.13.19-13
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.157-1-pve: 5.4.157-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

Let me know if you need more details, didn't want to include too much.

Kodey · May 14, 2022

I keep getting this message in the startup logs after the VM fails to start:

Code:

kernel: NOTICE: System running low on memory, aborting L2ARC rebuild.

And this in the GUI task pane:

Code:

gpu-hookscript: Resetting GPU for Vitual Machine 101
swtpm_setup: Not overwriting existing state file.
kvm: unable to map backing store for guest RAM: Cannot allocate memory
stopping swtpm instance (pid 9216) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

leesteken · May 14, 2022

Kodey said:
kvm: unable to map backing store for guest RAM: Cannot allocate memory

Kodey said:
kernel: NOTICE: System running low on memory, aborting L2ARC rebuild.

I think your problem is not enough memory for the VMs you want to run. Remember that PCI passthrough disables ballooning and KSM (and the memory cannot be swapped). Also, unless you set limits for ZFS, it will take half your system's memory (eventually). I don't think passthrough is the problem here. Either set limits and lower the VM memory settings or add more memory to your system.
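
As a small illustration of setting a ZFS limit: the zfs_arc_max module parameter takes a value in bytes, so a helper like the one below can produce the modprobe options line from a size in GiB. The function name and the 16 GiB figure are just example assumptions, not a recommendation for any particular system.

```shell
# Print a modprobe options line that caps the ZFS ARC at the given number of GiB.
arc_limit_line() {
    local gib="$1"
    echo "options zfs zfs_arc_max=$(( $gib * 1024 * 1024 * 1024 ))"
}

arc_limit_line 16   # prints: options zfs zfs_arc_max=17179869184
```

The resulting line would typically go into a file under /etc/modprobe.d/. As far as I know, with root on ZFS the initramfs also has to be refreshed (update-initramfs -u) and the host rebooted before the new limit applies.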

Kodey · May 15, 2022

Is this related? From journalctl -b:

Code:

May 15 16:32:54 pmhost kernel: amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x003D to IRQ, err -517
May 15 16:32:54 pmhost kernel: fbcon: Taking over console

Kodey · May 15, 2022

leesteken said:
I think your problem is not enough memory for the VMs you want to run. Remember that PCI passthrough disables ballooning and KSM (and the memory cannot be swapped). Also, unless you set limits for ZFS, it will take half your system's memory (eventually). I don't think passthrough is the problem here. Either set limits and lower the VM memory settings or add more memory to your system.

Yes, but this was working and only started happening at the same time as this GPU passthrough problem, so there is a correlation.
These are already set:

Code:

root@pmhost:/etc/modprobe.d# cat zfs.conf
options zfs zfs_arc_max=17179869184
root@pmhost:/etc/modprobe.d# cat /sys/module/zfs/parameters/zfs_arc_max
17179869184

And the VM has ballooning disabled and its memory allocated at boot as hugepages, because it's too slow otherwise:

Code:

root@pmhost:/etc/modprobe.d# cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet amd_iommu=on iommu=pt kvm.ignore_msrs=1 vfio-pci.ids=1002:683f,1002:aab0 default_hugepagesz=1G hugepagesz=1G hugepages=64

root@pmhost:~# cat /etc/pve/nodes/pmhost/qemu-server/101.conf
agent: 1
balloon: 0
bios: ovmf
boot: order=sata0
cores: 4
cpu: host,flags=+amd-no-ssb;+aes
efidisk0: zfs16Tr10:vm-101-disk-0,size=1M
hookscript: local:snippets/gpu-hookscript.sh
hostpci0: 0000:0f:00,pcie=1,x-vga=1
hugepages: 1024
keephugepages: 1
machine: pc-q35-6.2
memory: 65536
name: Win11Pro
net0: virtio=4A:5B:D7:AD:F3:23,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: win11
parent: UpdateOtionsName
sata0: local-zfs:vm-101-disk-1,size=512G
sata1: zfs16Tr10:vm-101-disk-1,size=2T
scsihw: virtio-scsi-pci
smbios1: uuid=22027052-1790-4ed9-b8eb-677d0de4e8e0
sockets: 1
tpmstate0: zfs16Tr10:vm-101-disk-2,size=4M,version=v2.0
usb0: host=045e:0750
usb1: host=093a:2510
vga: none
vmgenid: 83e65fb3-643b-4295-97ca-f17b0a64bd63

You're probably right and there's no relation, but I wanted to see if anyone else experiencing this problem had a similar setup.
The system already has 128G of RAM, 64G of which is dedicated to hugepages and reserved for that VM. Perhaps it's more a configuration issue than a lack of memory.
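
One way to narrow this down might be to check, right before the VM starts, whether enough huge pages are actually free. The sketch below reads the HugePages_Free and Hugepagesize fields from /proc/meminfo. The function names and the MEMINFO override are assumptions for illustration, and it only looks at the default huge page size.

```shell
# Path is overridable so the functions can be tried against a saved copy of meminfo.
MEMINFO="${MEMINFO:-/proc/meminfo}"

# Print the amount of free huge page memory in MiB (default huge page size only).
free_hugepage_mib() {
    awk '/^HugePages_Free:/ { free = $2 }
         /^Hugepagesize:/   { size_kb = $2 }
         END { printf "%d\n", free * size_kb / 1024 }' "$MEMINFO"
}

# Succeed if the free huge pages cover a VM of the given size in MiB.
enough_hugepages() {
    local vm_mib="$1"
    [ "$(free_hugepage_mib)" -ge "$vm_mib" ]
}
```

For the VM above that would be `enough_hugepages 65536 || echo "not enough free hugepages"`, for example in the pre-start phase of the hookscript.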

need2gcm · May 16, 2022

nick.kopas said:

Only thing that worked for me was "sketchy workaround" offered by @StephenM64

To add: this worked for me, together with ENABLING ROM-Bar, setting the CPU type to kvm64, and disabling memory ballooning.

Previously I had ROM-Bar disabled and the CPU set to host just to get PCIe passthrough to work. Ballooning was working before, but does not seem to work with the 5.15 kernel and PCIe passthrough.

marcosscriven · May 18, 2022

I'm wondering if any of these fixes are likely to be incorporated into Proxmox, as currently I'm holding off upgrading from 7.1 to 7.2

nick.kopas · May 19, 2022

marcosscriven said:
I'm wondering if any of these fixes are likely to be incorporated into Proxmox, as currently I'm holding off upgrading from 7.1 to 7.2

There have already been a couple of kernel updates. At some point I'll back out my hookscript and see if video=simplefb:off actually does what it's supposed to.

marcosscriven · May 19, 2022

nick.kopas said:
There have already been a couple of kernel updates. At some point I'll back out my hookscript and see if video=simplefb:off actually does what it's supposed to.

I check https://git.proxmox.com/?p=pve-kernel.git;a=summary for changes, and there have only been two changes since this thread started. One is for a network controller, the other for NFS.

hardwareadictos · May 22, 2022

Same issue here, rolled back to 7.1 kernel: 5.13.19-6-pve

Cybercreator · May 22, 2022

Good afternoon. I updated on the day the release came out. My mistake was updating all servers at once. Because of the GPU passthrough issue, I had to roll back. So far, none of the above methods have helped. I continue to watch this thread in the hope of finding a solution that works for me.

paulmorabi · May 22, 2022

Sadly I now have this problem after upgrading with my RX580. The 1050ti I have is working fine. I tried the script but after running I get:

[drm:amdgpu_init [amdgpu]] *ERROR* VGACON disables amdgpu kernel modesetting.

No BAR 0 errors, but also a black screen.

I've reverted back to kernel 5.13.19-6-pve for now.

leesteken · May 22, 2022

paulmorabi said:
Sadly I now have this problem after upgrading with my RX580. The 1050ti I have is working fine. I tried the script but after running I get:

[drm:amdgpu_init [amdgpu]] *ERROR* VGACON disables amdgpu kernel modesetting.

No BAR 0 errors, but also a black screen.

I've reverted back to kernel 5.13.19-6-pve for now.

I have a RX570 (used during boot and the Proxmox console) and it gives no such errors with amdgpu (but I don't blacklist or rescan the pci bus). Maybe my approach for pve-kernel-5.15 can get yours to work too. Can you tell me your cat /proc/cmdline and other /etc/modprobe.d/... options that might be involved? Are you using vendor-reset also?

paulmorabi · May 23, 2022

leesteken said:
I have a RX570 (used during boot and the Proxmox console) and it gives no such errors with amdgpu (but I don't blacklist or rescan the pci bus). Maybe my approach for pve-kernel-5.15 can get yours to work too. Can you tell me your cat /proc/cmdline and other /etc/modprobe.d/... options that might be involved? Are you using vendor-reset also?

@leesteken thank you, am curious how you got it working. My settings are:

Code:

root@pve:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.13.19-6-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt pci=noats pcie_acs_override=downstream,multifunction nofb nomodeset video=efifb:off video=vesafb:off video=simplefb:off

(Note: it's currently using the older kernel.)

Code:

root@pve:~# cat /etc/modprobe.d/blacklist.conf
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist snd_hda_intel

root@pve:~# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=1b73:1100,10de:1c82,10de:0fb9,1002:67df,1002:aaf0 disable_vga=1

The rest looks to be the standard PVE stuff. I fortunately didn't need the vendor reset previously. Occasionally I'd get a freeze or crash on reset but we are talking one or two times over several months of uptime.

leesteken · May 23, 2022

paulmorabi said:

Code:

root@pve:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.13.19-6-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt pci=noats pcie_acs_override=downstream,multifunction nofb nomodeset video=efifb:off video=vesafb:off video=simplefb:off

I stopped using nofb nomodeset video=efifb:off video=vesafb:off video=simplefb:off and let the RX570 be used for boot messages and console.

I stopped blacklisting amdgpu, but you did not do that to begin with.

I removed the ids for the VGA part of the GPU (yours is probably 1002:67df) but I do bind the audio part to vfio-pci (and also blacklist snd_hda_intel, as you do).

That allows me to boot with the RX570, see kernel messages and have a console until I start the VM. amdgpu appears to fix the framebuffer-iomem problem and it does release the GPU to vfio-pci nicely (at the moment). I do need vendor-reset to make sure that the RX570 resets properly after being used.
I use the following commands before starting the VM (pre-start) where ${GPU} is 0000:0b:00.0 in my case:

echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
echo 'device_specific' >"/sys/bus/pci/devices/${GPU}/reset_method"
sleep 1
echo "${GPU}" > "/sys/bus/pci/devices/${GPU}/driver/unbind"

Those commands make sure Proxmox is not using amdgpu anymore and vendor-reset is actually used. I can even reconnect the GPU to Proxmox again when the VM is shut down.
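
For reference, the commands above could be wrapped into a hookscript roughly like this. It is only a sketch and not my exact script: the release_gpu name, the GPU default and the SYSFS override (for trying it against a scratch directory) are assumptions, and the existence checks are added so that a missing reset_method file or an already unbound device does not produce errors.

```shell
#!/bin/bash
# Sketch only: GPU and SYSFS are assumed names, adjust GPU to your device.
GPU="${GPU:-0000:0b:00.0}"
SYSFS="${SYSFS:-/sys}"

release_gpu() {
    local dev="$SYSFS/bus/pci/devices/$GPU"
    local vt
    # Detach the virtual consoles so the host console lets go of the GPU
    for vt in "$SYSFS"/class/vtconsole/vtcon*/bind; do
        if [ -e "$vt" ]; then
            echo 0 > "$vt"
        fi
    done
    # reset_method is not present on older kernels, so only write it if it exists
    if [ -e "$dev/reset_method" ]; then
        echo 'device_specific' > "$dev/reset_method"
    fi
    sleep 1
    # Unbind from the current driver, if one is bound
    if [ -e "$dev/driver/unbind" ]; then
        echo "$GPU" > "$dev/driver/unbind"
    fi
}

# Proxmox invokes the hookscript as: <script> <vmid> <phase>
if [ "$2" = "pre-start" ]; then
    release_gpu
fi
```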

paulmorabi · May 26, 2022

leesteken said:
I stopped using nofb nomodeset video=efifb:off video=vesafb:off video=simplefb:off and let the RX570 be used for boot messages and console.

I stopped blacklisting amdgpu, but you did not do that to begin with.

I removed the ids for the VGA part of the GPU (yours is probably 1002:67df) but I do bind the audio part to vfio-pci (and also blacklist snd_hda_intel, as you do).

That allows me to boot with the RX570, see kernel messages and have a console until I start the VM. amdgpu appears to fix the framebuffer-iomem problem and it does release the GPU to vfio-pci nicely (at the moment). I do need vendor-reset to make sure that the RX570 resets properly after being used.
I use the following commands before starting the VM (pre-start) where ${GPU} is 0000:0b:00.0 in my case:
echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
echo 'device_specific' >"/sys/bus/pci/devices/${GPU}/reset_method"
sleep 1
echo "${GPU}" > "/sys/bus/pci/devices/${GPU}/driver/unbind"
Those commands make sure Proxmox is not using amdgpu anymore and vendor-reset is actually used. I can even reconnect the GPU to Proxmox again when the VM is shut down.

Hi,

Thanks for this. I did the above but it is not working. Firstly, the script:

Code:

#!/bin/bash

echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
echo 'device_specific' >"/sys/bus/pci/devices/0000:1c:00.0/reset_method"
sleep 1
echo "0000:1c:00.0" > "/sys/bus/pci/devices/0000:1c:00.0/driver/unbind"

When I run it from a fresh reboot it looks like this:

Code:

root@pve:~# sh /var/lib/vz/snippets/gpu-hookscript.sh
/var/lib/vz/snippets/gpu-hookscript.sh: 4: echo: echo: I/O error
/var/lib/vz/snippets/gpu-hookscript.sh: 6: cannot create /sys/bus/pci/devices/0000:1c:00.0/driver/unbind: Directory nonexistent
root@pve:~# qm start 102
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,host=0000:1c:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on,romfile=/usr/share/kvm/EllesmereRX580.rom: Failed to mmap 0000:1c:00.0 BAR 0. Performance may be slow

And BAR 0 errors ensue. If I try to run it again, the script output is a little different:

Code:

root@pve:~# qm stop 102
root@pve:~# sh /var/lib/vz/snippets/gpu-hookscript.sh
/var/lib/vz/snippets/gpu-hookscript.sh: 4: echo: echo: I/O error

So the main problem seems to be line 4. If I start the VM again, it issues the same errors.

Any ideas what this could be?

leesteken · May 26, 2022

paulmorabi said:
Hi,

Thanks for this. I did the above but it is not working. Firstly, the script:

Code:

#!/bin/bash

echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
echo 'device_specific' >"/sys/bus/pci/devices/0000:1c:00.0/reset_method"
sleep 1
echo "0000:1c:00.0" > "/sys/bus/pci/devices/0000:1c:00.0/driver/unbind"

When I run it from a fresh reboot it looks like this:

Code:

root@pve:~# sh /var/lib/vz/snippets/gpu-hookscript.sh
/var/lib/vz/snippets/gpu-hookscript.sh: 4: echo: echo: I/O error
/var/lib/vz/snippets/gpu-hookscript.sh: 6: cannot create /sys/bus/pci/devices/0000:1c:00.0/driver/unbind: Directory nonexistent
root@pve:~# qm start 102
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,host=0000:1c:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on,romfile=/usr/share/kvm/EllesmereRX580.rom: Failed to mmap 0000:1c:00.0 BAR 0. Performance may be slow

And bar 0 errors ensue. If I try to run it again, the script output is a little different:

Code:

root@pve:~# qm stop 102
root@pve:~# sh /var/lib/vz/snippets/gpu-hookscript.sh
/var/lib/vz/snippets/gpu-hookscript.sh: 4: echo: echo: I/O error

So the main problem seems to be line 4. If I start the VM again, it issues the same errors.

Any ideas what this could be?

echo 'device_specific' >"/sys/bus/pci/devices/0000:1c:00.0/reset_method" only works on (and is only necessary for) kernel 5.15. If you are using 5.13, you can ignore this line and/or this error.
Your error about echo "0000:1c:00.0" > "/sys/bus/pci/devices/0000:1c:00.0/driver/unbind" indicates that amdgpu is not loaded for the GPU, which is essential for my solution. It means that you are not doing the same thing as I do. You can check whether amdgpu is loaded for the GPU using lspci -ks 1c:00.0.

Please note that my setup fixed a different error than Failed to mmap 0000:1c:00.0 BAR 0. Performance may be slow. For me, it fixed BAR 0 cannot reserve memory (or something), which only occurs with kernel 5.15. Maybe I misunderstood your question and/or original problem, and my solution won't help you at all. Or you need to switch to the latest pve-kernel-5.15 and make sure amdgpu is used before running the script (and vendor-reset needs to be installed).
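
To make the driver check scriptable, the "Kernel driver in use:" line from lspci -k can be extracted with a small filter. A minimal sketch, with driver_in_use being a made-up helper name:

```shell
# Read `lspci -k` output on stdin and print the driver currently bound to the device.
# Prints nothing if no driver is bound.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}
```

Used as `lspci -ks 1c:00.0 | driver_in_use`, it should print amdgpu before the script runs if the setup matches mine, and vfio-pci (or nothing) if amdgpu never claimed the GPU.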
