[SOLVED] Second GPU PASSTHROUGH not working

teokoul

New Member
Hello there,

I would like some help passing through a second GPU on my system.
I have been successfully passing through a GPU (a GTX 1050 Ti) for 6 months now.
Everything has been working very well.

------------------------------------------------------------

I decided to install another GPU in my system, so I bought a GeForce GT 1030.
The plan is to run two VMs:

VM No. 1 runs with the GTX 1050 Ti.
VM No. 2 runs with the GT 1030.

The situation right now:

I am able to run a VM with the GTX 1050 Ti. Everything works well: no Error 43, no crashes.

I made a new VM to pass through the GT 1030. After following this tutorial and the wiki, I tried to add the PCI device to the VM. When I press the START button, the GUI gets stuck on "Loading" and then shows "Connection Error".

If I ping the host, I get "Request timed out".

I have to cut the power to the system to bring it back up.

------------------------------------------------------------

THINGS I TRIED

1. Removed the old GTX 1050 Ti and left only the GT 1030 (the GPU works fine in the BIOS etc.). When I try to start the VM, the system still crashes.
2. Changed some settings, like pci=1 and vga_on=1.

------------------------------------------------------------

My build is:
MOBO: GA-AB350-GAMING-3
CPU: AMD Ryzen 7 1800X
GPU 1: GTX 1050 Ti
GPU 2: GT 1030
RAM: 64 GB
NVMe: Samsung 860 EVO (for Proxmox)
SSD: Samsung 870 EVO (for the VMs)
HDD: 2 x 4 TB Toshiba (for FreeNAS storage, already set up)

------------------------------------------------------------

VM config with working GPU passthrough (GTX 1050 Ti):

Code:
bios: ovmf
boot: order=ide0;ide2
cores: 4
efidisk0: local-lvm:vm-100-disk-1,size=4M
hostpci0: 09:00,pcie=1
ide0: local-lvm:vm-100-disk-0,size=150G
ide2: local:iso/virtio-win-0.1.190.iso,media=cdrom,size=489986K
machine: q35
memory: 25000
name: WindowsGPU
net0: virtio=26:4C:BF:D2:C6:53,bridge=vmbr0
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=1626a95e-08d1-4e39-a71b-685101e4fd98
sockets: 4
vmgenid: 9e737340-79bf-4273-a44d-1788345e1322

------------------------------

VM config with non-working GPU passthrough (GT 1030):

Code:
bios: ovmf
boot: order=ide0;ide2;net0
cores: 1
efidisk0: SSD_VMs:vm-104-disk-1,size=4M
hostpci0: 07:00,pcie=1
ide0: SSD_VMs:vm-104-disk-0,size=100G
ide2: local:iso/virtio-win-0.1.190.iso,media=cdrom,size=489986K
machine: q35
memory: 4000
name: test
net0: e1000=32:64:F3:95:2B:7D,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=5ef33a30-2822-4522-9a3f-9ae6f3c2f644
sockets: 1
vmgenid: b9c6bc29-035b-437c-88cf-29a12aec3eec

------------------------------

This is the only GRUB line that works to pass through the GTX 1050 Ti.
I am afraid to change GRUB; it took me weeks to get the first passthrough working.

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on textonly video=astdrmfb video=efifb:off"

Note: I'm not experienced with Proxmox.

------------------------------

The contents of /etc/modules:

Code:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

------------------------------

The lspci -v output for the GPUs:

Code:
07:00.0 VGA compatible controller: NVIDIA Corporation GP108 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Gigabyte Technology Co., Ltd GP108 [GeForce GT 1030]
        Flags: fast devsel, IRQ 11
        Memory at f4000000 (32-bit, non-prefetchable) [disabled] [size=16M]
        Memory at c0000000 (64-bit, prefetchable) [disabled] [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [disabled] [size=32M]
        I/O ports at d000 [disabled] [size=128]
        Expansion ROM at f5000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau

07:00.1 Audio device: NVIDIA Corporation GP108 High Definition Audio Controller (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd GP108 High Definition Audio Controller
        Flags: bus master, fast devsel, latency 0, IRQ 10
        Memory at f5080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

09:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GP107 [GeForce GTX 1050 Ti]
        Flags: bus master, fast devsel, latency 0, IRQ 4
        Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at f0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at f000 [size=128]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau

09:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
        Subsystem: ASUSTeK Computer Inc. GP107GL High Definition Audio Controller
        Flags: bus master, fast devsel, latency 0, IRQ 11
        Memory at f7080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

------------------------------

The output of lspci -n -s 07:00:

Code:
07:00.0 0300: 10de:1d01 (rev a1)
07:00.1 0403: 10de:0fb8 (rev a1)

The output of lspci -n -s 09:00:

Code:
09:00.0 0300: 10de:1c82 (rev a1)
09:00.1 0403: 10de:0fb9 (rev a1)

------------------------------

The contents of /etc/modprobe.d/vfio.conf:

Code:
options vfio-pci ids=10de:1c82,10de:0fb9,10de:1d01,10de:0fb8 disable_vga=1

------------------------------

The output of find /sys/kernel/iommu_groups/ -type l:

Code:
/sys/kernel/iommu_groups/7/devices/0000:00:18.3
/sys/kernel/iommu_groups/7/devices/0000:00:18.1
/sys/kernel/iommu_groups/7/devices/0000:00:18.6
/sys/kernel/iommu_groups/7/devices/0000:00:18.4
/sys/kernel/iommu_groups/7/devices/0000:00:18.2
/sys/kernel/iommu_groups/7/devices/0000:00:18.0
/sys/kernel/iommu_groups/7/devices/0000:00:18.7
/sys/kernel/iommu_groups/7/devices/0000:00:18.5
/sys/kernel/iommu_groups/5/devices/0000:00:08.0
/sys/kernel/iommu_groups/5/devices/0000:12:00.2
/sys/kernel/iommu_groups/5/devices/0000:12:00.0
/sys/kernel/iommu_groups/5/devices/0000:00:08.1
/sys/kernel/iommu_groups/5/devices/0000:12:00.3
/sys/kernel/iommu_groups/3/devices/0000:00:04.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/6/devices/0000:00:14.3
/sys/kernel/iommu_groups/6/devices/0000:00:14.0
/sys/kernel/iommu_groups/4/devices/0000:00:07.0
/sys/kernel/iommu_groups/4/devices/0000:11:00.2
/sys/kernel/iommu_groups/4/devices/0000:11:00.0
/sys/kernel/iommu_groups/4/devices/0000:00:07.1
/sys/kernel/iommu_groups/4/devices/0000:11:00.3
/sys/kernel/iommu_groups/2/devices/0000:00:03.1
/sys/kernel/iommu_groups/2/devices/0000:09:00.0
/sys/kernel/iommu_groups/2/devices/0000:09:00.1
/sys/kernel/iommu_groups/2/devices/0000:00:03.0
/sys/kernel/iommu_groups/0/devices/0000:03:00.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/0/devices/0000:01:00.0
/sys/kernel/iommu_groups/0/devices/0000:07:00.0
/sys/kernel/iommu_groups/0/devices/0000:03:00.1
/sys/kernel/iommu_groups/0/devices/0000:00:01.3
/sys/kernel/iommu_groups/0/devices/0000:04:01.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.1
/sys/kernel/iommu_groups/0/devices/0000:04:04.0
/sys/kernel/iommu_groups/0/devices/0000:05:00.0
/sys/kernel/iommu_groups/0/devices/0000:07:00.1
/sys/kernel/iommu_groups/0/devices/0000:04:00.0
/sys/kernel/iommu_groups/0/devices/0000:03:00.2

------------------------------

EDIT:

These are the syslog lines from the moment of the crash:

Code:
Sep  7 15:40:00 pve systemd[1]: Starting Proxmox VE replication runner...
Sep  7 15:40:00 pve systemd[1]: pvesr.service: Succeeded.
Sep  7 15:40:00 pve systemd[1]: Started Proxmox VE replication runner.
Sep  7 15:40:08 pve rrdcached[1195]: handle_request_update: Could not read RRD file.
Sep  7 15:40:08 pve pmxcfs[1199]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Sep  7 15:40:08 pve pmxcfs[1199]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/104: mmaping file '/var/lib/rrdcac$
Sep  7 15:40:18 pve rrdcached[1195]: handle_request_update: Could not read RRD file.
Sep  7 15:40:18 pve pmxcfs[1199]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Sep  7 15:40:18 pve pmxcfs[1199]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/104: mmaping file '/var/lib/rrdcac$
Sep  7 15:40:28 pve rrdcached[1195]: handle_request_update: Could not read RRD file.
Sep  7 15:40:28 pve pmxcfs[1199]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Sep  7 15:40:28 pve pmxcfs[1199]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/104: mmaping file '/var/lib/rrdcac$
Sep  7 15:40:38 pve rrdcached[1195]: handle_request_update: Could not read RRD file.
Sep  7 15:40:38 pve pmxcfs[1199]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Sep  7 15:40:38 pve pmxcfs[1199]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/104: mmaping file '/var/lib/rrdcac$
Sep  7 15:40:42 pve pvedaemon[1245]: <root@pam> end task UPID:pve:00000624:0000273D:61375B12:vncshell::root@pam: OK
Sep  7 15:40:42 pve systemd[1]: session-1.scope: Succeeded.
Sep  7 15:40:48 pve rrdcached[1195]: handle_request_update: Could not read RRD file.
Sep  7 15:40:48 pve pmxcfs[1199]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Sep  7 15:40:48 pve pmxcfs[1199]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/104: mmaping file '/var/lib/rrdcac$
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@$

After that, the host started up again.

------------------------------

Thank you very much for your time.
TeoKoul

------------------------------
 
Your second GPU is in an IOMMU group with other devices. Devices in the same group cannot be shared between VMs or between a VM and the host. As soon as you start the VM, the Proxmox host loses the other devices in the group, which include a USB, a SATA, and a network controller. Because Proxmox can no longer access its drives, you get the "Could not read RRD file" messages (and a system freeze), and you get the loading animation because the network connection is gone as well.
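
If you want to double-check this, something like the following should list every device that sits in the same IOMMU group as the GT 1030 (just a sketch, using the 07:00.0 address from your lspci output; adjust it if the address ever changes):

Code:
# find the GT 1030's IOMMU group number, then list everything in that group
group=$(basename "$(readlink /sys/bus/pci/devices/0000:07:00.0/iommu_group)")
for dev in /sys/kernel/iommu_groups/"$group"/devices/*; do
    lspci -nns "$(basename "$dev")"
done

It is the same information as your find output above, just filtered to the one group and with device names attached.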
 
Thank you very much for your reply.

I understand...

So I should find a way to separate the IOMMU groups.

In the Proxmox wiki I read :

To have separate IOMMU groups, your processor needs to have support for a feature called ACS (Access Control Services). Make sure you enable the corresponding setting in your BIOS for this.

I think there is no ACS option in my BIOS. I will check tomorrow.

If ACS does not exist, what is the best option?

In this forum I read that a BIOS update might result in better IOMMU groups.

Also, I read:

adding "pcie_acs_override=downstream" to kernel boot commandline (grub or systemd-boot) options, which can help on some setup with bad ACS implementation.

What about this?

------------------------------

Thank you for your time!
 
The IOMMU groups are determined by the motherboard hardware and BIOS. Only those know which devices are actually and securely isolated from each other. You cannot really change that, except by trying other PCIe slots (which won't help on your motherboard). The ACS is not really broken, just limited (as is the number of PCIe lanes available on Ryzen).
Some BIOS versions have better groups, sometimes older versions work better (with older CPUs), and some versions just break passthrough completely. The X570 (non-S) has much better groups and even X470 isolates the two x16 slots, but B350 and B450 are much more limited.

You can tell the Linux kernel to "ignore" the groups by using the pcie_acs_override=downstream kernel parameter. Proxmox has this (unofficial) patch already built in. This splits the groups further, but there is no guarantee that it actually works with your hardware. Please note that it allows you to pass devices to different VMs (and the host) that can transfer data to one another and are not properly isolated. This is a security risk if you run untrusted software or allow untrusted users on those VMs. Feel free to try it and let us know if it works or not.
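
If you decide to try it, the parameter goes on the same GRUB_CMDLINE_LINUX_DEFAULT line you already have, for example (a sketch based on the line you posted, so double-check it against your own /etc/default/grub):

Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on textonly video=astdrmfb video=efifb:off pcie_acs_override=downstream"

Then run update-grub and reboot for it to take effect.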
 

Thank you very much for all the information!
I solved the problem.

I am now able to pass through both GPUs to their VMs.

The only thing I did was add pcie_acs_override=downstream,multifunction to GRUB.

Then update-grub and... voila!
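
In case it helps someone else: the effect can be verified after the reboot by re-running the group listing from my first post and checking that the GT 1030 (07:00.0 / 07:00.1) no longer shares a group with the SATA, USB and network controllers (a quick sketch, nothing more):

Code:
find /sys/kernel/iommu_groups/ -type l | sort -V
# or print just the GT 1030's group:
readlink /sys/bus/pci/devices/0000:07:00.0/iommu_group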

Is there anything to be careful about?
I am asking because I have read on the internet that this patch can be a bit tricky.

MARKED AS SOLVED

------------------------------

Thank you for your time!
 
