Terrible PCIE passthrough performance

jake000 · May 24, 2022

Hi,

I am trying to set up a passthrough configuration to an Ubuntu server for media transcoding.
I finally got GPU passthrough working a few days ago, but the performance is worse than it was without passthrough over noVNC. There is massive screen tearing with media, to the point of unwatchability; I cannot use GPU acceleration or OpenCL with video programs like VLC; and xrdp is like watching paint dry whenever anything has to be rendered. radeontop and htop both show relatively low usage. radeontop does spike from time to time, so the GPU is definitely being used to some degree. The only item that doesn't move in radeontop is the memory clock, which is stuck at inf% (I am assuming that is just a bug related to the old GPU).
I have also tried the zink driver to see if it made any difference, which it did only to a placebo degree (it may show up in the glxinfo output).

My specs are as follows:

Specs :
CPU: intel 5820k
GPU: Radeon HD 5750
Ram: 8GB
PVE : 7.2

VM Specs

CPU: host - 4 cores rest default
GPU: Radeon HD 5750 - romfile, x-vga and PCIE all on.
Ram: 3gb
VM : Ubuntu 21.10
Machine: Q35
BIOS: SeaBIOS
IOMMU = enabled

VFIO conf

Code:

options vfio-pci ids=1002:68be,1002:aa58 vga=none
options vfio_iommu_type1 allow_unsafe_interrupts=1

Boot options

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on radeon.drm=1 radeon.runpm=0 iommu=pt video=efifb:off video=simplefb:off nofb nomodeset kvm.ignore_msrs=1 pcie_acs_override=downstream,multifunction"

Lspci -n -s


02:00.0 0300: 1002:68be
02:00.1 0403: 1002:aa58
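
A quick cross-check of the two snippets above can be sketched in Python. This is only an illustration: the function names are made up for this example, and it assumes the modprobe options and the `lspci -n -s` output are available as plain strings in the formats shown in this post. It reports any listed PCI function whose vendor:device ID is not in the vfio-pci `ids=` list.

```python
import re


def vfio_ids(modprobe_conf):
    """Collect vendor:device IDs from 'options vfio-pci ... ids=...' lines."""
    ids = set()
    for line in modprobe_conf.splitlines():
        match = re.search(r"^\s*options\s+vfio-pci\s+.*\bids=(\S+)", line)
        if match:
            ids.update(match.group(1).lower().split(","))
    return ids


def lspci_ids(lspci_n_output):
    """Map PCI address -> vendor:device ID from 'lspci -n' style lines."""
    result = {}
    pattern = r"\s*(\S+)\s+[0-9a-fA-F]{4}:\s+([0-9a-fA-F]{4}:[0-9a-fA-F]{4})"
    for line in lspci_n_output.splitlines():
        match = re.match(pattern, line)
        if match:
            result[match.group(1)] = match.group(2).lower()
    return result


def missing_from_vfio(modprobe_conf, lspci_n_output):
    """Return the functions whose ID is not listed in the vfio-pci options."""
    ids = vfio_ids(modprobe_conf)
    return {
        address: device_id
        for address, device_id in lspci_ids(lspci_n_output).items()
        if device_id not in ids
    }


# Values copied from this post.
conf = "options vfio-pci ids=1002:68be,1002:aa58 vga=none"
listing = "02:00.0 0300: 1002:68be\n02:00.1 0403: 1002:aa58"
print(missing_from_vfio(conf, listing))
```

An empty result means both the VGA function and the HDMI audio function are covered by the `ids=` list. It says nothing about whether the driver actually bound to them.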

LSPCI

Code:

00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Juniper PRO [Radeon HD 5750]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Juniper HDMI Audio [Radeon HD 5700 Series]
05:01.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
05:02.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
05:03.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
05:04.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
06:03.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
06:05.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
06:12.0 Ethernet controller: Red Hat, Inc. Virtio network device

DMESG + GLXINFO
https://pastebin.com/herMgnyG

dcsapak · May 24, 2022

Can you post your VM config (qm config ID)?
Also, how do you access the VM's display? When you use passthrough, you have to plug a monitor into the GPU and use that, or run an RDP/VNC server inside the guest (or use something like Looking Glass).

LnxBil · May 24, 2022

The graphics card seems to be very old (from 2009). I also tried getting an older card to work and failed after hours and hours of trying. Plugged in a newer card and it just works.

jake000 · May 25, 2022

dcsapak said:
Can you post your VM config (qm config ID)?
Also, how do you access the VM's display? When you use passthrough, you have to plug a monitor into the GPU and use that, or run an RDP/VNC server inside the guest (or use something like Looking Glass).

Code:

boot: order=scsi0;ide2;net0
cores: 4
cpu: host,hidden=1,flags=-pcid;+aes
cpuunits: 2048
hookscript: local:snippets/gpu-hookscript.sh
hostpci0: 0000:02:00,pcie=1,romfile=vbios.bin,x-vga=1
ide2: local:iso/ubuntu-21.10-live-server-amd64.iso,media=cdrom,size=1239626K
machine: q35
memory: 3072
meta: creation-qemu=6.2.0,ctime=1652940207
name: media
net0: virtio=F2:CE:E5:EB:FF:EF,bridge=vmbr0
numa: 0
ostype: l26
scsi0: local-lvm:vm-102-disk-0,backup=0,size=400G
scsihw: virtio-scsi-pci
smbios1: uuid=f62ea551-319e-4b1e-9526-d5e2a07e1032
sockets: 1
vga: none
vmgenid: b4573f81-fe17-42e0-87a3-aff64867a119

I use desktop sharing and also have a monitor plugged in. I get the same results either way (I have also tried xrdp's standard 'add 1 display' configuration, with the same results). Streaming is incredibly slow with x2go, VNC or TeamViewer. The monitor looks fine, but the tearing and lagging are evident, which is what I want to fix.

jake000 · May 25, 2022

LnxBil said:
The graphics card seems to be very old (from 2009). I also tried getting an older card to work and failed after hours and hours of trying. Plugged in a newer card and it just works.

Yeah, it is fairly old, but I can confirm it works fine under Linux natively, so surely age doesn't play that big of a part, right? It takes approx. 2 minutes to render the first frame in x2go :/

dcsapak · May 25, 2022

anything in the host or guest syslog/dmesg that stands out?

LnxBil · May 25, 2022

jake000 said:
Yeah, it is fairly old, but I can confirm it works fine under Linux natively, so surely age doesn't play that big of a part, right? It takes approx. 2 minutes to render the first frame in x2go :/

I also tried a GPU that works fine without passthrough. After many hours of tweaking, the card "worked" inside the VM for a few minutes and then crashed the PVE host. I replaced it with a 1080 Ti and it just works out of the box. Nvidia Quadros also work very well in my experience, even the slightly older generations.

jake000 · May 26, 2022

dcsapak said:
anything in the host or guest syslog/dmesg that stands out?

Not a lot does. The one thing that might be worth mentioning:
[ 0.254682] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:1c.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
That could be a cause, but I have no idea how to fix it.

Do you know whether nomodeset affects acceleration?

LnxBil said:
I also tried a GPU that works fine without passthrough. After many hours of tweaking, the card "worked" inside the VM for a few minutes and then crashed the PVE host. I replaced it with a 1080 Ti and it just works out of the box. Nvidia Quadros also work very well in my experience, even the slightly older generations.

Surely, though, the only answer can't be 'buy better hardware'. There are lots of people on this board running even the old GT GPUs.

leesteken · May 26, 2022

jake000 said:
[ 0.254682] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:1c.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
That could be a cause, but I have no idea how to fix it.

Looks like the kernel is telling you that device 01:00.0 is a PCIe x16 card in a x1 slot (connected via host/PCI bridge 00:1c.0), which reduces the maximum bandwidth by 93.75%. If you have another empty x16 slot, you might want to move that card (but beware that the PCI IDs can change because of that). I see similar messages on my system because I have two x16 GPUs in x8 slots.

I'm confused because here you show that 01:00.0 is the HD 5750 (under LSPCI), but you showed the numeric device IDs from 02:00.0, and the VM configuration here also uses 02:00.0.
I think you moved the HD 5750 from an x16 slot to an x1 slot (which might look like an x16 but with fewer actual pins). That might explain why it is slower than expected.
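
The arithmetic behind the 93.75% figure can be sketched as follows. This is a rough illustration, not a measurement: it assumes the usual per-lane line rates and encoding overheads for PCIe gen1 to gen3, and the function names are only for this example.

```python
# Per-lane line rate in GT/s and encoding efficiency per PCIe generation.
# Gen1 and gen2 use 8b/10b encoding, gen3 uses 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
}


def link_bandwidth_gbps(generation, lanes):
    """Usable bandwidth in Gb/s for a link of the given generation and width."""
    rate, efficiency = PCIE_GENERATIONS[generation]
    return rate * efficiency * lanes


def reduction_percent(actual, capable):
    """How much of the capable bandwidth is lost, in percent."""
    return (1 - actual / capable) * 100


x1 = link_bandwidth_gbps(1, 1)    # the link the kernel says is in use
x16 = link_bandwidth_gbps(1, 16)  # the link the card is capable of
print(f"x1: {x1:.3f} Gb/s, x16: {x16:.3f} Gb/s")
print(f"reduction: {reduction_percent(x1, x16):.2f}%")
```

With a 2.5 GT/s link this gives 2 Gb/s for x1 and 32 Gb/s for x16, which matches the two numbers in the kernel message.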

jake000 · May 27, 2022

leesteken said:
Looks like the kernel is telling you that device 01:00.0 is a PCIe x16 card in a x1 slot (connected via host/PCI bridge 00:1c.0), which reduces the maximum bandwidth by 93.75%. If you have another empty x16 slot, you might want to move that card (but beware that the PCI IDs can change because of that). I see similar messages on my system because I have two x16 GPUs in x8 slots.

I'm confused because here you show that 01:00.0 is the HD 5750 (under LSPCI), but you showed the numeric device IDs from 02:00.0, and the VM configuration here also uses 02:00.0.
I think you moved the HD 5750 from an x16 slot to an x1 slot (which might look like an x16 but with fewer actual pins). That might explain why it is slower than expected.

Sorry for the confusion. Curiously enough, that message was from the VM itself. Is the VM meant to report the same hardware ID as the host?
Even more curiously, the card is in an x16 slot.
Edit: I just tried removing most of my startup parameters, including nomodeset, yet I am still having issues.

leesteken · May 27, 2022

jake000 said:
Sorry for the confusion. Curiously enough, that message was from the VM itself. Is the VM meant to report the same hardware ID as the host?

Okay, that clears up my confusion. The PCI(e) ID depends on the place of the card in the (real or virtual) PCI(e) layout. And it will usually be different, so that's fine.
Can you please confirm that you got the bandwidth message on the host? I do still think it is the cause of the performance issue (especially since the card is PCIe gen1).

jake000 said:
Even more curiously, the card is in an x16 slot.

Is it electrically an x16 slot (are there metal pins from the beginning, at the back of the case, to the end)? Maybe there is a bottleneck internally (a PCI bridge) on the motherboard? Can you tell me the brand and model of the motherboard or provide a link to the motherboard manual?

jake000 said:
Edit: I just tried removing most of my startup parameters, including nomodeset, yet I am still having issues.

I don't think any of those would reduce performance, unless the card is thermal throttling because you disabled power management.
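
One way to check the negotiated link on the host without reading dmesg is to compare the link attributes the kernel exposes in sysfs. The following is a minimal sketch: it assumes the `current_link_width`, `max_link_width`, `current_link_speed` and `max_link_speed` files are present for the device, and the helper names are made up for this example.

```python
from pathlib import Path


def read_link_status(address, sysfs_root="/sys/bus/pci/devices"):
    """Read negotiated and maximum PCIe link width/speed for one device.

    `address` is the full PCI address including the domain,
    for example "0000:02:00.0".
    """
    device = Path(sysfs_root) / address

    def read(name):
        return (device / name).read_text().strip()

    return {
        "current_speed": read("current_link_speed"),
        "max_speed": read("max_link_speed"),
        "current_width": int(read("current_link_width")),
        "max_width": int(read("max_link_width")),
    }


def is_width_downgraded(status):
    """True when the link runs with fewer lanes than the device supports."""
    return status["current_width"] < status["max_width"]
```

On the host, `read_link_status("0000:02:00.0")` would be the call for the address used in the VM configuration above. A current width of 1 against a maximum of 16 would agree with the kernel message.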

jake000 · Jun 7, 2022

leesteken said:
Okay, that clears up my confusion. The PCI(e) ID depends on the place of the card in the (real or virtual) PCI(e) layout. And it will usually be different, so that's fine.
Can you please confirm that you got the bandwidth message on the host? I do still think it is the cause of the performance issue (especially since the card is PCIe gen1).

Is it electrically an x16 slot (are there metal pins from the beginning, at the back of the case, to the end)? Maybe there is a bottleneck internally (a PCI bridge) on the motherboard? Can you tell me the brand and model of the motherboard or provide a link to the motherboard manual?

I don't think any of those would reduce performance, unless the card is thermal throttling because you disabled power management.

Sorry for the late reply, was occupied with a holiday.

Yes, curiously, I get the same message on the host.
I tried another slot and the VM black-screens with no message. If I boot with a fake VGA connected for noVNC, it will not go past tty0 (yes, I did change the hardware IDs to correspond with the changes). I can switch to tty1, but it black-screens again on startx. (I have never been able to get it to work with a fake display connected anyway; I get BAR0 CAFEDEAD in the dmesg with a fake display in either case, but this doesn't worry me since I am using a physical display.)

I checked the slot and it seems to be metal all the way through. It's the motherboard of an HP Z440 with an x99 chipset.
Here is the slot configuration. The logs above are from slot 2; the card is now in slot 5, with the issues described above: https://avid.secure.force.com/pkb/a.../HP-Z440-Workstation-Slot-Order-Configuration
And here is a manual database: https://support.hp.com/au-en/product/hp-z440-workstation/6978828/manuals

The card doesn't feel hot. I tried another distribution (Manjaro) to test it out and removed the power management options, but I get the same issue as above in both cases.

leesteken · Jun 7, 2022

jake000 said:
Yes, curiously, I get the same message on the host.

I guess both kernels notice a bottleneck in the PCIe layout.

jake000 said:
I tried another slot and the VM black-screens with no message. If I boot with a fake VGA connected for noVNC, it will not go past tty0 (yes, I did change the hardware IDs to correspond with the changes). I can switch to tty1, but it black-screens again on startx. (I have never been able to get it to work with a fake display connected anyway; I get BAR0 CAFEDEAD in the dmesg with a fake display in either case, but this doesn't worry me since I am using a physical display.)

I expect the PCI ID of the GPU to change when you move it to another slot. Moving PCI(e) devices around can even change the PCI IDs of other PCI(e) devices (and the network device names). Did you account for those possible changes and recheck the IOMMU groups?
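
Rechecking the IOMMU groups after moving the card could look roughly like this. It is a sketch that assumes the groups are exposed under `/sys/kernel/iommu_groups/<group>/devices/` on the host; the function names are made up for this example.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group number: sorted list of PCI addresses in that group}."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # No groups directory, e.g. when the IOMMU is not enabled.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


def group_of(address, groups):
    """Return the group number containing `address`, or None if not found."""
    for number, devices in groups.items():
        if address in devices:
            return number
    return None
```

After moving the card, `group_of("0000:02:00.0", iommu_groups())` returning `None` would suggest that the address in the VM configuration no longer points at a device in any group. Printing the whole dictionary shows which other devices share a group with the GPU.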

jake000 said:
I checked the slot and it seems to be metal all the way through. It's the motherboard of an HP Z440 with an x99 chipset.
Here is the slot configuration. The logs above are from slot 2; the card is now in slot 5, with the issues described above: https://avid.secure.force.com/pkb/a.../HP-Z440-Workstation-Slot-Order-Configuration
And here is a manual database: https://support.hp.com/au-en/product/hp-z440-workstation/6978828/manuals

I cannot download or read those manuals, sorry. Maybe the PCIe slot is shared with other (in-use) devices and is therefore not fully x16. Maybe it's a BIOS setting. I have no experience with HP (or servers in general), but they tend to do things slightly differently, as there are several IOMMU issues with HP on this forum.

jake000 said:
The card doesn't feel hot. I tried another distribution (Manjaro) to test it out and removed the power management options, but I get the same issue as above in both cases.

I'm convinced that it's a bottleneck in the PCIe layout, not thermal throttling.
