[SOLVED] iGD/GPU Passthrough with Intel - GPU HANG: ecode, hang on vecs0

Lutris

Member
Apr 17, 2020
Proxmox VE 6.2-4, SMP PVE 5.4.41-1
I've passed through my Intel HD Graphics 630 to a VM running Ubuntu 18.04.2 (5.3.0-51-generic).
I use the Quick Sync feature on the iGD for Plex transcoding. At first it seemed to work as it should, but a friend mentioned that his streams had stopped and glitched a couple of times.
Then I tested it myself under stress and quickly realized that it buckles as soon as there is a bit of load.
I was monitoring dmesg while testing, and I keep getting GPU HANG followed by recovery timed out. When that happens everything stops; at one point it even froze the VM.

Code:
[mai19 17:35] i915 0000:00:10.0: GPU HANG: ecode 9:0:0x00000000, hang on vecs0
[  +0,001011] i915 0000:00:10.0: Resetting vecs0 for hang on vecs0
[  +7,981283] i915 0000:00:10.0: Resetting vecs0 for hang on vecs0
[  +1,986848] i915 0000:00:10.0: GPU recovery timed out, cancelling all in-flight rendering.

I mentioned this over at the Plex forum, and they said the same thing my googling turned up: the i915 kernel driver is having trouble.
There seem to be lots of threads about this problem if I google GPU HANG: ecode 9:0:0x00000000, hang on vecs0, some mentioning that it started somewhere after kernel 5.3: https://bbs.archlinux.org/viewtopic.php?id=250765

Does anyone know if there is something I can do about this? I've tried a couple of the recommended i915 module options, but none seem to fix it.

What sort of logs would be needed to do some troubleshooting on this?
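
For anyone landing here with the same hang, a minimal sketch of what could be collected inside the guest follows. It assumes the passed-through GPU shows up as card0 and that the i915 error state is readable at /sys/class/drm/card0/error; the function names and the output directory are placeholders made up for this example, not anything from this thread.

```shell
#!/bin/sh
# Hypothetical helper: keep only the i915 hang/reset lines
# from a kernel log supplied on stdin.
filter_i915_hangs() {
    grep -E 'i915 .*(GPU HANG|Resetting|GPU recovery timed out)'
}

# Hypothetical helper: save the hang lines, the kernel version and
# (if present) the i915 error state into one directory.
collect_i915_logs() {
    out_dir="${1:-./i915-logs}"
    mkdir -p "$out_dir"
    dmesg | filter_i915_hangs > "$out_dir/hangs.txt"
    uname -r > "$out_dir/kernel.txt"
    # card0 is an assumption; adjust if the iGD is a different card.
    if [ -r /sys/class/drm/card0/error ]; then
        cat /sys/class/drm/card0/error > "$out_dir/i915-error-state.txt"
    fi
}
```

Running `collect_i915_logs /tmp/i915-logs` as root right after a hang should leave files that can be attached to a bug report or forum post. dmesg may need root depending on the distribution's settings.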

Edit: Here is some info about the VM as well
Code:
root@prox:~# cat /etc/pve/qemu-server/103.conf
agent: 1
bios: seabios
bootdisk: scsi0
cores: 4
hostpci0: 00:02
memory: 9032
name: tools
net0: virtio=A6:DF:E3:89:A1:F4,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: data:vm-103-disk-0,discard=on,size=62G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=6187827f-2b2c-4c32-b07c-bbc41aa8b6ea
sockets: 1
vmgenid: d464ca53-e753-4886-94db-90b0a86bcc1c
 
I seem to have solved it. Passing it through as a mediated device (GVT-g) instead of assigning the whole GPU seems to work way better.
It led to some other problems, with DRM errors and the GPU being reported as wedged in the log. This seems to have been fixed in kernel 5.5, so I updated the VM to that kernel and have now been running some tests with HW transcoding in Plex. 30 minutes into testing and I have yet to see any sign of the previous problem. Fingers crossed.
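
As a rough sketch of the first step on the host, the snippet below lists which mediated device types the iGD offers once GVT-g is enabled. It assumes the iGD sits at PCI address 0000:00:02.0 (matching the hostpci0 line in the VM config above); the function name and the second argument (a sysfs root override, only there to make the function testable) are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the mdev types offered by a PCI device.
# $1: PCI address (assumed default: the iGD at 0000:00:02.0)
# $2: sysfs root (defaults to /sys; override only for testing)
list_mdev_types() {
    pci_addr="${1:-0000:00:02.0}"
    sysfs_root="${2:-/sys}"
    types_dir="$sysfs_root/bus/pci/devices/$pci_addr/mdev_supported_types"
    if [ ! -d "$types_dir" ]; then
        echo "no mdev types found for $pci_addr (is GVT-g enabled on the host?)" >&2
        return 1
    fi
    for type_dir in "$types_dir"/*; do
        [ -d "$type_dir" ] || continue
        basename "$type_dir"
    done
}
```

One of the printed type names would then go into the VM config in the form `hostpci0: 00:02.0,mdev=<type>`. The available type names depend on the hardware, so none is filled in here.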
 
What modules and grub cmdline settings are you using? I have gvt-g working, but I get random gvt page fault errors under load with Windows VMs that take down the entire host and all the gvt VMs running on it. Is it that much more stable with Linux VMs?
 
So far I haven't had any major issues with it, and I have a lot of family members using the transcode feature (Quick Sync) with Plex on a daily basis.

The only extra module option I'm currently setting is drm.debug. The DRM errors in the log are the only problem I'm currently having with the GPU, but they haven't caused any instability yet.

GRUB on the host:
Code:
root@prox:~# cat /etc/default/grub
......
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on i915.enable_gvt=1 drm.debug=0"
GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"
......
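
A small sketch for checking that the host actually booted with these parameters is below. After editing /etc/default/grub the change only takes effect once `update-grub` has been run and the host rebooted. The function name is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every given parameter appears
# as a whole word in the kernel command line string passed as $1.
cmdline_has() {
    cmdline="$1"
    shift
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;
            *) echo "missing: $param" >&2; return 1 ;;
        esac
    done
}
```

On the host this could be called as `cmdline_has "$(cat /proc/cmdline)" intel_iommu=on i915.enable_gvt=1`, which prints the first missing parameter and returns non-zero if one is absent.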
 
