Vega Frontier passthrough help

Slidspitfire

Member
Mar 30, 2020
Dear all,

First of all, since this is my first post on this awesome forum, let me thank the whole community for pushing so hard on a product like Proxmox: in my opinion the real state of the art among hypervisors, and the only solution that truly adheres to Linux, providing plenty of options and customizability.

I am writing to ask for your help in setting up my system.
I am currently having some problems setting up GPU passthrough.
The GPUs I am trying to pass are two AMD Vega Frontier Edition cards, which I have successfully passed through under other hypervisors (despite suffering from the reset bug).

Ideally I'd like to pass both GPUs to either a Windows 10 VM, or to a CentOS 7 one.
I can accept passing each of them to a different machine, though.

I had this setup lying around for quite some time, but I am Italian and I am working from home right now because of the Covid-19 emergency.
For this reason I have some spare time (my usual 3-hour commute each day...) to try to finalize this setup, and I'd like to get operational ASAP to take part in the Folding@Home Covid-19 research.

At the moment I have (I think) successfully isolated the GPUs' IOMMU groups and managed to get them handled by the vfio driver.
My settings follow.

lspci output for the GPUs and their integrated audio controllers:
Bash:
root@pve:~# lspci -nnk -d 1002:6863
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XTX [Radeon Vega Frontier Edition] [1002:6863]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XTX [Radeon Vega Frontier Edition] [1002:6b76]
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
06:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XTX [Radeon Vega Frontier Edition] [1002:6863]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XTX [Radeon Vega Frontier Edition] [1002:6b76]
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu

root@pve:~# lspci -nnk -d 1002:aaf8
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
06:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
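
For reference, a generic way to double-check the grouping is to walk sysfs and print every device with its IOMMU group (nothing Proxmox-specific here):
Bash:
# print every PCI device together with the IOMMU group it belongs to
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    group=$(basename "$(dirname "$(dirname "$dev")")")
    printf 'IOMMU group %s: ' "$group"
    lspci -nns "$(basename "$dev")"
done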

vfio.conf:
Bash:
root@pve:~# cat /etc/modprobe.d/vfio.conf
softdep radeon pre: vfio-pci
softdep amdgpu pre: vfio-pci
softdep nouveau pre: vfio-pci
softdep drm pre: vfio-pci
options vfio-pci ids=1002:6863,1002:aaf8 disable_vga=1

grub config:
Bash:
root@pve:~# cat /etc/default/grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
GRUB_CMDLINE_LINUX=""
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# Disable os-prober, it might add menu entries for each guest
GRUB_DISABLE_OS_PROBER=true


# Disable generation of recovery mode menu entries
GRUB_DISABLE_RECOVERY="true"
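
For reference, a quick sanity check that the IOMMU is actually active after a reboot (generic dmesg grep, the exact wording differs between kernel versions):
Bash:
dmesg | grep -e DMAR -e IOMMU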

blacklist.conf (same for pve-blacklist.conf):
Bash:
root@pve:~# cat /etc/modprobe.d/blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE

# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist amdgpu
blacklist radeon
blacklist nouveau

After applying these settings I both regenerated the initramfs and updated the GRUB configuration.
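
For completeness, these are the standard commands for that step, plus a quick post-reboot check that the blacklisted drivers really stay unloaded:
Bash:
update-initramfs -u -k all   # rebuild the initramfs for all installed kernels
update-grub                  # regenerate /boot/grub/grub.cfg
reboot
# after the reboot, none of the blacklisted GPU drivers should show up:
lsmod | grep -E 'amdgpu|radeon|nouveau'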

I managed to make the Windows machine boot with SPICE VGA plus the Vega FE passed through, install the AMD drivers, and then boot again using only the Vega GPU, but it hangs while loading the Windows UI. The QEMU settings I use are:

Code:
bios: ovmf
bootdisk: sata0
cores: 16
efidisk0: local-lvm:vm-101-disk-0,size=128K
hostpci0: 03:00.0,pcie=1,x-vga=1
hostpci1: 03:00.1,pcie=1
ide2: freenas-isos:iso/Win-10.iso,media=cdrom
machine: q35
memory: 27000
name: WinFolding
net0: e1000=5A:D0:1A:31:6B:8A,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
sata0: freenas-isos:101/vm-101-disk-0.qcow2,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=8d50dab8-7f06-40aa-8d80-3f94791bd495
sockets: 1
vga: none
vmgenid: e7b11fea-e225-467a-97c3-55d20651c843

The system is a dual Xeon E5 machine based on an Asus Z10PE-D16 WS. Each GPU sits in a PCIe slot linked directly to one of the two CPUs, but the same issue happens with both GPUs, which makes me doubt that this is related to PCIe lane splitting.

Is there anybody capable of helping me troubleshoot this weird behavior?
I am a fairly skilled sysadmin, so don't be afraid of being "too technical" in your answers. :D

Thank you in advance,

Slid
 
Hello again,

I have a little update.
I just tried the following options and the VM seems to work much better, booting with full display output (the AMD drivers are still failing, but let's solve one problem at a time):
  • In the GRUB configuration: GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1 video=efifb:off pcie_acs_override=downstream"
  • In /etc/modprobe.d/iommu_unsafe_interrupts.conf: "options vfio_iommu_type1 allow_unsafe_interrupts=1" (a quick check on whether this is really needed is sketched below)
  • NUMA enabled in CPU configuration (this seems to be the real deal, the ONE option missing in my multi CPU architecture)
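As a side note, whether allow_unsafe_interrupts is really needed can be checked on the host: if interrupt remapping is already enabled, the option should be unnecessary (generic check):
Bash:
# if this prints something like "Enabled IRQ remapping in x2apic mode",
# vfio_iommu_type1.allow_unsafe_interrupts=1 should not be required
dmesg | grep -i remapping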
I'll keep you posted on further success.

Best

Slid
 
mhmm nothing too obvious from the posted info besides:
Code:
hostpci0: 03:00.0,pcie=1,x-vga=1
hostpci1: 03:00.1,pcie=1

i would write it as:

Code:
hostpci0: 03:00,pcie=1,x-vga=1

this way it gets passed through as a single device with multiple functions (like the real hardware) instead of 2 pcie devices
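
e.g. via the CLI (your VMID seems to be 101):

Code:
qm set 101 -hostpci0 03:00,pcie=1,x-vga=1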

also the logs (host/guest) would be interesting
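
for the host side, something like this around the time of the hang would already help:

Code:
dmesg -T | tail -n 200        # recent kernel messages
journalctl -b > journal.txt   # full journal of the current boot

plus whatever the Windows Event Viewer shows on the guest side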
 
Hi!

Thanks for the heads up.
Gonna try the full root passthrough immediately.
With the additional settings I was able to get a bit further in the boot, but the Radeon drivers somehow broke.
Now I am working on a new setup for good measure.

I'll keep you posted,

Slid
 
Hello,

Finally I managed to get the machine working and booting with the GPU recognized.
As far as I understand, the issue was caused by the NUMA nature of my system: I was passing around 27 GB of RAM to the VM while passing only 4 cores.
This means that the RAM necessarily spanned both CPUs' memory controllers.
This caused the faulty boot of the VM, since on this CPU architecture the PCIe and RAM controllers are tightly coupled.
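
For reference, the host topology can be inspected like this (numactl may need to be installed first):

Bash:
apt install numactl    # if not already present
numactl --hardware     # nodes, CPUs and memory per node
lscpu | grep -i numa   # quick summary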
At the moment I am running the following configuration:

Bash:
bios: ovmf
bootdisk: sata0
cores: 16
efidisk0: local-lvm:vm-103-disk-0,size=4M
hostpci0: 03:00.0,pcie=1,x-vga=1
hostpci1: 03:00.1,pcie=1
hostpci2: 06:00.0,pcie=1
hostpci3: 06:00.1,pcie=1
ide2: freenas-isos:iso/Win-10.iso,media=cdrom
machine: q35
memory: 16196
name: Windows10
net0: e1000=0E:08:05:CC:6C:11,bridge=vmbr0,firewall=1
numa: 1
ostype: win10
sata0: freenas-isos:101/base-101-disk-0.qcow2/103/vm-103-disk-0.qcow2,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=e2541731-21e7-4fbb-8b3e-e28a69499f40
sockets: 1
vga: none
vmgenid: 087364b6-0fbb-44c3-8434-ddcc726c12b7

so I am passing 16 cores and around 16 GB of RAM.
As you might notice, I am still passing the GPU and the audio device separately, and I have to admit that the audio device is not identified correctly in the guest.
I will try passing all functions as a single device tomorrow, and I'd also like to create a virtual PCIe root to attach the GPUs to, in order to improve performance, as found, for example, here.
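
On the same topic, I am also thinking about pinning the guest explicitly to a single host node via the numaN option of the VM config (a sketch based on the qm.conf man page, not tested yet; the values match my current config):

Code:
# in /etc/pve/qemu-server/103.conf
numa: 1
numa0: cpus=0-15,hostnodes=0,memory=16196,policy=bind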

Now the bad news.
I just started Folding, and everything worked for a little while.
The GPU power states changed rapidly and were displayed correctly via the backplate LEDs the Vega Frontier cards are equipped with.
However, after a while, the GPU fans started spinning at 100% and everything seemed completely stuck, at least inside the VM.
I also noticed that the power-control LEDs went from blue to orange right before the GPU failure.
The same issue happened under ESXi as well, and I still don't understand what causes it.
I have observed this behavior with generic tasks drawing a lot of power from the GPUs.

Do you have any tips for this (or should I open a different thread for this specific issue)?

Thanks!

Slid
 
and I'd also like to create a virtual PCIe root to attach the GPUs to, in order to improve performance, as found, for example, here.
we do that by default with q35

I have observed this behavior with generic tasks drawing a lot of power from the GPUs.

Do you have any tips for this (or should I open a different thread for this specific issue)?
maybe your psu is either undersized or faulty?
 
Hi,

OK for the PCIe root, I'll just try to pass all the functions then (sketch below).
The PSU is a 2 kW Leadex Platinum; I don't think it is undersized or faulty, since it has barely been used.
I think the most likely culprit here is a driver failure: tonight I had the same lockup even while not using the GPUs.
Having the audio device not recognized correctly might be a real problem for the drivers trying to address the full GPU.
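
i.e. the hostpciN lines would collapse to something like:

Code:
hostpci0: 03:00,pcie=1,x-vga=1
hostpci1: 06:00,pcie=1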

If it keeps failing I'm gonna install my two Quadro P4000 instead and move the Vega GPU to another machine.

Thanks again!

Slid
 
I just tried passing the whole PCIe device (all functions) and I still get random lockups of both the guest and the host.
It kept running for more than an hour, then locked up; after a reboot it locked up again 10 minutes later.
I have disabled ULPS (Ultra Low Power State) from the Windows registry just for good measure, see here.

I am running out of ideas, since I even monitored the GPU temperatures without observing any problem (less than 80 degrees Celsius).

I am afraid the P4000s I have in the other system will soon replace these two AMD GPUs...

That's disappointing!

By the way, which logs would you like to see? I can provide them at the next lockup...

Cheers,

Slid
 
