GPU Passthrough only working on First Card per CPU

JulianRendell

New Member
Jul 9, 2017
Hi-

I'm evaluating Proxmox and a few other virtualization tools to create a virtualized classroom computer.

The goal is to have one server machine with multiple GPUs, with each GPU being the heart of a student's "workstation". I need to use Windows (for Lego and other educational products and software).

I have purchased an Asus Z10PE-D16 WS with dual Xeon E5 v4 processors, and 4x Asus R7 240 GPUs.

The block diagram for that motherboard indicates that each CPU is connected to two x16 PCIe slots and one x8 slot.

I've installed the GPUs into the x16 slots of each CPU. Each one appears to be in its own IOMMU group without the need for any hacks.
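
For what it's worth, I double-checked the grouping with a quick Python sketch along these lines (it just walks the standard sysfs layout under /sys/kernel/iommu_groups; treat it as illustrative rather than anything Proxmox-specific):

#!/usr/bin/env python3
# Sketch: list every IOMMU group and the PCI devices in it,
# by walking the standard sysfs layout.
import os

GROUPS = "/sys/kernel/iommu_groups"

for group in sorted(os.listdir(GROUPS), key=int):
    devices_dir = os.path.join(GROUPS, group, "devices")
    for dev in sorted(os.listdir(devices_dir)):
        print("group %s: %s" % (group, dev))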

Installation went smoothly, and setting up the VMs was super easy (other than some weird UI bugs in Chrome; I seem to have to reload quite often to be able to see everything).

I initially tried OVMF + EFI + PCIe passthrough, but that was quite unstable, barely getting through the AMD driver installation. I had mistakenly assumed you should match the host and VM architecture.

I then tried OVMF + EFI + PCI passthrough, which appeared to work great. Seeing 4 VMs all running WebGL demos is awesome! But two of the four Windows VMs would crash after ~20 minutes. I found some PCIe transmit errors in dmesg that appeared to be related to ASPM. I disabled that in the kernel and the BIOS, and I haven't seen any more of those errors.
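
In case it's useful, I verified the setting afterwards with a tiny sketch like the one below; it only reads the kernel's pcie_aspm module parameter (the active policy is the entry shown in brackets), so it's a quick sanity check rather than proof that every link has ASPM off:

#!/usr/bin/env python3
# Sketch: print the kernel's current ASPM policy.
# The active policy is the entry in [brackets],
# e.g. "default [performance] powersave".
with open("/sys/module/pcie_aspm/parameters/policy") as f:
    print("ASPM policy:", f.read().strip())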

Running multiple instances of 3DMark was pretty cool and very promising!

But I'm still seeing GPU/VM lockups (the monitor powers down; once I got visual garbage; once I managed to recover using Ctrl-Alt-Delete; sometimes Caps Lock still toggles the keyboard LED; sometimes the VM seems to be totally locked up). There's nothing in dmesg or syslog, and Windows doesn't appear to record any issue other than me forcibly shutting down the VM.

The pattern I think I've found is that the second VM started with a GPU on the same processor is the one that will crash. I.e. I can have two stable VMs if they're passed GPUs connected to different processors; if they're passed cards connected to the same processor, the second one started is unstable.
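
To work out which card actually hangs off which processor, I read the numa_node attribute for each VGA-class device in sysfs; something like this rough Python sketch (0x0300 is the standard VGA class code, and -1 means the kernel couldn't determine the node):

#!/usr/bin/env python3
# Sketch: show which NUMA node (i.e. which physical CPU) each
# VGA-class PCI device is attached to, using standard sysfs files.
import glob

for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
    with open(dev + "/class") as f:
        pci_class = f.read().strip()
    if not pci_class.startswith("0x0300"):  # 0x0300xx = VGA-compatible controller
        continue
    with open(dev + "/numa_node") as f:
        node = f.read().strip()
    print("%s -> NUMA node %s" % (dev.split("/")[-1], node))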

The VMs are set up with CPU type as Host.
I enabled the NUMA option, but I found the CPU configuration confusing (sockets/cores/vCPUs); I think I ended up setting 2 sockets, 4 cores (8 total) and 8 vCPUs to get Windows to show 4 (virtual) cores.

I did install the balloon driver in Windows, but the VMs are set for a fixed memory allocation.

Should I enable the NUMA option? Should I pin each VM to the socket its GPU is connected to, and if so, how?
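
If manual pinning is the way to go, would something along the lines of the sketch below be reasonable? I'm assuming Proxmox writes the QEMU PID to /var/run/qemu-server/<vmid>.pid and that each node's CPU list is in /sys/devices/system/node/node<N>/cpulist; the VM ID and node number are just placeholders:

#!/usr/bin/env python3
# Rough sketch of pinning a running VM's QEMU process to one socket.
# Assumption: the PID is in /var/run/qemu-server/<vmid>.pid and the
# node's CPU list is in /sys/devices/system/node/node<N>/cpulist.
import subprocess
import sys

vmid = sys.argv[1]   # e.g. "101"
node = sys.argv[2]   # e.g. "0" for the first socket

with open("/var/run/qemu-server/%s.pid" % vmid) as f:
    pid = f.read().strip()

with open("/sys/devices/system/node/node%s/cpulist" % node) as f:
    cpulist = f.read().strip()

# Pin the whole QEMU process (all vCPU threads) to that node's cores.
subprocess.check_call(["taskset", "-a", "-c", "-p", cpulist, pid])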

Any other suggestions for what to try?

Is there a way to pay for a single support issue? Or is the only way to sign up for a month and then cancel if there is no solution?

Thanks in advance,

Julian
 
