GPU Passthrough only working on First Card per CPU

Discussion in 'Proxmox VE: Installation and configuration' started by JulianRendell, Jul 9, 2017.

  1. JulianRendell

    JulianRendell New Member

    Hi-

    I'm evaluating Proxmox and a few other virtualization tools to create a virtualized classroom computer.

    The goal is to have one server machine with multiple GPUs, and each GPU being the heart of a student's "workstation". I need to use Windows (for Lego and other educational products and software.)

    I have purchased an ASUS Z10PE-D16 WS with dual Xeon E5 v4 processors and 4x ASUS R7 240 GPUs.

    The block diagram for that motherboard indicates that each CPU is connected to two x16 PCIe slots and one x8 slot.

    I've installed the GPUs into the x16 slots of each CPU. Everything appears to be in its own IOMMU group without the need for any hacks.
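
    For anyone wanting to check the same thing on their host, here is a minimal Python sketch that lists IOMMU groups. It assumes the standard Linux sysfs layout under /sys/kernel/iommu_groups; the function names are made up for illustration and are not part of Proxmox.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled (or not a Linux host): nothing to report.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group_dir.name)] = sorted(d.name for d in devices_dir.iterdir())
    return groups


def shared_groups(groups):
    """Return only the groups that hold more than one device."""
    return {g: devs for g, devs in groups.items() if len(devs) > 1}


if __name__ == "__main__":
    for group, devices in sorted(list_iommu_groups().items()):
        print(f"group {group}: {' '.join(devices)}")
```

    A GPU and its HDMI audio function normally show up together in one group, so a shared group is only a concern when it also contains an unrelated device.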

    Installation went smoothly, and setting up the VMs was super easy (other than some weird UI bugs in Chrome; I seem to have to reload quite often to be able to see everything.)

    I initially tried OVMF + EFI + PCIe passthrough, but that was quite unstable and barely got through the AMD driver installation. I had mistakenly thought you should match host and VM architecture.

    I then tried OVMF + EFI + PCI passthrough, which appeared to work great. Seeing 4 VMs all running WebGL demos is awesome! But two of the four Windows VMs would crash after ~20 minutes. I found some PCIe transmit errors in dmesg, and they appeared to be related to ASPM. I disabled that in the kernel and BIOS, and I haven't seen any more since.
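
    A small sketch for confirming that ASPM really is off on the host. It assumes ASPM was disabled with the `pcie_aspm=off` kernel parameter (the post does not say which setting was used) and that the kernel exposes its policy in /sys/module/pcie_aspm/parameters/policy, where the active entry is shown in square brackets. The helper names are illustrative only.

```python
from pathlib import Path


def active_aspm_policy(policy_text):
    """Return the bracketed (active) entry from the pcie_aspm policy file."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def aspm_disabled_on_cmdline(cmdline_text):
    """True if the kernel command line contains pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline_text.split()


if __name__ == "__main__":
    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    cmdline_file = Path("/proc/cmdline")
    if policy_file.exists():
        print("active ASPM policy:", active_aspm_policy(policy_file.read_text()))
    if cmdline_file.exists():
        print("pcie_aspm=off on cmdline:", aspm_disabled_on_cmdline(cmdline_file.read_text()))
```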

    Running multiple instances of 3DMark was pretty cool and very promising!

    But I'm still seeing GPU/VM lockups (monitor powers down, once got visual garbage, once managed to recover using ctrl-alt-delete, sometimes caps-lock still works to toggle the LED on the keyboard, sometimes the VM seems to be totally locked up.) There's nothing in dmesg or syslog. Windows doesn't appear to record any issue other than me forcibly shutting down the VM.

    The pattern I think I've found is that the second VM started with a GPU on the same processor is the one that crashes. I.e., I can have two stable VMs if they're passed GPUs connected to different processors. If they're passed cards connected to the same processor, the second one started is unstable.

    The VMs are set up with CPU type as Host.
    I enabled the NUMA option. But I found the CPU configuration confusing (sockets/cores/vCPUs); I think I ended up setting 2 sockets, 4 cores (8 total) and 8 vCPUs ... to get Windows to show 4 (virtual) cores.
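
    As a rough aid for the sockets/cores/vCPUs arithmetic: as I understand the Proxmox settings, `cores` is counted per socket, so the guest can have at most sockets x cores CPUs, and `vcpus` only caps how many of those are plugged in at boot. The sketch below just does that arithmetic; the function names are hypothetical and it does not talk to Proxmox.

```python
def guest_cpu_count(sockets, cores, vcpus=None):
    """CPUs the guest starts with: sockets * cores, optionally capped by vcpus."""
    total = sockets * cores
    if vcpus is None:
        return total
    if not 1 <= vcpus <= total:
        raise ValueError(f"vcpus must be between 1 and {total}")
    return vcpus


def layouts_for(target):
    """All (sockets, cores-per-socket) pairs whose product equals target."""
    return [(s, target // s) for s in range(1, target + 1) if target % s == 0]


print(guest_cpu_count(2, 4, 8))  # the 2 sockets x 4 cores, 8 vCPUs setup above
print(layouts_for(4))            # ways to present exactly 4 CPUs to the guest
```

    Under that reading, presenting exactly 4 CPUs would mean 1 socket x 4 cores or 2 sockets x 2 cores, rather than 2 x 4.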

    I did install the balloon driver in Windows, but the VM is set for fixed memory allocation.

    Should the NUMA option be enabled? Should I pin the VM to the socket connected to its GPU, and if so, how?
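
    To work out which socket a given GPU hangs off, a sketch like the following may help. It assumes the usual Linux sysfs files (`numa_node` under each PCI device and `cpulist` under each NUMA node); the function names and the example PCI address are placeholders, not values from this machine.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def device_numa_node(pci_addr, sysfs="/sys"):
    """NUMA node a PCI device is attached to (-1 if the kernel does not know)."""
    node_file = Path(sysfs) / "bus/pci/devices" / pci_addr / "numa_node"
    return int(node_file.read_text().strip())


def node_cpus(node, sysfs="/sys"):
    """Host CPU ids that belong to the given NUMA node."""
    cpulist = Path(sysfs) / "devices/system/node" / f"node{node}" / "cpulist"
    return parse_cpulist(cpulist.read_text())


# Example use on the host (placeholder address):
#   node = device_numa_node("0000:02:00.0")
#   print(node, node_cpus(node))
```

    The resulting CPU list is what one would hand to a pinning tool; one possible approach is running `taskset -cp` against the VM's KVM process, though I have not confirmed that this is the recommended way on Proxmox.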

    Any other suggestions for what to try?

    Is there a way to pay for a single support issue? Or is the only way to sign up for a month and then cancel if there is no solution?

    Thanks in advance,

    Julian
     