GPU passthrough: Nvidia errors Xid 44, 38 (was: keep virtual VGA)?

Telencephalon

Hi! I have a server with an NVIDIA Titan V GPU. I've been successful in setting up GPU passthrough by following the recipe in the PVE wiki: I can see and use the GPU in my VM. I'd like to use this GPU for math computations only; there is no monitor attached. I would also like to allow users to log in via xrdp. All of this currently works. Now here's my problem: as soon as I enable the passed-through GPU in the VM config, the VM no longer sees the standard virtual VGA card that is specified in the VM config file. It seems that the virtual VGA card gets disabled as soon as the x-vga=on option is given. This has two effects: (1) the noVNC console in the Proxmox UI can't connect anymore, and (2) the xrdp X server uses my math GPU.

So, is there a way to keep the virtual VGA card enabled while at the same time passing through a physical GPU? I've tried setting x-vga=off: in that case the virtual VGA reappears on the VM's PCIe bus, but I can't use the physical GPU anymore (it still shows up on the PCIe bus in the VM, but CUDA test programs no longer run and nvidia-smi hangs forever, so the nvidia driver is unhappy when x-vga=off). Thanks for your help.

Relevant lines from the VM config:
Code:
args: -machine pc,max-ram-below-4g=1G
bios: ovmf
efidisk0: local-lvm:vm-101-disk-1,size=128K
hostpci0: 18:00,pcie=1,x-vga=on
machine: q35
vga: std,memory=64

pveversion: pve-manager/5.2-11/13c2da63 (running kernel: 4.15.18-9-pve)
 
some things here:

pveversion: pve-manager/5.2-11/13c2da63 (running kernel: 4.15.18-9-pve)

please upgrade to the current 5.3

args: -machine pc,max-ram-below-4g=1G
machine: q35

please use only one machine type

if you want to set those options with q35, use '-machine q35,max-ram-below-4g=1G'

I've tried setting x-vga=off: In that case, the virtual VGA reappears in the VM PCIe bus, but I can't use the physical GPU anymore (it is still on the PCIe bus in the VM, but I can't run CUDA test programs anymore.
this is currently the only way (with our config)

with ovmf, x-vga does nothing except prevent the virtual vga card from being added
so i would check why the card does not work when a second graphics card is present

alternatively, you could try to add a vga card in the args with '-vga std'
with a current pve, the 'args' are appended at the very end, so this should add the vga card after the nvidia one
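for example, the relevant lines could then look roughly like this (just a sketch, not tested here, combining the machine-type fix and the extra vga card; keep the rest of your config and adapt the memory split to your hardware):
Code:
args: -machine q35,max-ram-below-4g=1G -vga std
machine: q35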
 
Thanks, dcsapak, for your very helpful suggestions. Based on them, I dug deeper and realized that the issues I was seeing were actually not due to the presence or absence of the virtual VGA card -- I have them regardless of that. There's something wrong in my passthrough config: On the superficial level, it seems to work (I can execute nvidia-smi and see the GPU), but all computations fail (they just stall and never return or do anything). In the syslog I'm getting error messages such as these:

Code:
NVRM: Xid (PCI:0000:01:00): 44, Ch 00000001, engmask 00000101, intr 00000000
NVRM: Xid (PCI:0000:01:00): 38, 0000 00000000 00000000 00000000 00000250 00000000

Xid 38 is "Driver firmware error". Xid 44 is "Graphics Engine fault during context switch".

Hiding the KVM hypervisor signature from the guest by setting -cpu host,kvm=off (which has been reported to be necessary to fool the virtualization detection used by the nvidia drivers) didn't help.
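For reference, I set this via the args line of the VM config, roughly like the following (reproduced from memory, so the exact line may differ):
Code:
args: -machine q35,max-ram-below-4g=1G -cpu host,kvm=off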

I'll have to debug this some more; any helpful suggestions would obviously be greatly appreciated. My GPU is rather new, I couldn't find reports about this specific model.
 
mhmm.. do you run linux in the guest?
maybe the driver in the guest is too old?

does the card work directly on hardware (with the same os as in the guest)?
 
Hi Dominik, I finally found the time to work on my passthrough issue. No solution so far, I'm afraid. To answer your questions: Yes, I'm running Linux (Ubuntu 18.04) as the guest, with all packages up-to-date. I installed the same Ubuntu onto a USB stick and booted the host machine from the stick. I was able to install CUDA and test my GPUs: They run just fine in this situation (so, as you suggested, running directly on hardware with the same os as in the guest). The NVIDIA driver in my guest is the latest 418.40.04, and CUDA is also at the latest 10.1. I also upgraded my host to the latest proxmox (5.4-5) in the meantime.

So, it's not the hardware, it's not the driver version, it is not the fault of the guest OS. So what's left is the passthrough config of the host OS, and the guest VM config. Could you confirm that the following steps on the host OS should be sufficient for passthrough to work (Intel Xeon CPU, interrupt remapping supported, using "GPU OVMF PCI EXPRESS PASSTHROUGH" strategy as described in the proxmox wiki)?
  1. in /etc/default/grub, change
    GRUB_CMDLINE_LINUX_DEFAULT="quiet"
    to
    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
    and run update-grub
    (after reboot, I ran "dmesg | grep -e DMAR -e IOMMU" and got some good-looking output, so IOMMU seems to work)
  2. in /etc/modprobe.d/blacklist.conf, add these lines:
    blacklist nvidia
    blacklist nouveau
    blacklist radeon
  3. in /etc/modules, add these lines:
    vfio
    vfio_iommu_type1
    vfio_pci
    vfio_virqfd
  4. in /etc/modprobe.d/vfio.conf, add the line:
    options vfio-pci ids=[vendor device ids] disable_vga=1
Anything major I'm forgetting here? Any further diagnostic steps I could take on the host to make sure that it is not interfering?
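For what it's worth, these are the host-side checks I know of so far (generic commands, with 18:00 being my Titan V's address; let me know if there are better ones):
Code:
# IOMMU / DMAR enabled?
dmesg | grep -e DMAR -e IOMMU
# is the GPU in its own IOMMU group?
find /sys/kernel/iommu_groups/ -type l
# is the GPU bound to vfio-pci ("Kernel driver in use: vfio-pci")?
lspci -nnk -s 18:00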

To remind you of the symptoms of my problem: the passed-through GPU shows up in the guest VM, both in lspci and in nvidia-smi, but CUDA demo apps hang forever as soon as they start accessing the GPU, and can't be killed. A host reboot changes the behavior a bit (sometimes nvidia-smi itself hangs, sometimes it doesn't), but doesn't fix the problem.

Any suggestions appreciated!
 
So, is there a way to keep the virtual VGA card enabled while at the same time passing through a physical GPU?
Change
Code:
hostpci0: 18:00,pcie=1,x-vga=on
to
Code:
hostpci0: 18:00,pcie=1
Connect the physical GPU to a monitor or an HDMI dummy plug, or just buy a cheap HDMI-to-VGA adapter.
Done
 
Anything major I'm forgetting here? Any further diagnostic steps I could take on the host to make sure that it is not interfering?
Have you tried testing with a monitor plugged in?

Many functions in nvidia-smi will not work correctly unless the system is fooled into thinking that a monitor is present (a common issue crypto miners run into). You can set up Xorg to get around this, but the easiest way is to use a dummy plug.
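If you do want to try the Xorg route instead of a dummy plug, the usual approach is something along these lines in the guest's /etc/X11/xorg.conf (untested sketch; the BusID must match the card's address inside the guest, which looks like 01:00.0 in your logs):
Code:
Section "Device"
    Identifier "TitanV"
    Driver     "nvidia"
    BusID      "PCI:1:0:0"
    # let the nvidia driver start X without a monitor attached
    Option     "AllowEmptyInitialConfiguration" "true"
EndSection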
 
Thanks for your suggestions! I tried running it with a monitor plugged in, but this makes no difference. I also tried the latest CentOS 7 instead of Ubuntu 18.04 as the guest OS, but that works even less well: nvidia-smi can't even get a device handle in CentOS. In the Ubuntu guest I played around with a number of settings (enabled X2APIC in the BIOS for latest-generation interrupt remapping, pinned the VM's PID to the NUMA node that the GPU is connected to, removed the other GPUs from the system, removed "disable_vga=1" from vfio.conf, ...). One of those changes (not sure which one) made the Xid 44, 38 errors disappear from the syslog: the nvidia driver now loads fine on boot, I don't see any nvidia-related error messages anymore, and the output of nvidia-smi looks great. However, as soon as I access the GPU, for example by running the "bandwidthTest" demo that comes with CUDA, the demo just hangs forever and can't be killed. In the syslog, four or five messages like the following appear:
Code:
perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79750

Interestingly, after a fresh restart of the host system, the demos initially run normally and succeed, but after a few more runs they suddenly start freezing again and the "interrupt took too long" messages appear (even with a screen plugged in).

So, basically, I'm out of clues. I'm afraid I'll have to give up on this and go for a non-virtualized OS on this node.

Thanks again!
 
