GPU passthrough with Dual Quadro (Code 43)

StefanR

New Member
Feb 18, 2016
Hi there,

I have an issue getting two Quadros running in separate VMs. On one Quadro I get Code 43.

First, a heads-up on what I did, what was working, and what no longer works after trying various fixes. I fixed it into the ground, hehe.

Hardware:
Fujitsu CELSIUS R920Power (without Monitor attached)
Quadro K4000
Quadro FX1800

I ran a Proxmox setup on that machine for more than 2 years on version Proxmox V3.0-20/0428106c, QEMU 1.7. GPU passthrough with the K4000 was working well.
I needed another GPU for another VM, so I plugged in an FX570. I did not get it to work. I thought the NVIDIA driver was the blocking point, as this card is not on NVIDIA's passthrough list. Here I probably made my first mistake (never change a running system) and did a "dist-upgrade" to get the "kvm=off" option for the CPU in SeaBIOS. That did not help either.
Finally I got an FX1800, which is officially supported by NVIDIA. Et voilà, working: I could virtualize both Quadros. BUT VMs under heavy load became unstable. I even upgraded to a 3.1xxxx pve-kernel. That did not work out, so I downgraded back to 2.6xxxx. I figured it might have something to do with caching, as the VMs hung on disk writes but did not crash. I disabled all caches in the VMs and passed the FX1800 through to a Windows machine: working fine. I did not test the K4000 again at that point.
After two weeks I needed the K4000 in a Linux VM, but I could not install it. As soon as X tried to access the driver module, the VM crashed. I was looking for the problem in the VM, because the card had been working in Windows VMs before.

I removed the Card and tested it outside the Server. All good.

I plugged it back in, physically removed the FX1800, and passed the K4000 through to a Windows VM: Code 43 (what???). That had been working before the kernel up-/downgrade.

I swapped PCI slots, everything. Another strange thing happened: when I plugged the FX1800 back in and installed the driver in Windows, I got a BSOD. After I enabled writeback cache on virtio again, the BSOD went away and the card was recognised, all good. If I now disable the cache, everything still works, even after shutdown/power-up of the VM. But only for the FX1800; the K4000 still throws Code 43.

I also tried different Windows VMs.

Sorry for the long story at the beginning. I just want to make sure you understand that the K4000 was working, even after the "dist-upgrade".

After that, though, I tried a lot of things to bring stability back to the VMs before I narrowed it down to the cache issue.

So this is my setup now:

Code:
proxmox-ve-2.6.32: 3.4-166 (running kernel: 2.6.32-43-pve)
pve-manager: 3.4-11 (running version: 3.4-11/6502936f)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-43-pve: 2.6.32-166
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-19
qemu-server: 3.4-6
pve-firmware: 1.1-5
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-34
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-14
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

lspci -nn | grep "NVIDIA"
Code:
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation G94GL [Quadro FX 1800] [10de:0638] (rev a1)
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK106GL [Quadro K4000] [10de:11fa] (rev a1)
84:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0e0b] (rev a1)

dmesg | grep "IOMMU"
Code:
Intel-IOMMU: enabled
dmar: IOMMU 0: reg_base_addr fbffe000 ver 1:0 cap d2078c106f0462 ecap f020fe
dmar: IOMMU 1: reg_base_addr bfffc000 ver 1:0 cap d2078c106f0462 ecap f020fe
IOMMU 0xfbffe000: using Queued invalidation
IOMMU 0xbfffc000: using Queued invalidation
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device 0000:00:1d.0 [0x3cf56000 - 0x3cf63000]
IOMMU: Setting identity map for device 0000:00:1a.0 [0x3cf56000 - 0x3cf63000]
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000]

dmesg | grep "claimed"
Code:
pci-stub 0000:02:00.0: claimed by stub
pci-stub 0000:84:00.0: claimed by stub
pci-stub 0000:84:00.1: claimed by stub
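
The driver bound to a device can also be read from sysfs instead of grepping dmesg. The following is only a minimal sketch, assuming the usual /sys/bus/pci/devices layout; the function name pci_driver_of and its optional second argument are made up for this example and are not part of any tool:

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, or "none".
# $1: full PCI address, e.g. 0000:84:00.0
# $2: sysfs device root (hypothetical parameter, only there so the
#     function can be tried against a fake tree; defaults to the real path)
pci_driver_of() {
    addr="$1"
    root="${2:-/sys/bus/pci/devices}"
    link="$root/$addr/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

With the devices above, something like `pci_driver_of 0000:84:00.0` would be expected to print the name of the stub driver if the claim really took effect.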

seabios config of Windows 7 VM
Code:
boot: cdn
bootdisk: virtio3
cores: 8
hostpci0: 84:00.0
ide2: none,media=cdrom
memory: 32768
name: TEMP
net0: e1000=3E:FB:4F:27:6E:45,bridge=vmbr0
numa: 0
ostype: other
sockets: 1
virtio0: local:108/vm-108-disk-1.vmdk,format=vmdk,cache=none,size=10G
virtio3: local:108/vm-108-disk-3.vmdk,format=vmdk,cache=none,size=100G

"cpu: host,kvm=off" makes no difference.

I have blacklisted "nvidia" and "nouveau" on the host, even though it should not matter with pci-stub enabled. It doesn't make a difference.

/etc/default/grub
Code:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
GRUB_CMDLINE_LINUX=""

I have no other options set currently, as I had no interrupt problems before.
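
Since edits to /etc/default/grub only take effect after running update-grub and rebooting, it may be worth confirming that the running kernel really received the option. A small sketch, assuming the option appears literally as intel_iommu=on; the helper name has_intel_iommu is invented for illustration:

```shell
#!/bin/sh
# Return 0 if the given kernel command line contains intel_iommu=on
# as a separate word, 1 otherwise.
# $1: a kernel command line string, e.g. the contents of /proc/cmdline
has_intel_iommu() {
    case " $1 " in
        *" intel_iommu=on "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage on the host (not executed here):
#   has_intel_iommu "$(cat /proc/cmdline)" && echo "IOMMU option is active"
```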

vfio is not working with my kernel, and with kernel 3.1xx I had problems with iommu_groups: the folders were not found or the modules were not loaded.
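
For the missing folders, a quick way to see what the kernel exposes is to walk /sys/kernel/iommu_groups. This is a sketch only: as far as I know that directory exists only on kernels with IOMMU group support (roughly 3.6 and later), so it would not be expected on 2.6.32. The function name list_iommu_groups and its optional root argument are assumptions for this example:

```shell
#!/bin/sh
# List every device together with its IOMMU group number.
# $1: root of the iommu_groups tree (hypothetical parameter for testing;
#     defaults to the real sysfs location)
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    if [ ! -d "$root" ]; then
        echo "no iommu_groups directory at $root" >&2
        return 1
    fi
    for dev in "$root"/*/devices/*; do
        # Skip the unexpanded glob when there are no groups at all.
        [ -e "$dev" ] || continue
        group="${dev#"$root"/}"   # strip the root prefix
        group="${group%%/*}"      # keep only the group number
        echo "group $group: $(basename "$dev")"
    done
}
```

If 84:00.0 and 84:00.1 show up in the same group, they would normally have to be handed to the same VM together.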

The NVIDIA driver in the VM is "309.08-quadro-tesla-win8-win7-winvista-64bit-international-whql.exe", which was working with both cards before.

Currently I have no clue at all why the FX1800 works without a problem and the K4000 does not.

I am a little reluctant to upgrade to Proxmox 4.x, as I might get other, more severe problems, e.g. the NVIDIA reset problem. Rebooting the host on a regular basis is not possible.

I really hope you can help me out. Maybe going back to my initial setup is the way, but how do I do that?

Cheers,
Stefan
 
Hi,
for Code 43, there is a new feature in QEMU 2.5 which allows passing hv_vendor to the CPU.
(Generally Code 43 comes from the Hyper-V extensions being enabled, but the latest drivers don't seem to work even if they are disabled.)
You can try choosing ostype: linux 2.6 instead of Windows. Maybe it'll work for you.


If you can, please try Proxmox 4.1 with all the latest updates from the no-subscription repository.
We have done a lot of optimisations recently (OVMF support, vfio, ...). Check the wiki.

After all updates are installed, download my patch:
http://odisoweb1.odiso.net/qemu-server_4.0-56_amd64.deb

and install it with
dpkg -i qemu-server_4.0-56_amd64.deb

(It adds hv_vendor. I would like to see whether it helps or not before pushing it officially into the Proxmox code.)
 

Hi Spirit,

thanks for the reply. I am using NVIDIA Quadros. Both are on the list of cards supported for virtualization, so there should be no need to hide anything hypervisor-related.

Here is the extract from the NVIDIA driver release notes:
Supported Graphics Cards


The following GPUs are supported for device passthrough:

Kepler:
GRID: K1, K2, K520, K340
Quadro: K2000, K4000, K5000, K6000
Tesla: K10, K20, K20x, K20Xm, K20c, K20s, K40m, K40c, K40s, K40st, K40t

Fermi:
Quadro: 2000, 4000, 5000, 6000
Quadro-MXM: 1000M, 3000M
Tesla: C2050, C2075, M2050, M2070, M2070Q

Tesla:
Quadro FX1800, 3800, 4800, 5800
Quadro-MXM: FX880M, FX2800M
Tesla: M1060, C1060

The thing is, it was working before the kernel up-/downgrade, so my suspicion is that something in the host system does not add up and the resources are not released/isolated correctly.
With the current log level I cannot see any signs that the host isn't releasing the card correctly. With Windows as the guest you can't see anything but Code 43, but with Linux guests it gets more precise. There I get kernel panics, segfaults and so on in the guest as soon as I try to access a feature of the K4000 (e.g. 2D acceleration). The nvidia kernel module in the guest loads without error. So that narrows it down to a lower hardware level.

So to the guests it looks like a hardware defect. As I tested the hardware natively, I can rule out a real hardware problem.

There is only one thing left: on the way from the host to the guest (basically in the passthrough) something gets lost. Maybe the hypervisor can't get access to all resources of that card, but I do not know how to check this with the current log level.

For example, if I start a Windows guest with the K4000 assigned, I see the following in dmesg on the host:

Code:
device tap108i0 entered promiscuous mode
vmbr0: port 14(tap108i0) entering forwarding state
pci-stub 0000:84:00.0: PCI INT A -> GSI 64 (level, low) -> IRQ 64
tap108i0: no IPv6 routers present
assign device 0:84:0.0
pci-stub 0000:84:00.0: Invalid ROM contents
pci-stub 0000:84:00.0: irq 108 for MSI/MSI-X

I don't get the "Invalid ROM contents" message every time. But the rest looks OK to me.
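
One way to look into the ROM message would be to dump the card's option ROM through sysfs and check its header. A PCI option ROM normally begins with the two bytes 0x55 0xAA, so a dump that starts differently would at least be consistent with the message. The sketch below only checks those two bytes; the function name rom_has_signature and the dump path in the comments are assumptions for illustration:

```shell
#!/bin/sh
# Return 0 if the file starts with the PCI option ROM signature 0x55 0xAA.
# $1: path to a ROM dump
rom_has_signature() {
    # Read the first two bytes as hex and strip all whitespace.
    sig="$(od -An -tx1 -N 2 "$1" | tr -d ' \n')"
    [ "$sig" = "55aa" ]
}

# Possible way to create a dump on the host (as root, VM stopped; not executed here):
#   cd /sys/bus/pci/devices/0000:84:00.0
#   echo 1 > rom; cat rom > /tmp/k4000.rom; echo 0 > rom
#   rom_has_signature /tmp/k4000.rom && echo "signature looks OK"
```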

Is there a way to increase the log level to show more hardware events?

I am trying to avoid reinstalling the host, as it takes me about a week to get everything back up and running, with all the testing and so on.

For my understanding: when I upgraded the pve-kernel from 2.6xxx to 3.1xxx, what happened in the system?
When I downgraded back to 2.6xx by uninstalling 3.1xxx via apt-get, was something left over from 3.1xx, maybe some libraries that are still in use?
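
A rough way to look for leftovers could be to compare the module trees under /lib/modules with the running kernel and to list the installed kernel packages. This is only a sketch of that idea, not a complete check; the function name list_other_module_trees and its optional root argument are made up for this example:

```shell
#!/bin/sh
# Print every kernel version that has a module tree but is not the
# running kernel.
# $1: running kernel version, e.g. the output of uname -r
# $2: module root (hypothetical parameter for testing; defaults to /lib/modules)
list_other_module_trees() {
    running="$1"
    root="${2:-/lib/modules}"
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        v="$(basename "$d")"
        [ "$v" = "$running" ] || echo "$v"
    done
}

# Possible usage on the host (not executed here):
#   list_other_module_trees "$(uname -r)"
#   dpkg -l 'pve-kernel*'    # shows which kernel packages dpkg still knows about
```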