Tesla K80

Mar 29, 2020
I am trying to install a Tesla K80 on Proxmox 6.1-7 (I am too scared to upgrade to the latest and greatest...).

I was able to pass both cards (the K80 shows up as two GPUs in lspci) to a Windows VM without 4G decoding.

The VM booted fine and I went on to install the required Nvidia drivers. The machine rebooted and now refuses to boot. If I remove 4G decoding I get the error below.

PS: This box and its virtual machines have been running a bunch of Nvidia Quadros without issues, so I assume the issue is K80-related.

Here is the config:
Code:
qm config 111
agent: 1
args: -machine pc,max-ram-below-4g=4G
bootdisk: scsi0
cores: 15
cpu: host,flags=+pdpe1gb
hostpci0: 03:00,pcie=1,rombar=0
hostpci1: 06:00.0,pcie=1
hostpci2: 07:00.0,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 40000
name: Abaqus2020
net0: virtio=E2:47:D3:F8:55:16,bridge=vmbr0,firewall=1
numa: 1
ostype: win10
scsi0: ssdzfs:vm-111-disk-0,discard=on,iothread=1,size=150G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=b4b5c719-f41e-4122-87d3-ea156a0f6677
sockets: 2
tablet: 1
usb0: host=3-1,usb3=1
vmgenid: a0af1fb9-4037-46e4-bf9a-2ff90781c941

Error when starting the VM:
Code:
kvm:/usr/share/qemu-server/pve-q35-4.0.cfg:1: Bus 'pcie.0' not found
start failed: QEMU exited with code 1

Here is what I am getting from the console (not much, IMHO; I manually patched the E1000 driver):
Code:
Mar 15 12:46:10 proxbox kernel: [  854.292859] device tap111i0 entered promiscuous mode
Mar 15 12:46:11 proxbox kernel: [  854.339792] fwbr111i0: port 1(fwln111i0) entered blocking state
Mar 15 12:46:11 proxbox kernel: [  854.339811] fwbr111i0: port 1(fwln111i0) entered disabled state
Mar 15 12:46:11 proxbox kernel: [  854.339934] device fwln111i0 entered promiscuous mode
Mar 15 12:46:11 proxbox kernel: [  854.340008] fwbr111i0: port 1(fwln111i0) entered blocking state
Mar 15 12:46:11 proxbox kernel: [  854.340013] fwbr111i0: port 1(fwln111i0) entered forwarding state
Mar 15 12:46:11 proxbox kernel: [  854.345126] vmbr0: port 2(fwpr111p0) entered blocking state
Mar 15 12:46:11 proxbox kernel: [  854.345131] vmbr0: port 2(fwpr111p0) entered disabled state
Mar 15 12:46:11 proxbox kernel: [  854.345219] device fwpr111p0 entered promiscuous mode
Mar 15 12:46:11 proxbox kernel: [  854.345265] vmbr0: port 2(fwpr111p0) entered blocking state
Mar 15 12:46:11 proxbox kernel: [  854.345268] vmbr0: port 2(fwpr111p0) entered forwarding state
Mar 15 12:46:11 proxbox kernel: [  854.350491] fwbr111i0: port 2(tap111i0) entered blocking state
Mar 15 12:46:11 proxbox kernel: [  854.350495] fwbr111i0: port 2(tap111i0) entered disabled state
Mar 15 12:46:11 proxbox kernel: [  854.350598] fwbr111i0: port 2(tap111i0) entered blocking state
Mar 15 12:46:11 proxbox kernel: [  854.350601] fwbr111i0: port 2(tap111i0) entered forwarding state
Mar 15 12:46:11 proxbox kernel: [  854.724182] fwbr111i0: port 2(tap111i0) entered disabled state
Mar 15 12:46:11 proxbox kernel: [  854.746326] fwbr111i0: port 1(fwln111i0) entered disabled state
Mar 15 12:46:11 proxbox kernel: [  854.746409] vmbr0: port 2(fwpr111p0) entered disabled state
Mar 15 12:46:11 proxbox kernel: [  854.746774] device fwln111i0 left promiscuous mode
Mar 15 12:46:11 proxbox kernel: [  854.746783] fwbr111i0: port 1(fwln111i0) entered disabled state
Mar 15 12:46:11 proxbox kernel: [  854.765504] device fwpr111p0 left promiscuous mode
Mar 15 12:46:11 proxbox kernel: [  854.765512] vmbr0: port 2(fwpr111p0) entered disabled state

This is the output of dmesg:
Code:
dmesg | grep -e DMAR -e IOMMU
[    0.012706] ACPI: DMAR 0x00000000B9E1A908 000108 (v01 DELL   CBX3     00000001 INTL 20091013)
[    0.280077] DMAR: IOMMU enabled
[    0.517030] DMAR: Host address width 46
[    0.517032] DMAR: DRHD base: 0x000000f7ffc000 flags: 0x0
[    0.517039] DMAR: dmar0: reg_base_addr f7ffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[    0.517041] DMAR: DRHD base: 0x000000f3ffd000 flags: 0x0
[    0.517046] DMAR: dmar1: reg_base_addr f3ffd000 ver 1:0 cap d2008c10ef0466 ecap f0205b
[    0.517049] DMAR: DRHD base: 0x000000f3ffc000 flags: 0x1
[    0.517053] DMAR: dmar2: reg_base_addr f3ffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[    0.517056] DMAR: RMRR base: 0x000000bafae000 end: 0x000000bafbcfff
[    0.517058] DMAR: ATSR flags: 0x0
[    0.517060] DMAR: RHSA base: 0x000000f3ffc000 proximity domain: 0x0
[    0.517062] DMAR: RHSA base: 0x000000f7ffc000 proximity domain: 0x1
[    0.517066] DMAR-IR: IOAPIC id 3 under DRHD base  0xf7ffc000 IOMMU 0
[    0.517068] DMAR-IR: IOAPIC id 1 under DRHD base  0xf3ffc000 IOMMU 2
[    0.517070] DMAR-IR: IOAPIC id 2 under DRHD base  0xf3ffc000 IOMMU 2
[    0.517072] DMAR-IR: HPET id 0 under DRHD base 0xf3ffc000
[    0.517074] DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
[    0.517075] DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
[    0.517909] DMAR-IR: Enabled IRQ remapping in xapic mode
[    1.622623] DMAR: dmar1: Using Queued invalidation
[    1.622633] DMAR: dmar2: Using Queued invalidation
[    1.664133] DMAR: Intel(R) Virtualization Technology for Directed I/O

lspci correctly detects the K80:
Code:
# lspci | grep NVIDIA
03:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [Quadro K4200] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GK104 HDMI Audio Controller (rev a1)
06:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
# lspci -n -s 06:00
06:00.0 0302: 10de:102d (rev a1)
# lspci -n -s 07:00
07:00.0 0302: 10de:102d (rev a1)

vfio.conf for PCI passthrough:
Code:
more /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:13f1,10de:0fbb disable_vga=1
options vfio-pci ids=10de:0dd8,10de:0be9 disable_vga=1
options vfio-pci ids=10de:13bb,10de:0fbc disable_vga=1
options vfio-pci ids=10de:0ffe,10de:0e1b disable_vga=1
options vfio-pci ids=10de:1eb1,10de:10f8,10de:1ad8,10de:1ad9 disable_vga=1
options vfio-pci ids=10de:102d
 
kvm:/usr/share/qemu-server/pve-q35-4.0.cfg:1: Bus 'pcie.0' not found
That error makes sense: the config says 'machine: q35', but the args line says '-machine pc' (which is i440fx).
So the QEMU command line tries to load the q35 config file, but the actual machine is i440fx, which has no PCIe bus.
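To spot this kind of mismatch on other VMs, something like the following could be used. It is only a rough sketch: check_machine_mismatch is a made-up helper name, it expects `qm config` output on stdin, and it assumes any machine type without "q35" in its name is i440fx.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): compare the "machine:" line
# with a "-machine" override in the "args:" line of `qm config` output.
check_machine_mismatch() {
    awk '
        # Assumption: anything that does not mention q35 is i440fx ("pc").
        function kind(s) { return (s ~ /q35/) ? "q35" : "i440fx" }
        /^machine:/ { cfg = $2 }
        /^args:/ {
            # Look for "-machine <type>[,options]" among the extra QEMU args.
            for (i = 2; i < NF; i++)
                if ($i == "-machine") { split($(i + 1), parts, ","); arg = parts[1] }
        }
        END {
            if (cfg != "" && arg != "" && kind(cfg) != kind(arg)) {
                printf "mismatch: machine=%s but args sets -machine %s\n", cfg, arg
                exit 1
            }
            print "ok"
        }
    '
}
```

Usage would be `qm config 111 | check_machine_mismatch`; for the config posted above it should report the q35/pc mismatch and return a non-zero exit status.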

Either change that 'pc' to 'q35', or remove the 'args' line altogether.
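Applied to the config posted above, the first option would look roughly like this. It is a sketch that assumes the max-ram-below-4g override is still wanted; only the args line changes and every other line stays as it is:

```
args: -machine q35,max-ram-below-4g=4G
machine: q35
```

For the second option, `qm set 111 --delete args` should drop the override entirely.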

Also, PCIe passthrough may work better if you use OVMF instead of SeaBIOS.
 
