Nvidia Quadro 2000 GPU Passthrough Causes Kernel Panic

noisufnoc · Sep 3, 2019

After some research, I picked up a Quadro 2000 for use as a passthrough GPU for a Windows 10 VM.

I rebuilt my VM using the following guide from reddit (https://www.reddit.com/r/homelab/comments/b5xpua/the_ultimate_beginners_guide_to_gpu_passthrough/), but my entire system kernel panics when I try to boot the VM. Shortly after booting the Windows VM with the passthrough enabled, I can see the panic in the server's syslog. Eventually the entire Proxmox server goes unresponsive and requires a hardware reset. Any ideas?

lspci output

Code:

03:00.0 VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GF106 High Definition Audio Controller (rev a1)

After reviewing this doc https://pve.proxmox.com/wiki/Pci_passthrough, I dumped the ROM and used rom-parser to review it:

Code:

Valid ROM signature found @0h, PCIR offset 188h
PCIR: type 0 (x86 PC-AT), vendor: 10de, device: 0dd8, class: 030000
PCIR: revision 0, vendor revision: 1
Last image

According to the doc, the ROM needs to contain a type 3 image to be UEFI compatible, and mine only shows type 0. I'm not sure if this is a deal breaker or if the reddit guide doesn't address older hardware?
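For anyone repeating this check, here is a minimal sketch of testing rom-parser output for an EFI image. `has_efi_image` is a made-up helper name, not part of rom-parser, and it assumes the `PCIR: type N` line format shown in the output above.

```shell
# has_efi_image: read rom-parser output on stdin.
# Exit status 0 if any image reports PCIR type 3 (EFI), non-zero otherwise.
has_efi_image() {
    # The trailing group keeps "type 3" from matching a longer number such as "type 30".
    grep -Eq 'PCIR: type 3([^0-9]|$)'
}
```

It could be used as `./rom-parser vbios.rom | has_efi_image || echo "no EFI image found"`, where `vbios.rom` is a placeholder for the dumped ROM file. For the dump above it would report that no EFI image was found.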

dcsapak · Sep 4, 2019

can you post the complete vm config ( qm config ID ) and the iommu groups ? maybe also the dmesg output ? also does your log show anything when the host goes unresponsive?

noisufnoc · Sep 4, 2019

dcsapak said:
can you post the complete vm config ( qm config ID ) and the iommu groups ? maybe also the dmesg output ? also does your log show anything when the host goes unresponsive?

Code:

root@splinter:~# qm config 110
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'
bios: ovmf
boot: dc
bootdisk: scsi0
cores: 4
cpu: host,hidden=1,flags=+pcid
efidisk0: sas15k:vm-110-disk-0,size=128K
ide0: iso:iso/virtio-win-0.1.141.iso,media=cdrom,size=309208K
ide2: iso:iso/Win10_1903_V1_English_x64.iso,media=cdrom
machine: q35
memory: 6144
name: win10
net0: virtio=3A:AC:CE:AA:F7:C8,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsi0: local-zfs:vm-110-disk-0,cache=writeback,iothread=1,replicate=0,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=4d2208a9-e110-413d-9560-4d37b3b7b1ba
sockets: 1
vmgenid: 9e355091-d2f4-4f99-a749-f6aacfb9b7d0

how do i find the iommu groups? i do know there are some events in the syslog/dmesg when it goes unresponsive, i'll test it again soon.

dcsapak · Sep 4, 2019

afaics this config does not include any passed through card? (no hostpci setting)
also:
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'

will overwrite this line:

cpu: host,hidden=1,flags=+pcid

why do you need to set it in args?
if for the vendor_id, you can either set this manually on the cpu line,e.g.

Code:

cpu: host,hidden=1,flags=+pcid,hv-vendor-id=FOO

or set 'x-vga=on' on the hostpci line (we detect a graphic card passthrough and set the vendor automatically to something)
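As a rough sketch of how to spot the conflict described above (an `args:` line carrying its own `-cpu` next to a `cpu:` line), the helper below scans a VM config on stdin. `cpu_conflict` is a hypothetical name; the only assumption is the `key: value` layout that `qm config` prints.

```shell
# cpu_conflict: read a VM config (as printed by `qm config ID`) on stdin.
# Exit status 0 if both an args line containing -cpu and a cpu line are present.
cpu_conflict() {
    awk '
        /^args:/ && /-cpu/ { args = 1 }   # raw -cpu passed straight to kvm
        /^cpu:/            { cpu = 1 }    # cpu setting managed by Proxmox
        END { exit !(args && cpu) }       # 0 (success) only when both were seen
    '
}
```

For example, `qm config 110 | cpu_conflict && echo "args -cpu overrides the cpu line"` would print the warning for the config posted earlier in the thread.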

noisufnoc · Sep 4, 2019

dcsapak said:
afaics this config does not include any passed through card? (no hostpci setting)
also:
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'

will overwrite this line:

cpu: host,hidden=1,flags=+pcid

why do you need to set it in args?
if for the vendor_id, you can either set this manually on the cpu line,e.g.

Code:

cpu: host,hidden=1,flags=+pcid,hv-vendor-id=FOO

or set 'x-vga=on' on the hostpci line (we detect a graphic card passthrough and set the vendor automatically to something)

So I may have messed around with that file after the panic started. In your opinion, what settings should I be using to successfully pass through the card?

dcsapak · Sep 4, 2019

for the most part the gui options should be enough, but the config settings are very well documented here: https://pve.proxmox.com/wiki/PCI(e)_Passthrough

noisufnoc · Sep 4, 2019

dcsapak said:
for the most part the gui options should be enough, but the config settings are very well documented here: https://pve.proxmox.com/wiki/PCI(e)_Passthrough

sounds good, i'll re-read that doc and post the results.

noisufnoc · Sep 4, 2019

Okay, went back and made some changes, after reading the PCI(e) doc.

Code:

bios: ovmf
boot: dc
bootdisk: scsi0
cores: 4
cpu: host,hidden=1,flags=+pcid,hv-vendor-id=FOO
efidisk0: sas15k:vm-110-disk-0,size=128K
hostpci0: 03:00,pcie=1
ide0: iso:iso/virtio-win-0.1.141.iso,media=cdrom,size=309208K
ide2: iso:iso/Win10_1903_V1_English_x64.iso,media=cdrom
machine: q35
memory: 6144
name: win10
net0: virtio=3A:AC:CE:AA:F7:C8,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsi0: local-zfs:vm-110-disk-0,cache=writeback,iothread=1,replicate=0,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=4d2208a9-e110-413d-9560-4d37b3b7b1ba
sockets: 1
vmgenid: 9e355091-d2f4-4f99-a749-f6aacfb9b7d0

Code:

root@splinter:~# dmesg | grep IOMMU
[    0.953135] DMAR: IOMMU enabled
[    1.365841] DMAR-IR: IOAPIC id 3 under DRHD base  0xfbffe000 IOMMU 0
[    1.365844] DMAR-IR: IOAPIC id 0 under DRHD base  0xdfffc000 IOMMU 1
[    1.365846] DMAR-IR: IOAPIC id 2 under DRHD base  0xdfffc000 IOMMU 1

Started the VM, and the same panic behavior. The entire proxmox host goes unresponsive after about 60 seconds.

Code:

Sep  4 11:09:42 splinter kernel: [  449.488837] perf: interrupt took too long (919371 > 417846), lowering kernel.perf_event_max_sample_rate to 250
Sep  4 11:09:45 splinter kernel: [  450.846894] hrtimer: interrupt took 1046613435 ns
Sep  4 11:09:55 splinter kernel: [  460.783512] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 58.208 msecs

Message from syslogd@splinter at Sep  4 11:09:55 ...
 kernel:[  461.631639] watchdog: BUG: soft lockup - CPU#30 stuck for 22s! [kvm:6976]
Sep  4 11:09:55 splinter kernel: [  461.302570] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 229.328 msecs
Sep  4 11:09:55 splinter kernel: [  461.612431] perf: interrupt took too long (2015763 > 1149213), lowering kernel.perf_event_max_sample_rate to 250
Sep  4 11:09:55 splinter kernel: [  461.631641] Modules linked in: tcp_diag inet_diag nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache veth ebtable_filter ebtables ip_set ip6table_filter ip6_tables iptable_filter bpfilter softdog snd_hda_codec_hdmi nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ast aes_x86_64 crypto_simd ttm cryptd glue_helper drm_kms_helper intel_cstate snd_hda_intel intel_rapl_perf pcspkr drm input_leds snd_hda_codec joydev snd_hda_core fb_sys_fops syscopyarea sysfillrect snd_hwdep sysimgblt snd_pcm snd_timer snd mei_me mei soundcore ipmi_si ipmi_devintf ipmi_msghandler mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_virqfd irqbypass sunrpc vfio_iommu_type1 vfio ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) btrfs xor
Sep  4 11:09:55 splinter kernel: [  461.631705]  zstd_compress raid6_pq libcrc32c hid_logitech_hidpp hid_logitech_dj hid_generic usbkbd usbmouse usbhid hid ixgbe igb ahci xfrm_algo i2c_algo_bit lpc_ich i2c_i801 libahci dca isci mdio mpt3sas libsas raid_class scsi_transport_sas wmi
Sep  4 11:09:55 splinter kernel: [  461.631727] CPU: 30 PID: 6976 Comm: kvm Tainted: P           O      5.0.18-1-pve #1
Sep  4 11:09:55 splinter kernel: [  461.631728] Hardware name: Cirrascale VB1416/GA-7PESH2, BIOS R17 06/26/2018
Sep  4 11:09:55 splinter kernel: [  461.631737] RIP: 0010:smp_call_function_single+0xd2/0xf0
Sep  4 11:09:55 splinter kernel: [  461.631740] Code: 65 48 33 0c 25 28 00 00 00 75 34 c9 c3 48 89 d1 48 89 f2 48 89 e6 e8 6d fe ff ff 8b 54 24 18 83 e2 01 74 0b f3 90 8b 54 24 18 <83> e2 01 75 f5 eb ca 8b 05 41 74 a2 01 85 c0 75 88 0f 0b eb 84 e8
Sep  4 11:09:55 splinter kernel: [  461.631742] RSP: 0000:ffffc08608e8fba0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Sep  4 11:09:55 splinter kernel: [  461.631745] RAX: 0000000000000000 RBX: 000000000000001e RCX: 0000000000000830
Sep  4 11:09:55 splinter kernel: [  461.631746] RDX: 0000000000000003 RSI: 00000000000008fb RDI: 0000000000000830
Sep  4 11:09:55 splinter kernel: [  461.631748] RBP: ffffc08608e8fbf8 R08: 000000000000000c R09: ffff9ca80f407c00
Sep  4 11:09:55 splinter kernel: [  461.631750] R10: 0000000000000004 R11: 00000065ceb9e927 R12: ffffffff95e84470
Sep  4 11:09:55 splinter kernel: [  461.631751] R13: ffffc08608e8fd08 R14: 0000000000000001 R15: ffff9cb4095721a0
Sep  4 11:09:55 splinter kernel: [  461.631754] FS:  00007fd50b228dc0(0000) GS:ffff9cb40fb80000(0000) knlGS:0000000000000000
Sep  4 11:09:55 splinter kernel: [  461.631756] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  4 11:09:55 splinter kernel: [  461.631757] CR2: 00007f15a6a64000 CR3: 00000015122ba006 CR4: 00000000000626e0
Sep  4 11:09:55 splinter kernel: [  461.631759] Call Trace:
Sep  4 11:09:55 splinter kernel: [  461.631766]  ? flush_tlb_func_common.constprop.9+0x230/0x230
Sep  4 11:09:55 splinter kernel: [  461.631771]  ? cpumask_next_and+0x1e/0x20
Sep  4 11:09:55 splinter kernel: [  461.631773]  smp_call_function_many+0x223/0x250
Sep  4 11:09:55 splinter kernel: [  461.631777]  ? flush_tlb_func_common.constprop.9+0x230/0x230
Sep  4 11:09:55 splinter kernel: [  461.631779]  on_each_cpu_mask+0x2a/0x70
Sep  4 11:09:55 splinter kernel: [  461.631782]  ? x86_configure_nx+0x50/0x50
Sep  4 11:09:55 splinter kernel: [  461.631785]  on_each_cpu_cond_mask+0xab/0x140
Sep  4 11:09:55 splinter kernel: [  461.631788]  ? flush_tlb_func_common.constprop.9+0x230/0x230
Sep  4 11:09:55 splinter kernel: [  461.631791]  native_flush_tlb_others+0xca/0x130
Sep  4 11:09:55 splinter kernel: [  461.631794]  flush_tlb_mm_range+0xde/0x110
Sep  4 11:09:55 splinter kernel: [  461.631800]  change_protection+0x91c/0xb00
Sep  4 11:09:55 splinter kernel: [  461.631806]  change_prot_numa+0x1c/0x40
Sep  4 11:09:55 splinter kernel: [  461.631809]  task_numa_work+0x1f0/0x300
Sep  4 11:09:55 splinter kernel: [  461.631815]  task_work_run+0x9d/0xc0
Sep  4 11:09:55 splinter kernel: [  461.631820]  exit_to_usermode_loop+0xf2/0x100
Sep  4 11:09:55 splinter kernel: [  461.631823]  prepare_exit_to_usermode+0x66/0x90
Sep  4 11:09:55 splinter kernel: [  461.631827]  retint_user+0x8/0x8
Sep  4 11:09:55 splinter kernel: [  461.631829] RIP: 0033:0x55648e071e1b
Sep  4 11:09:55 splinter kernel: [  461.631832] Code: 0f 1f 80 00 00 00 00 31 c0 4c 39 c7 0f 94 c0 48 83 c4 08 5b 5d c3 66 0f 1f 84 00 00 00 00 00 31 d2 48 89 f8 4c 89 c6 49 f7 f0 <83> c6 01 31 c0 d1 fe 48 63 f6 48 39 f2 0f 92 c0 48 83 c4 08 5b 5d
Sep  4 11:09:55 splinter kernel: [  461.631833] RSP: 002b:00007ffc957458d0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Sep  4 11:09:55 splinter kernel: [  461.631836] RAX: 00000000000004b8 RBX: 00007fd382b80a88 RCX: 000055648e071e10
Sep  4 11:09:55 splinter kernel: [  461.631838] RDX: 000000000000174d RSI: 0000000000002e9c RDI: 0000000000dc076d
Sep  4 11:09:55 splinter kernel: [  461.631839] RBP: 00000002da2f52fd R08: 0000000000002e9c R09: 00333ac06ff9b274
Sep  4 11:09:55 splinter kernel: [  461.631841] R10: 000000003b9aca00 R11: 00333ac06ff9b274 R12: 00000002da2f52fc
Sep  4 11:09:55 splinter kernel: [  461.631842] R13: 0000000000000001 R14: 0000000000000000 R15: 000055648e0724c0

dcsapak · Sep 5, 2019

can you post the output of

Code:

lspci -nnk
find /sys/kernel/iommu_groups -type l
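The `find` output is one symlink path per device, which gets hard to read on a large host. A small sketch that turns it into a sorted "group, device" listing follows. `format_iommu_groups` is a hypothetical helper, and it assumes the usual sysfs layout `/sys/kernel/iommu_groups/<group>/devices/<address>`.

```shell
# format_iommu_groups: read paths from `find /sys/kernel/iommu_groups -type l`
# on stdin and print "<group><TAB><pci address>", sorted by group number.
format_iommu_groups() {
    # Splitting on "/" gives: $5 = group number, $7 = PCI address.
    awk -F/ '{ print $5 "\t" $7 }' | sort -n
}
```

Usage would be `find /sys/kernel/iommu_groups -type l | format_iommu_groups`. Devices that share a group number with the GPU are the ones to look at first.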

noisufnoc · Sep 6, 2019

dcsapak said:
can you post the output of

Code:

lspci -nnk
find /sys/kernel/iommu_groups -type l

These were large, so I used termbin to host the output...

root@splinter:~# lspci -nnk | nc termbin.com 9999
https://termbin.com/rulr
root@splinter:~# find /sys/kernel/iommu_groups -type l | nc termbin.com 9999
https://termbin.com/06dw

dcsapak · Sep 9, 2019

the only thing i saw is

Kernel driver in use: snd_hda_intel

of the audio part of the graphics card, maybe it helps to blacklist that driver, or bind the device also to vfio-pci?
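To check which driver currently holds a given function, here is a minimal sketch that pulls the "Kernel driver in use" value for one slot out of `lspci -k` (or `lspci -nnk`) output. `driver_in_use` is a made-up helper name; it assumes the standard lspci layout where device lines start in column 0 and detail lines are indented.

```shell
# driver_in_use SLOT: read `lspci -k` output on stdin and print the driver
# bound to SLOT (e.g. 03:00.1). Prints nothing if no driver is bound.
driver_in_use() {
    awk -v slot="$1" '
        /^[0-9a-f]/ { current = $1 }    # device header line, first field is the slot
        current == slot && /Kernel driver in use:/ { print $NF }
    '
}
```

For example, `lspci -k | driver_in_use 03:00.1` would print `snd_hda_intel` on this host as posted, and should print `vfio-pci` (or nothing) once the binding or blacklist has taken effect.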

noisufnoc · Sep 9, 2019

dcsapak said:
the only thing i saw is

Kernel driver in use: snd_hda_intel

of the audio part of the graphics card, maybe it helps to blacklist that driver, or bind the device also to vfio-pci?

I added it to the conf file I created in `/etc/modprobe.d/` as part of the reddit guide

Code:

blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist snd_hda_intel

noisufnoc · Sep 9, 2019

noisufnoc said:
I added it to the conf file I created in `/etc/modprobe.d/` as part of the reddit guide

Code:

blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist snd_hda_intel

Rebooted the host after adding the snd_hda_intel driver to the blacklist, and it appears to be having the same issue. I'm getting the same error messages on the console as before.

noisufnoc · Sep 9, 2019

Just to test, I tried adding the audio part of the graphics card to the VM...

Code:

Virtual Environment 6.0-5
Virtual Machine 110 (win10) on node 'splinter'
kvm: -device vfio-pci,host=03:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: Failed to mmap 0000:03:00.0 BAR 3. Performance may be slow
kvm: -device vfio-pci,host=03:00.0,id=hostpci1.0,bus=ich9-pcie-port-2,addr=0x0.0,multifunction=on: vfio 0000:03:00.0: device is already attached
TASK ERROR: start failed: command '/usr/bin/kvm -id 110 -name win10 -chardev 'socket,id=qmp,path=/var/run/qemu-server/110.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/110.pid -daemonize -smbios 'type=1,uuid=4d2208a9-e110-413d-9560-4d37b3b7b1ba' -drive 'if=pflash,unit=0,format=raw,readonly,file=/usr/share/pve-edk2-firmware//OVMF_CODE.fd' -drive 'if=pflash,unit=1,format=raw,id=drive-efidisk0,file=/dev/zvol/sas15k/vm-110-disk-0' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/110.vnc,password -no-hpet -cpu 'host,+pcid,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=FOO,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,hv_ipi,kvm=off' -m 6144 -object 'iothread,id=iothread-virtioscsi0' -device 'vmgenid,guid=9e355091-d2f4-4f99-a749-f6aacfb9b7d0' -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=03:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on' -device 'vfio-pci,host=03:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1' -device 'vfio-pci,host=03:00.0,id=hostpci1.0,bus=ich9-pcie-port-2,addr=0x0.0,multifunction=on' -device 'vfio-pci,host=03:00.1,id=hostpci1.1,bus=ich9-pcie-port-2,addr=0x0.1' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:c083bd9974a4' -drive 'file=/mnt/pve/iso/template/iso/virtio-win-0.1.141.iso,if=none,id=drive-ide0,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=100' -drive 'file=/mnt/pve/iso/template/iso/Win10_1903_V1_English_x64.iso,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 
'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' -drive 'file=/dev/zvol/rpool/data/vm-110-disk-0,if=none,id=drive-scsi0,cache=writeback,format=raw,aio=threads,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=200' -netdev 'type=tap,id=net0,ifname=tap110i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=3A:AC:CE:AA:F7:C8,netdev=net0,bus=pci.0,addr=0x12,id=net0' -rtc 'driftfix=slew,base=localtime' -machine 'type=q35' -global 'kvm-pit.lost_tick_policy=discard'' failed: exit code 1

Code:

root@splinter:/var/log# cat /etc/pve/qemu-server/110.conf
bios: ovmf
boot: dc
bootdisk: scsi0
cores: 4
cpu: host,hidden=1,flags=+pcid,hv-vendor-id=FOO
efidisk0: sas15k:vm-110-disk-0,size=128K
hostpci0: 03:00,pcie=1
hostpci1: 03:00,pcie=1
ide0: iso:iso/virtio-win-0.1.141.iso,media=cdrom,size=309208K
ide2: iso:iso/Win10_1903_V1_English_x64.iso,media=cdrom
machine: q35
memory: 6144
name: win10
net0: virtio=3A:AC:CE:AA:F7:C8,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsi0: local-zfs:vm-110-disk-0,cache=writeback,iothread=1,replicate=0,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=4d2208a9-e110-413d-9560-4d37b3b7b1ba
sockets: 1
vmgenid: 9e355091-d2f4-4f99-a749-f6aacfb9b7d0

doesn't seem right, going to remove it.

still panicking when booting the VM. are my config entries correct?

dcsapak · Sep 10, 2019

noisufnoc said:
I added it to the conf file I created in `/etc/modprobe.d/` as part of the reddit guide

did you update the initramfs after? check with lspci -k to see if that device really is not using the driver anymore

noisufnoc said:
hostpci0: 03:00,pcie=1
hostpci1: 03:00,pcie=1

this tries to pass through the same device twice, once is enough

the shorthand '03:00' instead of '03:00.0' already passes through all functions (including the audio device)

maybe try only the gpu function? (with '03:00.0')
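To see which functions the shorthand would cover on a particular host, here is a small sketch that lists the PCI functions present for a slot in sysfs. `list_functions` is a hypothetical helper. It assumes PCI domain `0000` and the default sysfs path; the optional second argument only exists so the path can be overridden.

```shell
# list_functions SLOT [ROOT]: print every PCI function found for SLOT
# (e.g. 03:00) under ROOT, which defaults to /sys/bus/pci/devices.
list_functions() {
    slot=$1
    root=${2:-/sys/bus/pci/devices}
    for dev in "$root"/0000:"$slot".*; do
        # If nothing matches, the pattern stays literal, so check it exists.
        if [ -e "$dev" ]; then
            basename "$dev"
        fi
    done
}
```

On this host `list_functions 03:00` should list both the VGA function and the audio function shown in the lspci output at the top of the thread.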

noisufnoc · Sep 11, 2019

dcsapak said:
did you update the initramfs after? check with lspci -k to see if that device really is not using the driver anymore

this tries to pass through the same device twice, once is enough

the shorthand '03:00' instead of '03:00.0' already passes through all functions (including the audio device)

maybe try only the gpu function? (with '03:00.0')

I updated the initramfs and verified that the driver isn't in use anymore

Code:

03:00.0 VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1)
        Subsystem: NVIDIA Corporation GF106GL [Quadro 2000]
        Kernel modules: nvidiafb, nouveau
03:00.1 Audio device: NVIDIA Corporation GF106 High Definition Audio Controller (rev a1)
        Subsystem: NVIDIA Corporation GF106 High Definition Audio Controller
        Kernel modules: snd_hda_intel

Still panics. Going to try 03:00.0

noisufnoc · Sep 11, 2019

Still panics with the following settings

Code:

cat /etc/pve/qemu-server/110.conf
bios: ovmf
boot: dc
bootdisk: scsi0
cores: 4
cpu: host,hidden=1,flags=+pcid,hv-vendor-id=FOO
efidisk0: sas15k:vm-110-disk-0,size=128K
hostpci0: 03:00.0,pcie=1
ide0: iso:iso/virtio-win-0.1.141.iso,media=cdrom,size=309208K
ide2: iso:iso/Win10_1903_V1_English_x64.iso,media=cdrom
machine: q35
memory: 6144
name: win10
net0: virtio=3A:AC:CE:AA:F7:C8,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsi0: local-zfs:vm-110-disk-0,cache=writeback,iothread=1,replicate=0,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=4d2208a9-e110-413d-9560-4d37b3b7b1ba
sockets: 1
vmgenid: 9e355091-d2f4-4f99-a749-f6aacfb9b7d0

Code:

Sep 10 20:57:44 splinter kernel: [  638.273892]  znvpair(PO) zavl(PO) icp(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c hid_logitech_hidpp hid_logitech_dj hid_generic usbkbd usbmouse usbhid hid ixgbe igb ahci i2c_algo_bit xfrm_algo i2c_i801 libahci lpc_ich isci dca mpt3sas mdio libsas raid_class scsi_transport_sas wmi
Sep 10 20:57:44 splinter kernel: [  638.273911] CPU: 27 PID: 31324 Comm: kvm Tainted: P           O L    5.0.18-1-pve #1
Sep 10 20:57:44 splinter kernel: [  638.273925] RAX: 0000000000000007 RBX: ffff99accfae3c00 RCX: ffff99a0cf9e9260
Sep 10 20:57:44 splinter kernel: [  638.273930] R10: ffffd84b21252440 R11: 0000000000000000 R12: ffffffffa8c84470
Sep 10 20:57:44 splinter kernel: [  638.273934] FS:  00007f4c007ff700(0000) GS:ffff99accfac0000(0000) knlGS:0000000000000000
Sep 10 20:57:44 splinter kernel: [  638.273937] CR2: 00007f8f16186008 CR3: 00000015f72c8006 CR4: 00000000000626e0
Sep 10 20:57:44 splinter kernel: [  638.273946]  ? flush_tlb_func_common.constprop.9+0x230/0x230
Sep 10 20:57:44 splinter kernel: [  638.273952]  ? x86_configure_nx+0x50/0x50
Sep 10 20:57:44 splinter kernel: [  638.273960]  ? flush_tlb_func_common.constprop.9+0x230/0x230
Sep 10 20:57:44 splinter kernel: [  638.273970]  pmdp_invalidate+0xe9/0xf0
Sep 10 20:57:44 splinter kernel: [  638.283977]  ? default_wake_function+0x12/0x20
Sep 10 20:57:44 splinter kernel: [  638.283987]  task_work_run+0x9d/0xc0
Sep 10 20:57:44 splinter kernel: [  638.291519] RIP: 0033:0x7f4c10c7d427
Sep 10 20:57:44 splinter kernel: [  638.291528] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001e
Sep 10 20:57:44 splinter kernel: [  639.309805]  ? vmx_vcpu_run+0x28c/0xc70 [kvm_intel]

noisufnoc · Sep 16, 2019

dcsapak said:
did you update the initramfs after? check with lspci -k to see if that device really is not using the driver anymore

this tries to pass through the same device twice, once is enough

the shorthand '03:00' instead of '03:00.0' already passes through all functions (including the audio device)

maybe try only the gpu function? (with '03:00.0')

I'm still striking out over here...any other ideas?

noisufnoc · Sep 17, 2019

So I did more troubleshooting today. Upon starting the Windows VM that has the passthrough, the following shows up in `/var/log/messages`

Code:

Sep 16 19:02:05 splinter kernel: [  360.498080] vfio-pci 0000:03:00.0: BAR 3: can't reserve [mem 0xd0000000-0xd3ffffff 64bit pref]
Sep 16 19:02:24 splinter kernel: [  372.015058] INFO: NMI handler (ghes_notify_nmi) took too long to run: 11.637 msecs
Sep 16 19:02:24 splinter kernel: [  379.896253] INFO: NMI handler (ghes_notify_nmi) took too long to run: 52.213 msecs

The can't reserve line is curious. Shortly after that, the panics start.

noisufnoc · Sep 17, 2019

this might be related... https://pve.proxmox.com/wiki/Pci_passthrough#BAR_3:_can.27t_reserve_.5Bmem.5D_error
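One way to dig into the "can't reserve" message is to look at what the host already has mapped inside that memory range in `/proc/iomem`. The sketch below prints the entry for a range plus anything nested under it. `iomem_children` is a made-up helper name, and the idea that something on the host (for example a boot framebuffer) still holds part of the BAR is an assumption, not something confirmed in this thread.

```shell
# iomem_children RANGE: read /proc/iomem text on stdin and print the line for
# RANGE (e.g. d0000000-d3ffffff) followed by every entry nested beneath it.
iomem_children() {
    awk -v range="$1" '
        # Number of leading spaces, which is how /proc/iomem shows nesting.
        function indent(s,   t) { t = s; sub(/^ +/, "", t); return length(s) - length(t) }
        found {
            if (indent($0) <= depth) exit   # back at the same or an outer level: stop
            print
            next
        }
        index($0, range " : ") { found = 1; depth = indent($0); print }
    '
}
```

Run as root (otherwise the addresses in `/proc/iomem` show up as zeros), e.g. `iomem_children d0000000-d3ffffff < /proc/iomem`. Any nested entry that is not the GPU itself would be the thing to investigate.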
