q35 guest Win10 won't start, AMD Threadripper 3970x

Try without the pcie_acs_override and list the IOMMU groups then. Also, you have the 'hostpci' device configured 4 times in your config, one time is enough if you enable "All functions" in the GUI.
Without pcie_acs_override, the IOMMU grouping is the same:
Code:
IOMMU Group 28:
        21:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        21:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
        21:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Controller [10de:1ad6] (rev a1)
        21:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller [10de:1ad7] (rev a1)
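
For reference, a listing like the one above can be produced straight from sysfs. This is a minimal sketch, not the command the poster used; the optional root argument is an assumption added only so the function can be pointed at a test directory.

```shell
#!/bin/sh
# Print every PCI device together with its IOMMU group number.
# $1 is the sysfs root (defaults to /sys); the parameter is hypothetical
# and exists only to make the function testable.
list_iommu_groups() {
    root="${1:-/sys}"
    for dev in "$root"/kernel/iommu_groups/*/devices/*; do
        [ -e "$dev" ] || continue      # glob matched nothing
        group="${dev%/devices/*}"      # strip the device part
        group="${group##*/}"           # keep only the group number
        printf 'IOMMU Group %s: %s\n' "$group" "${dev##*/}"
    done
}

list_iommu_groups
```

Piping each address through lspci -nns adds the human-readable device names.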


"have the 'hostpci' device configured 4 times in your config"
I thought four entries were needed because IOMMU group 28 contains four devices (all belonging to the NVIDIA 2080 Ti).
(I followed this guide, where the author added two devices with "All functions" checked for both: https://www.reddit.com/r/homelab/comments/b5xpua/the_ultimate_beginners_guide_to_gpu_passthrough/)
As in the screenshot below, I had "All functions" checked for all four devices.
I added them in the GUI, not via the CLI.
20200831_gui_add_pci_allJPG.JPG

If I add just the first one and check "All functions", I still get the same error: group 28 is not viable.
The screenshot below shows only the first PCI passthrough device configured.
20200831_gui_add_pci_just_one.JPG
With only one PCI passthrough device:
Code:
qm showcmd 500 --pretty > /tmp/qm.sh
root@mars:~# cat /tmp/qm.sh
/usr/bin/kvm \
  -id 500 \
  -name Win10Edu-1909-2080Ti \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/500.qmp,server,nowait' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/500.pid \
  -daemonize \
  -smbios 'type=1,uuid=51b7e8ff-ae21-45e2-942a-c236379bbd62' \
  -drive 'if=pflash,unit=0,format=raw,readonly,file=/usr/share/pve-edk2-firmware//OVMF_CODE.fd' \
  -drive 'if=pflash,unit=1,format=raw,id=drive-efidisk0,size=131072,file=/dev/zvol/nvmepool/vm-500-disk-1' \
  -smp '32,sockets=1,cores=32,maxcpus=32' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vga none \
  -nographic \
  -no-hpet \
  -cpu 'host,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+ibpb,kvm=off,+kvm_pv_eoi,+kvm_pv_unhalt,-pcid' \
  -m 32768 \
  -object 'memory-backend-ram,id=ram-node0,size=32768M' \
  -numa 'node,nodeid=0,cpus=0-31,memdev=ram-node0' \
  -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg \
  -device 'vmgenid,guid=c1d954f9-87b9-47cb-83f4-7a5505137199' \
  -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' \
  -device 'vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on' \
  -device 'vfio-pci,host=0000:21:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1' \
  -device 'vfio-pci,host=0000:21:00.2,id=hostpci0.2,bus=ich9-pcie-port-1,addr=0x0.2' \
  -device 'vfio-pci,host=0000:21:00.3,id=hostpci0.3,bus=ich9-pcie-port-1,addr=0x0.3' \
  -chardev 'socket,path=/var/run/qemu-server/500.qga,server,nowait,id=qga0' \
  -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
  -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
  -iscsi 'initiator-name=iqn.1993-08.org.debian:01:baecad992489' \
  -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' \
  -drive 'file=/dev/zvol/nvmepool/vm-500-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' \
  -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=100' \
  -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
  -drive 'file=/mnt/pve/network-neoproxmox/template/iso/en_windows_10_consumer_editions_version_1909_x64_dvd_be09950e.iso,if=none,id=drive-sata2,media=cdrom,aio=threads' \
  -device 'ide-cd,bus=ahci0.2,drive=drive-sata2,id=sata2,bootindex=200' \
  -drive 'file=/mnt/pve/network-neoproxmox/template/iso/virtio-win-0.1.185.iso,if=none,id=drive-sata3,media=cdrom,aio=threads' \
  -device 'ide-cd,bus=ahci0.3,drive=drive-sata3,id=sata3,bootindex=201' \
  -netdev 'type=tap,id=net0,ifname=tap500i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
  -device 'virtio-net-pci,mac=3E:80:D5:FD:81:D2,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' \
  -rtc 'driftfix=slew,base=localtime' \
  -machine 'type=q35+pve0' \
  -global 'kvm-pit.lost_tick_policy=discard' \
  -machine 'type=q35,kernel_irqchip=on'
root@mars:~# sh /tmp/qm.sh
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:21:00.0: group 28 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.
 
What does 'lspci -nnk' say? Are they actually bound to the vfio driver? If not, see here on how to configure modprobe.d (don't forget to rebuild your initramfs afterwards and reboot).
 

lspci -nnk full output attached
Code:
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        Subsystem: PNY TU102 [GeForce RTX 2080 Ti] [196e:134e]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
21:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
        Subsystem: PNY TU102 High Definition Audio Controller [196e:134e]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
21:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Controller [10de:1ad6] (rev a1)
        Subsystem: PNY TU102 USB 3.1 Controller [196e:134e]
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
21:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller [10de:1ad7] (rev a1)
        Subsystem: PNY TU102 UCSI Controller [196e:134e]
        Kernel driver in use: nvidia-gpu
        Kernel modules: i2c_nvidia_gpu
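
The "not viable" message concerns the whole group, and in the listing above 21:00.2 and 21:00.3 are still bound to xhci_hcd and nvidia-gpu. A small sketch for checking this per group follows. It treats a device with no driver as acceptable, which matches my understanding of the kernel's viability rule but should be taken as an assumption; the function name and the optional sysfs root argument are hypothetical.

```shell
#!/bin/sh
# Print the bound driver of every device in one IOMMU group.
# Returns non-zero if any device is bound to something other than vfio-pci.
# $1: group number, $2: sysfs root (defaults to /sys, only for testing).
check_group_viable() {
    group="$1"
    root="${2:-/sys}"
    rc=0
    for dev in "$root/kernel/iommu_groups/$group/devices"/*; do
        [ -e "$dev" ] || continue
        if [ -L "$dev/driver" ]; then
            # The driver entry is a symlink whose last component is the driver name.
            drv="$(basename "$(readlink "$dev/driver")")"
        else
            drv="(none)"
        fi
        printf '%s %s\n' "${dev##*/}" "$drv"
        case "$drv" in
            vfio-pci|"(none)") ;;
            *) rc=1 ;;
        esac
    done
    return "$rc"
}
```

Running check_group_viable 28 on the host lists the four functions and exits non-zero as long as one of them is still claimed by a host driver.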


I had /etc/modprobe.d/vfio.conf set to this from the very beginning
Code:
lspci -n -s 21:00
21:00.0 0300: 10de:1e04 (rev a1)
21:00.1 0403: 10de:10f7 (rev a1)
21:00.2 0c03: 10de:1ad6 (rev a1)
21:00.3 0c80: 10de:1ad7 (rev a1)

Code:
/etc/modprobe.d/vfio.conf (all four parts were added)
options vfio-pci ids=10de:1e04,10de:10f7,10de:1ad6,10de:1ad7 disable_vga=1
 

Hm, the xhci driver is usually fine to be unbound, but I'm not so sure about the nvidia proprietary one. Can you try removing that driver from your host system? Or otherwise manually unassign all subdevices from their bound driver (echo "0000:21:00.X" > /sys/bus/pci/devices/0000:21:00.X/driver/unbind with X for all 4 subdevices), and then check lspci -nnk again.
 

Code:
echo "0000:21:00.0" > /sys/bus/pci/devices/0000:21:00.0/driver/unbind
echo "0000:21:00.1" > /sys/bus/pci/devices/0000:21:00.1/driver/unbind
echo "0000:21:00.2" > /sys/bus/pci/devices/0000:21:00.2/driver/unbind
echo "0000:21:00.3" > /sys/bus/pci/devices/0000:21:00.3/driver/unbind

Still the same. I also rebooted and re-ran lspci, and the output was still the same.
Code:
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        Subsystem: PNY TU102 [GeForce RTX 2080 Ti] [196e:134e]
        Kernel modules: nvidiafb, nouveau
21:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
        Subsystem: PNY TU102 High Definition Audio Controller [196e:134e]
        Kernel modules: snd_hda_intel
21:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Controller [10de:1ad6] (rev a1)
        Subsystem: PNY TU102 USB 3.1 Controller [196e:134e]
        Kernel modules: xhci_pci
21:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller [10de:1ad7] (rev a1)
        Subsystem: PNY TU102 UCSI Controller [196e:134e]
        Kernel modules: i2c_nvidia_gpu

Immediately after unbinding, starting the VM again gives "No such file or directory" instead of "not viable":
Code:
sh /tmp/qm.sh
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:21:00.0: failed to open /dev/vfio/28: No such file or directory
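
The "No such file or directory" error is consistent with every function having been unbound and none re-attached to vfio-pci, since the /dev/vfio/28 node should only exist while a device in the group is bound to a VFIO driver. A hedged sketch of rebinding one function by hand through the sysfs driver_override interface is shown here. The function name and the optional sysfs root argument are assumptions for illustration, and the vfio-pci module is assumed to be loaded already.

```shell
#!/bin/sh
# Rebind one PCI function to vfio-pci.
# $1: PCI address such as 0000:21:00.2
# $2: sysfs root (defaults to /sys; hypothetical, only for testing)
bind_vfio() {
    addr="$1"
    root="${2:-/sys}"
    dev="$root/bus/pci/devices/$addr"
    [ -d "$dev" ] || { echo "no such device: $addr" >&2; return 1; }
    # Release the current driver, if there is one.
    if [ -e "$dev/driver/unbind" ]; then
        echo "$addr" > "$dev/driver/unbind"
    fi
    # Restrict the device to vfio-pci, then ask the PCI core to re-probe it.
    echo vfio-pci > "$dev/driver_override"
    echo "$addr" > "$root/bus/pci/drivers_probe"
}
```

Run as root, something like `for f in 0 1 2 3; do bind_vfio "0000:21:00.$f"; done` would cover all four functions before launching the saved QEMU command.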

After a reboot, it is still not viable:
Code:
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:21:00.0: failed to open /dev/vfio/28: No such file or directory
 
Does it also fail if you now start it via the GUI/via 'qm start <vmid>' or only if you do the QEMU command manually? We do assign the vfio-pci driver automatically in our VM start code, which of course doesn't run when you do the 'showcmd'/'sh ...' method.

If that's also not it, I'm honestly at a loss, might just be something wrong with your specific hardware config...
 

Thanks for sticking with me on the diagnosis.
qm start works now without any change (I didn't unbind before starting the VM).

Code:
qm start 500
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: Failed to mmap 0000:21:00.0 BAR 3. Performance may be slow

Now I encounter the famous Code 43 error in Device Manager for the RTX 2080 Ti after installing the NVIDIA driver.
 
Code 43: See this https://forum.proxmox.com/threads/nvidia-gtx-1050ti-error-43-code-43.75553/

I'm not sure it works with that new card. Make sure to pick the right ROM (best is to extract it with GPU-Z).

Thanks.
The GitHub ROM patcher only works for 1xxx GPUs, not RTX 2xxx.
I can extract the ROM using GPU-Z in the Win10 guest system.

I put the extracted (original, unpatched) ROM in /usr/share/kvm/ and added romfile=rtx2080ti.rom to the 500.conf file.
It still gives me the Code 43 error.
Also, the output is the same:
Code:
qm start 500
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on,romfile=/usr/share/kvm/rtx2080ti.rom: Failed to mmap 0000:21:00.0 BAR 3. Performance may be slow

BTW, the physical monitor connected to the 2080 Ti doesn't get a display signal.
I access the Win10 guest using RDP (the VNC console on PVE is disabled, display = none).

I only have two x16 slots (PCIE#1 and #3).
PCIE#3 is occupied by a PCIe 4x M.2 SSD card in RAID-Z2 that hosts all VMs.

I have two other PCIe x8 slots (single slot). I would need a riser cable (which I currently don't have) to move the 2080 Ti off the primary PCIE#1 slot.
 

I think I found something:

https://github.com/Matoking/NVIDIA-vBIOS-VFIO-Patcher/pull/11 <-- it SHOULD work with your 2080

https://www.reddit.com/r/VFIO/comments/9x1x10/anyone_has_any_experience_with_passing_through/

At your own risk...

EDIT: It would be great if that works

In the Reddit post, the 2080 Ti was said to be working in the secondary slot.

Use original, unpatched rom
I disabled the GPU in Device Manager in Win10 and re-enabled it. For a brief time the exclamation mark disappeared and the device showed as working properly.
Then I restarted the VM and got Code 43 again.

Use patched rom
Patcher worked. romfile=rtx2080ti_patched.com
Now I get display output on the physical monitor connected to the 2080 Ti.
But I can see black blocks of the PVE command line (I can see my keyboard input) overlaid on the Win10 guest.
The resolution is 800x600, which suggests that the driver is not working.

Disabling and re-enabling the device in Device Manager doesn't fix the problem anymore.

I do have quiet amd_iommu=on iommu=pt nofb nomodeset video=vesafb:off,efifb:off in /etc/kernel/cmdline
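
Since edits to /etc/kernel/cmdline only matter once the running kernel was actually booted with them, it can be worth comparing against /proc/cmdline. This is a minimal sketch with a hypothetical helper name; the parameter list passed to it is whatever you expect to be active.

```shell
#!/bin/sh
# Report which expected parameters are absent from a kernel command line.
# $1: file holding the command line (e.g. /proc/cmdline)
# remaining arguments: parameters that should be present
missing_params() {
    cmdline_file="$1"
    shift
    # Pad with spaces so every parameter can be matched as a whole word.
    line=" $(cat "$cmdline_file") "
    rc=0
    for want in "$@"; do
        case "$line" in
            *" $want "*) ;;
            *) echo "missing: $want"; rc=1 ;;
        esac
    done
    return "$rc"
}
```

For the settings quoted above, `missing_params /proc/cmdline amd_iommu=on iommu=pt video=vesafb:off,efifb:off` prints nothing when all three are active.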
 
Hmm, do you have the nouveau driver blacklisted on the Linux side?

I did this
Code:
echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf

Unless the blacklisting is not taking effect (I did run update-initramfs -u -k all).

as well as
Code:
echo "vfio" > \
  /etc/modules-load.d/vfio.conf
echo "vfio_iommu_type1" >> \
  /etc/modules-load.d/vfio.conf
echo "vfio_pci" >> \
  /etc/modules-load.d/vfio.conf
echo "vfio_virqfd" >> \
  /etc/modules-load.d/vfio.conf
echo "options vfio-pci ids=10de:1e04,10de:10f7" > \
  /etc/modprobe.d/vfio.conf

The "Reading all physical volumes..." line is typically the last thing printed via the UEFI framebuffer before the GPU driver would kick in.
This indicates that the host is not using the framebuffer anymore. The monitor goes to sleep after a while due to no signal.

Then, when I launch the VM, the monitor lights up and shows the black command-line blocks overlaid on Win10 (and Code 43).
 
YES!!!

I solved the BAR 3 error and the Code 43 error using a tip from an UNRAID VM thread.

Code:
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

These three lines of code, together with the original (unpatched) ROM, fixed the problem.
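
Writes to sysfs like these do not survive a reboot, so they have to be repeated before the VM starts. Below is a slightly more defensive sketch of the same three writes, which skips entries that are not present. The function name and the optional sysfs root argument are assumptions added for illustration and testing.

```shell
#!/bin/sh
# Detach the host virtual consoles and the EFI framebuffer from the GPU.
# $1: sysfs root (defaults to /sys; hypothetical, only for testing)
release_host_console() {
    root="${1:-/sys}"
    # Unbind every virtual console that exists (vtcon0, vtcon1, ...).
    for vt in "$root"/class/vtconsole/vtcon*/bind; do
        [ -e "$vt" ] || continue
        echo 0 > "$vt"
    done
    fb="$root/bus/platform/drivers/efi-framebuffer"
    # Only unbind the framebuffer if it is currently bound.
    if [ -e "$fb/efi-framebuffer.0" ]; then
        echo efi-framebuffer.0 > "$fb/unbind"
    fi
}
```

Calling release_host_console as root right before qm start 500 reproduces the effect of the three commands above.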
 
I think it also works with /etc/sysctl.d/passthrough.conf.

Just write the paths in dotted form:

class.vtconsole.vtcon0.bind = 0
class.vtconsole.vtcon1.bind = 0
bus.platform.drivers.efi-framebuffer.unbind = efi-framebuffer.0