q35 guest Win10 won't start, AMD Threadripper 3970x

Try without the pcie_acs_override and list the IOMMU groups then. Also, you have the 'hostpci' device configured 4 times in your config, one time is enough if you enable "All functions" in the GUI.
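
For reference, with "All functions" enabled you end up with a single hostpci line covering the whole device — roughly like this (just a sketch, assuming VMID 500 and the GPU at 21:00):
Code:
# excerpt from /etc/pve/qemu-server/500.conf
hostpci0: 21:00,pcie=1
(add x-vga=1 if the card is meant to be the guest's primary display)
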
Without pcie_acs_override, the IOMMU grouping is the same:
Code:
IOMMU Group 28:
        21:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        21:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
        21:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Controller [10de:1ad6] (rev a1)
        21:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller [10de:1ad7] (rev a1)
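
A listing like the one above can be produced with a small loop over /sys/kernel/iommu_groups — just a sketch:
Code:
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d%/devices/*}; g=${g##*/}      # IOMMU group number
    printf 'IOMMU Group %s: ' "$g"
    lspci -nns "${d##*/}"              # PCI address of the device
done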


"have the 'hostpci' device configured 4 times in your config"
I thought the four entries were needed because there are four devices in IOMMU group 28 (all from the NVIDIA 2080 Ti).
(I followed a guide here; the author added two entries and had "All functions" checked: https://www.reddit.com/r/homelab/comments/b5xpua/the_ultimate_beginners_guide_to_gpu_passthrough/)
See the screenshot below, with "All functions" checked for all four devices.
I added them in the GUI, not via the CLI.
20200831_gui_add_pci_allJPG.JPG

If I add just the first one and check "All functions", it still fails with the same error: group 28 is not viable.
The screenshot below shows only the first PCI device passed through.
20200831_gui_add_pci_just_one.JPG
With only one PCI passthrough entry:
Code:
qm showcmd 500 --pretty > /tmp/qm.sh
root@mars:~# cat /tmp/qm.sh
/usr/bin/kvm \
  -id 500 \
  -name Win10Edu-1909-2080Ti \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/500.qmp,server,nowait' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/500.pid \
  -daemonize \
  -smbios 'type=1,uuid=51b7e8ff-ae21-45e2-942a-c236379bbd62' \
  -drive 'if=pflash,unit=0,format=raw,readonly,file=/usr/share/pve-edk2-firmware//OVMF_CODE.fd' \
  -drive 'if=pflash,unit=1,format=raw,id=drive-efidisk0,size=131072,file=/dev/zvol/nvmepool/vm-500-disk-1' \
  -smp '32,sockets=1,cores=32,maxcpus=32' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vga none \
  -nographic \
  -no-hpet \
  -cpu 'host,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+ibpb,kvm=off,+kvm_pv_eoi,+kvm_pv_unhalt,-pcid' \
  -m 32768 \
  -object 'memory-backend-ram,id=ram-node0,size=32768M' \
  -numa 'node,nodeid=0,cpus=0-31,memdev=ram-node0' \
  -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg \
  -device 'vmgenid,guid=c1d954f9-87b9-47cb-83f4-7a5505137199' \
  -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' \
  -device 'vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on' \
  -device 'vfio-pci,host=0000:21:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1' \
  -device 'vfio-pci,host=0000:21:00.2,id=hostpci0.2,bus=ich9-pcie-port-1,addr=0x0.2' \
  -device 'vfio-pci,host=0000:21:00.3,id=hostpci0.3,bus=ich9-pcie-port-1,addr=0x0.3' \
  -chardev 'socket,path=/var/run/qemu-server/500.qga,server,nowait,id=qga0' \
  -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
  -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
  -iscsi 'initiator-name=iqn.1993-08.org.debian:01:baecad992489' \
  -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' \
  -drive 'file=/dev/zvol/nvmepool/vm-500-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' \
  -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=100' \
  -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
  -drive 'file=/mnt/pve/network-neoproxmox/template/iso/en_windows_10_consumer_editions_version_1909_x64_dvd_be09950e.iso,if=none,id=drive-sata2,media=cdrom,aio=threads' \
  -device 'ide-cd,bus=ahci0.2,drive=drive-sata2,id=sata2,bootindex=200' \
  -drive 'file=/mnt/pve/network-neoproxmox/template/iso/virtio-win-0.1.185.iso,if=none,id=drive-sata3,media=cdrom,aio=threads' \
  -device 'ide-cd,bus=ahci0.3,drive=drive-sata3,id=sata3,bootindex=201' \
  -netdev 'type=tap,id=net0,ifname=tap500i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
  -device 'virtio-net-pci,mac=3E:80:D5:FD:81:D2,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' \
  -rtc 'driftfix=slew,base=localtime' \
  -machine 'type=q35+pve0' \
  -global 'kvm-pit.lost_tick_policy=discard' \
  -machine 'type=q35,kernel_irqchip=on'
root@mars:~# sh /tmp/qm.sh
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:21:00.0: group 28 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.
 
What does 'lspci -nnk' say? Are they actually bound to the vfio driver? If not, see here on how to configure modprobe.d (don't forget to rebuild your initramfs afterwards and reboot).
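
For example, something along these lines (a sketch using the IDs from your IOMMU group 28 — adjust to your devices):
Code:
# /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1e04,10de:10f7,10de:1ad6,10de:1ad7 disable_vga=1

# then rebuild the initramfs and reboot
update-initramfs -u -k all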
 

Full 'lspci -nnk' output attached; the relevant part:
Code:
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        Subsystem: PNY TU102 [GeForce RTX 2080 Ti] [196e:134e]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
21:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
        Subsystem: PNY TU102 High Definition Audio Controller [196e:134e]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
21:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Controller [10de:1ad6] (rev a1)
        Subsystem: PNY TU102 USB 3.1 Controller [196e:134e]
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
21:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller [10de:1ad7] (rev a1)
        Subsystem: PNY TU102 UCSI Controller [196e:134e]
        Kernel driver in use: nvidia-gpu
        Kernel modules: i2c_nvidia_gpu


I have had /etc/modprobe.d/vfio.conf set up with these IDs from the very beginning:
Code:
lspci -n -s 21:00
21:00.0 0300: 10de:1e04 (rev a1)
21:00.1 0403: 10de:10f7 (rev a1)
21:00.2 0c03: 10de:1ad6 (rev a1)
21:00.3 0c80: 10de:1ad7 (rev a1)

Code:
# /etc/modprobe.d/vfio.conf (all four IDs were added)
options vfio-pci ids=10de:1e04,10de:10f7,10de:1ad6,10de:1ad7 disable_vga=1
 

Attachments

  • 20200901_lspci_-nnk.txt
    16.9 KB · Views: 0
Hm, the xhci driver is usually fine to be unbound, but I'm not so sure about the nvidia proprietary one. Can you try removing that driver from your host system? Or otherwise manually unassign all subdevices from their bound driver (echo "0000:21:00.X" > /sys/bus/pci/devices/0000:21:00.X/driver/unbind with X for all 4 subdevices), and then check lspci -nnk again.
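
If unbinding alone doesn't help, a manual rebind to vfio-pci would look roughly like this for one function (a sketch; vfio-pci must already be loaded, repeat for .0 through .3):
Code:
# release the function from its current driver
echo "0000:21:00.2" > /sys/bus/pci/devices/0000:21:00.2/driver/unbind
# tell the kernel to prefer vfio-pci for it, then bind it
echo "vfio-pci" > /sys/bus/pci/devices/0000:21:00.2/driver_override
echo "0000:21:00.2" > /sys/bus/pci/drivers/vfio-pci/bind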
 

Code:
echo "0000:21:00.0" > /sys/bus/pci/devices/0000:21:00.0/driver/unbind
echo "0000:21:00.1" > /sys/bus/pci/devices/0000:21:00.1/driver/unbind
echo "0000:21:00.2" > /sys/bus/pci/devices/0000:21:00.2/driver/unbind
echo "0000:21:00.3" > /sys/bus/pci/devices/0000:21:00.3/driver/unbind

Still the same. I also did a reboot and re-ran lspci — same lspci output:
Code:
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        Subsystem: PNY TU102 [GeForce RTX 2080 Ti] [196e:134e]
        Kernel modules: nvidiafb, nouveau
21:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
        Subsystem: PNY TU102 High Definition Audio Controller [196e:134e]
        Kernel modules: snd_hda_intel
21:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Controller [10de:1ad6] (rev a1)
        Subsystem: PNY TU102 USB 3.1 Controller [196e:134e]
        Kernel modules: xhci_pci
21:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller [10de:1ad7] (rev a1)
        Subsystem: PNY TU102 UCSI Controller [196e:134e]
        Kernel modules: i2c_nvidia_gpu

Immediately after the unbind, starting the VM again gives "No such file or directory" instead of "not viable":
Code:
sh /tmp/qm.sh
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:21:00.0: failed to open /dev/vfio/28: No such file or directory

After a reboot, it is still not viable:
Code:
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:21:00.0: failed to open /dev/vfio/28: No such file or directory
 
Does it also fail if you now start it via the GUI/via 'qm start <vmid>' or only if you do the QEMU command manually? We do assign the vfio-pci driver automatically in our VM start code, which of course doesn't run when you do the 'showcmd'/'sh ...' method.

If that's also not it, I'm honestly at a loss, might just be something wrong with your specific hardware config...
 

Thanks for continuing to diagnose this with me.
qm start works now without any changes (I didn't unbind anything before starting the VM).

Code:
qm start 500
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: Failed to mmap 0000:21:00.0 BAR 3. Performance may be slow

Now I hit the famous code 43 error in Device Manager for the RTX 2080 Ti after installing the NVIDIA driver.
 
Code 43: see this thread: https://forum.proxmox.com/threads/nvidia-gtx-1050ti-error-43-code-43.75553/

I'm not sure it works with that new card; make sure you pick the right ROM (best is to extract it with GPU-Z).
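
If extracting from inside the guest is awkward, the ROM can sometimes also be dumped on the host via sysfs — a sketch (it may fail or return a truncated image depending on the card's state; the target filename is just an example):
Code:
cd /sys/bus/pci/devices/0000:21:00.0
echo 1 > rom                             # make the ROM readable
cat rom > /usr/share/kvm/rtx2080ti.rom
echo 0 > rom                             # disable reading again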

Thanks.
The GitHub ROM patcher only works for GTX 1xxx GPUs, not the RTX 2xxx series.
I can extract the ROM using GPU-Z in the Win10 guest system.

I put the extracted (original, unpatched) ROM in /usr/share/kvm/ and added romfile=rtx2080ti.rom to the 500.conf file.
It still gives me the code 43 error.
Also, the same output:
Code:
qm start 500
kvm: -device vfio-pci,host=0000:21:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on,romfile=/usr/share/kvm/rtx2080ti.rom: Failed to mmap 0000:21:00.0 BAR 3. Performance may be slow
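
For reference, the romfile option sits on the hostpci line in 500.conf, roughly like this (a sketch — keep whatever other options you already have there; the file is looked up in /usr/share/kvm/):
Code:
hostpci0: 21:00,pcie=1,romfile=rtx2080ti.rom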

BTW, the physical monitor connected to the 2080 Ti gets no display signal.
I access the Win10 guest via RDP (the VNC console on PVE is disabled, display = none).

I only have two x16 slots (PCIe #1 and #3).
PCIe #3 is occupied by a PCIe card with 4x M.2 SSDs in RAID-Z2 hosting all the VMs.

I have two other PCIe x8 (single-width) slots. I would need a riser cable (which I don't currently have) to move the 2080 Ti off the primary PCIe #1 slot.
 

Attachments

  • 20200915_500.txt
    1.5 KB · Views: 1
I think I found something:

https://github.com/Matoking/NVIDIA-vBIOS-VFIO-Patcher/pull/11 <-- it SHOULD work with your 2080

https://www.reddit.com/r/VFIO/comments/9x1x10/anyone_has_any_experience_with_passing_through/

At your own risk...

EDIT: It would be great if that works

In the Reddit post, the 2080 Ti was said to be working in the secondary slot.

Using the original, unpatched ROM:
I disabled the GPU in Device Manager in Win10 and re-enabled it; for a brief time the exclamation mark disappeared and it showed the device working properly.
Then I restarted the VM and got code 43 again.

Using the patched ROM:
The patcher worked. romfile=rtx2080ti_patched.rom
Now I get a display on the physical monitor connected to the 2080 Ti.
But I can see black blocks of the PVE command line (I can see my keyboard input) overlaid on the Win10 guest.
The resolution is stuck at 800x600, which suggests the driver is not working.

Disabling/re-enabling it in Device Manager doesn't fix the problem any more.

I do have quiet amd_iommu=on iommu=pt nofb nomodeset video=vesafb:off,efifb:off in /etc/kernel/cmdline
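
(Note that edits to /etc/kernel/cmdline only take effect after refreshing the boot entries and rebooting — on a systemd-boot install that should be something like the following:)
Code:
pve-efiboot-tool refresh
# then reboot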
 
Hmm, do you have the nouveau driver blacklisted on the Linux side?

I did this
Code:
echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf

unless the blacklisting did not actually take effect (I did run update-initramfs -u -k all)

as well as
Code:
echo "vfio" > \
  /etc/modules-load.d/vfio.conf
echo "vfio_iommu_type1" >> \
  /etc/modules-load.d/vfio.conf
echo "vfio_pci" >> \
  /etc/modules-load.d/vfio.conf
echo "vfio_virqfd" >> \
  /etc/modules-load.d/vfio.conf
echo "options vfio-pci ids=10de:1e04,10de:10f7" > \
  /etc/modprobe.d/vfio.conf
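
A quick way to double-check that the blacklist and the vfio binding actually took effect after the initramfs rebuild and a reboot (a sketch):
Code:
lsmod | grep vfio            # are the vfio modules loaded?
lspci -nnk -s 21:00          # does every function show "Kernel driver in use: vfio-pci"?
dmesg | grep -i -e vfio -e nouveau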

The "Reading all physical volumes..." is typically the last thing printed via the UEFI frame buffer before the GPU driver would kick in.
This indicate that host is not using frame buffer anymore. The monitor will go to sleep after a while due to no signal.

Then, when I lauch VM. The monitor will light up and have black cmdline overlay with Win10 (and code43).
 
YES!!!

I solved the BAR 3 error and the code 43 error using a tip from an UNRAID VM thread.

Code:
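# detach the virtual consoles and the EFI framebuffer so they release the GPU before passthrough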
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

These three lines plus the original ROM (not patched) fixed the problem.
 
I think it also works with /etc/sysctl.d/passthrough.conf

Just put them in dotted form:
Code:
class.vtconsole.vtcon0.bind = 0
class.vtconsole.vtcon1.bind = 0
bus.platform.drivers.efi-framebuffer.unbind = efi-framebuffer.0
 
