VM will not restart w/out rebooting host

Jan 5, 2020
I got a Windows 10 VM to boot with PCIe passthrough (GPU), along with USB passthrough of a keyboard and mouse. It only works right after the host has been rebooted. If the VM stops or shuts down, then when I try to start it again I receive: Error: start failed ... got timeout.

Hardware
ROG Strix TRX40-E (FYI, no onboard video)
Threadripper 3960X
64 GB RAM
Sapphire Pulse 5700 XT
various hard drives

Installed PVE 6.1. I had to add mce=off to the GRUB kernel command line to get past the boot hang caused by the Threadripper.
/etc/default/grub
Code:
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt amd_iommu=on video=efifb:off mce=off"
GRUB_CMDLINE_LINUX=""

# Disable os-prober, it might add menu entries for each guest
GRUB_DISABLE_OS_PROBER=true

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Disable generation of recovery mode menu entries
GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"
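
The parameters in GRUB_CMDLINE_LINUX_DEFAULT only take effect after update-grub and a reboot, so it can be worth confirming that the running kernel actually received them. A minimal sketch, assuming a POSIX shell on the host; has_kernel_param is a made-up helper name, not a Proxmox tool:

```shell
# Hypothetical helper: succeeds if the given kernel command line
# contains the exact parameter (whole-word match, not a substring).
has_kernel_param() {
    cmdline="$1"
    param="$2"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the host (after update-grub and a reboot):
#   for p in amd_iommu=on iommu=pt video=efifb:off mce=off; do
#       has_kernel_param "$(cat /proc/cmdline)" "$p" || echo "missing: $p"
#   done
```

If any parameter is reported missing, the edit to /etc/default/grub has probably not been applied to the boot entry that is actually in use.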

/etc/modules
Code:
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.

vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

/etc/modprobe.d/blacklist.conf
Code:
blacklist radeon
blacklist nouveau
blacklist nvidia

I checked lspci -v to find the card, then lspci -n -s 03:00 for the IDs.
/etc/modprobe.d/vfio.conf
Code:
options vfio-pci ids=1002:731f,1002:ab38 disable_vga=1

Then I ran update-initramfs -u
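
After the initramfs update and a reboot, one way to check that the ids= option took effect is to look at which kernel driver is bound to the card. This is only a sketch: bound_driver is a hypothetical helper that parses the "Kernel driver in use:" line printed by lspci -k. Note that a Navi card is normally driven by amdgpu, which the blacklist above does not cover, so seeing amdgpu here instead of vfio-pci would suggest the card was claimed by the host driver first.

```shell
# Hypothetical helper: reads `lspci -nnk` output on stdin and prints
# the name of each bound kernel driver, one per line.
bound_driver() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Usage on the host (03:00 is the GPU slot from this post):
#   lspci -nnk -s 03:00 | bound_driver
```

Both functions of the card (video and audio) should report vfio-pci before the VM is started.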

dmesg | grep kvm
Code:
[   10.517968] kvm: Nested Virtualization enabled
[   10.518030] kvm: Nested Paging enabled

pveversion -v
Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.10-1-pve)
pve-manager: 6.1-3 (running version: 6.1-3/37248ce6)
pve-kernel-5.3: 6.0-12
pve-kernel-helper: 6.0-12
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-2
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-14
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191002-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-2
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

Created a Windows 10 VM
using the disc image file virtio-win-0.1.171

/etc/pve/qemu-server/100.conf
Code:
agent: 1
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'
balloon: 0
bios: ovmf
bootdisk: virtio0
cores: 8
cpu: host,hidden=1,flags=+pcid
efidisk0: local-lvm:vm-100-disk-1,size=128K
hostpci0: 03:00,pcie=1,x-vga=1
ide2: local:iso/virtio-win-0.1.173.iso,media=cdrom,size=385062K
machine: q35
memory: 10240
name: his
net0: virtio=4A:5C:E7:BA:18:3F,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=9e1a83cb-3251-4660-b6d0-069b254b9eb1
sockets: 1
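
Note that this config sets CPU options in two places, the args: line and the cpu: line, and the failed start command in this post does contain two -cpu options as a result. Whether that matters for the restart problem is not established here, but it is easy to check what the generated command looks like. A small sketch, with count_cpu_options as a made-up helper name:

```shell
# Hypothetical helper: counts the "-cpu" options in a QEMU command
# line read from stdin.
count_cpu_options() {
    tr ' ' '\n' | grep -c -x -- '-cpu'
}

# Usage on the host (100 is the VM ID from this post):
#   qm showcmd 100 | count_cpu_options
```

A result above 1 means the args: line and the cpu: line are both contributing a -cpu option.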

When I start the VM, it works: the monitor shows Windows 10 and the keyboard and mouse are active. I installed the video card drivers and the QEMU drivers from the virtio ISO. When I shut down the VM, either from the Proxmox GUI or from the Windows 10 Start menu, it shuts down with no error. However, when I try to start the VM again I get a failed start error:
Code:
TASK ERROR: start failed: command '/usr/bin/kvm -id 100 -name his -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=9e1a83cb-3251-4660-b6d0-069b254b9eb1' -drive 'if=pflash,unit=0,format=raw,readonly,file=/usr/share/pve-edk2-firmware//OVMF_CODE.fd' -drive 'if=pflash,unit=1,format=raw,id=drive-efidisk0,file=/dev/pve/vm-100-disk-1' -smp '8,sockets=1,cores=8,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga none -nographic -no-hpet -cpu 'host,+pcid,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=proxmox,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,hv_ipi,kvm=off' -m 10240 -device 'vmgenid,guid=a224fca7-b1f0-43d8-bd0f-1873b4d41b36' -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'nec-usb-xhci,id=xhci,bus=pci.1,addr=0x1b' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:03:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on' -device 'vfio-pci,host=0000:03:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1' -device 'usb-host,bus=xhci.0,hostbus=7,hostport=1.2,id=usb0' -device 'usb-host,bus=xhci.0,hostbus=7,hostport=1.3,id=usb1' -chardev 'socket,path=/var/run/qemu-server/100.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:d341574f94eb' -drive 'file=/var/lib/vz/template/iso/virtio-win-0.1.173.iso,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -drive 'file=/dev/pve/vm-100-disk-0,if=none,id=drive-virtio0,cache=writeback,format=raw,aio=threads,detect-zeroes=on' -device 
'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=4A:5C:E7:BA:18:3F,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -machine 'type=q35+pve1' -global 'kvm-pit.lost_tick_policy=discard' -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'' failed: got timeout

If I reboot the host, I can start the VM without error.
I've tried with and without the QEMU guest agent enabled in the GUI Options tab.
I've tried with and without a Display in the GUI Hardware tab.
After a failed start, I've tried using the command line:
qm showcmd 100 | bash
It appears to start without error and the VM answers pings, but it does not display on the monitor or respond to Remote Desktop. Using the GUI to stop it is successful.
I've also tried
rm /var/lock/qemu-server/lock-100.conf
I installed the QEMU agent in the VM, then
qm agent 100 ping
returned me to the command line without an error.
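
Another thing that may help narrow this down is the host kernel log right after a failed start, since the task error itself only says "got timeout". A rough sketch, assuming dmesg is readable on the host; vfio_log_lines is a hypothetical helper, and which messages show up (if any) will depend on the card and kernel:

```shell
# Hypothetical helper: reads kernel log text on stdin and prints the
# lines that mention vfio or the given PCI address (default 03:00).
vfio_log_lines() {
    addr="${1:-03:00}"
    grep -i -F -e vfio -e "$addr"
}

# Usage on the host, right after a failed start:
#   dmesg | vfio_log_lines 03:00
```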

If I can correct this, I plan to pass at least one more GPU to a second Windows 10 VM, along with USB peripherals. I haven't tried that yet because this one is not working properly.

Thank you.
 
Hypothetically... if I'm not savvy enough to compile and patch the kernel myself, it appears I have two options. Are there any other choices?
1. Buy an NVIDIA GPU
2. Switch to VMware ESXi
3. ?
 
I am experiencing this issue with an NVIDIA GPU on a machine that is part of a cluster.
 
My problem was due to the Navi reset bug for AMD GPUs. I solved it with the solution here:
https://forum.level1techs.com/t/navi-reset-kernel-patch/147547/134
Patching the kernel worked in my situation, with multiple GPUs installed.
I would not expect this to work for NVIDIA GPUs, and I would not advise you to try it.
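
For anyone checking their own card: the kernel exposes a reset file in sysfs for PCI functions it knows how to reset. The sketch below only tests for that file. has_reset_method is a made-up helper, and the presence of the file only means the kernel has some reset method for the device, not that the card comes back in a usable state afterwards, so it cannot confirm or rule out this bug on its own.

```shell
# Hypothetical helper: succeeds if sysfs exposes a reset file for the
# given PCI function. $1: full PCI address such as 0000:03:00.0.
# $2: sysfs device directory (defaults to the real one).
has_reset_method() {
    dev="$1"
    sysfs="${2:-/sys/bus/pci/devices}"
    [ -e "$sysfs/$dev/reset" ]
}

# Usage on the host:
#   has_reset_method 0000:03:00.0 && echo "reset file present"
```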

Any chance you can upload your .deb, assuming it's for kernel version 5.4.34-1-pve?

I'm trying to recompile right now with that patch, but I'm getting hung up on zfsonlinux, which doesn't want to build.
 