[SOLVED] Windows 10 VM with PCI passthrough won't start

MrGeezer

Hi,

I created a Windows 10 Pro VM and set up PCI passthrough of my host's graphics card. All was working fine - I installed my CCTV software on the guest, which ran fine, and I played some games via Steam streaming with no problem.

I tried to connect to the guest via RDP this morning and it wouldn't respond, so I rebooted the guest. Couldn't get in, so I used the console to take a look - it appears that it's freezing on boot. After a long wait the Windows troubleshooting menu comes up, but after running it says it cannot repair the Windows installation and shuts down.

Output of pveversion -v and qm config <VMID> as follows:

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.30-2-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-helper: 7.2-2
pve-kernel-5.15: 7.2-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-1
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

Code:
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0;scsi1
cores: 16
cpu: host
efidisk0: local-lvm:vm-101-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:01:00,pcie=1,x-vga=1
ide2: local:iso/virtio-win.iso,media=cdrom,size=519172K
machine: pc-q35-6.2
memory: 32768
meta: creation-qemu=6.2.0,ctime=1656016432
name: WIN10
net0: virtio=8A:0C:63:F3:4A:F4,bridge=vmbr0
numa: 0
ostype: win10
scsi0: local-lvm:vm-101-disk-1,discard=on,iothread=1,size=400G,ssd=1
scsi1: local:iso/virtio-win.iso,media=cdrom,size=519172K
scsihw: virtio-scsi-single
smbios1: uuid=82a17aad-0730-4d2a-8568-17c38392e61c
sockets: 1
vga: qxl
vmgenid: ed4cc8c0-adc4-462c-b7fb-48f7e934acaf

Any idea what I'm doing wrong?

Thanks :)
 
No idea what GPU you used, but not all devices reset properly and you might need to reboot the Proxmox host to get it to work again (it will then work once, until the VM is shut down).
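One way to check this (assuming the device address 01:00.0 from the qm config above, and a 5.15+ kernel) is to ask the kernel which reset methods it has for the GPU:

```shell
# List the reset methods the kernel can use for this PCI device
# (adjust 0000:01:00.0 to match your hostpci0 entry; needs kernel 5.15+).
# An empty result means no clean reset is available, and a host reboot
# may be needed between VM runs.
cat /sys/bus/pci/devices/0000:01:00.0/reset_method

# Alternatively, look for FLReset (function-level reset) in the capabilities:
lspci -vv -s 01:00.0 | grep -i flreset
```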
 
Not suggesting this would be the cause, but I wonder if allocating 16 CPU cores and 32GB of RAM to a single VM is really needed? How many CPU cores and how much RAM does the host actually have?
 
Yup, using too much RAM can also prevent the VM from starting when using PCI passthrough. People have reported that several times here.
 
Thanks for your suggestions. Understood re. too much RAM/CPU, but I don't understand it, as it was running fine for 2 days.

Not sure if it's related, but it's also taking an age to respond when I try to click 'reboot' or 'shutdown' from the web GUI. The Ubuntu VM that I have on the same host boots and responds to shutdown/reboot requests quickly, but dunno if that's just because it's Linux.

Anyway - the host is an i9 with 16 cores and the system has 64GB of RAM. The card is an NVIDIA RTX 3060 with 12GB of memory. I reduced the VM to 16GB of RAM allocated and reduced the cores to 8, but alas it still won't start. Any idea what the limit is? Is 'too much RAM' a percentage of the total available? Or is it a hard number of GB that I shouldn't exceed?
 
Actually it appears that reducing the RAM/CPU may have helped - I've discovered that if I leave it frozen for about 20 minutes I get the Windows BSOD with a message about VIDEO TDR FAILURE. After claiming to have fixed itself, Windows then reboots, and after about a 20-minute wait I can eventually see the login screen on the local console. However, it never responds to any keyboard/mouse input, nor does it reply if I try to ping it.

I wonder if, as I keep reducing the RAM, it will eventually work, but I just don't understand how it was fine for 2 days solid after I first got it working!
 
The thing to remember for a VM with passthrough enabled is that whatever RAM you allocate is 100% reserved for that VM - whereas for a normal VM the RAM allocation is more dynamic, and any unused RAM can be shared with the host and/or other VMs depending on what each is doing at any given moment.

So the guideline would be to allocate only enough RAM for the VM to run the programs it needs, and I also would not over-allocate CPU cores - while not as critical as RAM allocation, IME over-committing CPU resources is often counter-productive as well. The goal is to balance resource usage between the host and the VMs.

I would switch your VM to a standard graphics adapter and see if the VM boots normally, then switch back to pass-through once you have resolved any boot issues.
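As a sketch of that test (using VMID 101 and the hostpci0 line from the config above - adjust to your setup):

```shell
# Temporarily remove the passthrough device and fall back to an
# emulated display so the guest can boot.
qm set 101 --delete hostpci0
qm set 101 --vga std

# Once the boot issue is resolved, re-add the GPU as before:
qm set 101 --hostpci0 0000:01:00,pcie=1,x-vga=1
```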
 
Well, it's definitely the passthrough adaptor - when I removed it from the VM and put the SPICE console back on, it booted fine. Then I built a new VM and ran into the same issue - it won't boot with the PCI card passed through to it, but will without.

I've gone back to the original VM (that worked fine for 48 hours last weekend), reduced the memory to 16GB and the cores to 4. Still won't boot - the Proxmox logo appears then hangs for ages - followed by a Windows BSOD with a message about VIDEO TDR ERROR in nvlddmkm.sys.

Don't suppose there are any further avenues I can explore to find out what's changed or why it won't work? If I hadn't had it running for 2 days I guess I'd assume it doesn't work on the NUC and go back to having Windows on the bare metal, but it's so frustrating when I've seen it work.

Anyway, thanks for your help and suggestions - much appreciated.
 
Does it work when you start the VM for the first time after a reboot of the Proxmox host (and only give problems after a complete shutdown of the VM and a restart)?
Did you update drivers or Windows inside the VM before it stopped working? Did you update the Proxmox host before it stopped working?
 
Does it work when you start the VM for the first time after a reboot of the Proxmox host (and only give problems after a complete shutdown of the VM and a restart)?
NO
Did you update drivers or Windows inside the VM before it stopped working?
I don't know. I wonder if that's the case, as I can't think of anything else that would have changed. I guess if NVIDIA released an updated driver it could have broken it - and now, when I try to download the new driver on the new VM, of course that's the new broken version. I guess I will try to see if NVIDIA has an older/different driver available.
Did you update the Proxmox host before it stopped working?
NO. Not sure if PVE has an auto update - but to try to eliminate that cause, I installed Proxmox as a nested KVM from the same ISO that I used to install the host. Then I ran pveversion -v and compared the output with the output in my post above. Identical.
 
Does it work when you start the VM for the first time after a reboot of the Proxmox host (and only give problems after a complete shutdown of the VM and a restart)?
NO
At least it is not a (common) passthrough issue then.
Did you update drivers or Windows inside the VM before it stopped working?
I don't know. I wonder if that's the case, as I can't think of anything else that would have changed. I guess if NVIDIA released an updated driver it could have broken it - and now, when I try to download the new driver on the new VM, of course that's the new broken version. I guess I will try to see if NVIDIA has an older/different driver available.
Do you have a backup of the VM (in a working state) that you can restore (as a separate VM) to see if it still works?
Did you change the VM configuration (or other system configuration) that maybe only took effect after restarting the VM?
Did you update the Proxmox host before it stopped working?
NO. Not sure if PVE has an auto update - but to try to eliminate that cause, I installed Proxmox as a nested KVM from the same ISO that I used to install the host. Then I ran pveversion -v and compared the output with the output in my post above. Identical.
Proxmox does not automatically update by itself, so I think we can rule this out.
 
@leesteken
Could this be another issue solved by pinning the older kernel?
Unlikely because @MrGeezer did not update Proxmox. Therefore, the exact same kernel was used when the VM did work. pve-kernel-5.13 is probably not even installed.
Also, I noticed that you don't have balloon: 0 in your config, maybe try setting that?
There is no ballooning when PCI(e) passthrough is used (because all VM memory is pinned into actual RAM, because of DMA etc.). I don't expect that disabling the ballooning driver, which reports actual memory usage even when ballooning is disabled, will change anything. Then again, it's little work to just try it.
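If you do want to try it, the one-liner (using VMID 101 from the config above) would be:

```shell
# Disable the ballooning device; with passthrough all guest memory is
# pinned in host RAM anyway, so this mostly just stops memory reporting.
qm set 101 --balloon 0
```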
 
Thanks. Yes, my understanding is that ballooning is irrelevant, but I did try with and without to no avail.

With regards to backup - no, I didn't have one. The next task was going to be to join the PVE host to my PBS backup server, but I hadn't got round to it yet. Perhaps if I'd spent my weekend backing up my working system instead of playing EA Sports Cricket and eating cakes, then I wouldn't be in this mess, but we live and learn, eh chaps?

Last night I created a brand new VM without the PCI card passed through and installed Windows, then cloned it. So now at least, when I add the card and it breaks the machine, I have a copy I can start again with.

Next I went to the NVIDIA site and dug through the driver history. Turns out the latest driver for my RTX 3060 was released 28th June. Hmm - isn't that the day my working system broke?!

So I quickly downloaded the previous driver from 15th June. Added the PCI passthrough to my new machine. Booted and tried to install the driver and - error. The NVIDIA driver said the card was already in use somewhere else! I'm an idiot - I still had the card passing through to the original VM!

Rebooted the host (once the VM crashes, the whole host crashes), removed the PCI passthrough from the old broken VM and added it to the new clean VM. Booted it, ready to try installing the older driver, and - CRASH. The VM did the whole "I'm going to boot but then lag so hard that you literally cannot use the keyboard or mouse" thing.

Ran out of time, so tonight I will try again. The only thing I can think of now is that when I created the new clean VM to clone and play with, I gave it 32GB of RAM (only 8 cores this time though). Maybe that's why it crashed?

Tonight I will delete the broken VM, re-clone my clean copy, reduce the RAM to 8GB and then pass the PCI card through to it again, to see if I can get it to stay alive long enough to install the NVIDIA driver from 15/06.

Thank you to everyone for your patience. I really appreciate it. If I ever get this working I owe you a pint!
 
I guess the other thing I could try is a clean install of PVE. But if you say PVE doesn't auto-update, then it's unlikely to help, as I definitely didn't do a manual update since I installed 7.2-3 from the ISO on the website.
 
Well, in case anyone comes across this again in future - I wasn't able to get back into my PVE host. Despite definitely having the correct password, the web GUI insisted my login failed (I could SSH to it, interestingly, so it was still running, sort of). Unable to do anything, I reinstalled PVE 7.2 from the ISO. I reconfigured PCI passthrough and built my Windows VM again. It works perfectly - initially with 16GB of RAM, then with 32GB (kept to 8 cores though).

I haven't yet played games on Steam, but I was able to install the slightly older NVIDIA driver and reboot the machine multiple times. Windows in the VM reports the card as the correct model and working correctly. There is no sign of sluggishness in the VM whatsoever (as you would expect from a solitary VM on an i9 with 32GB of RAM, really).

I have no explanation, but all I can assume is that somehow the NVIDIA driver update broke the VM and somehow damaged the configuration of Proxmox. I know that defies the logic of virtualisation, but I'm an IT dunce and I can't think of anything better.

So TL;DR: try reinstalling Proxmox.
 
The last GeForce driver, released a week or two ago, is really crap. I also had to remove it and replace it with an older one on my bare-metal workstation, as games were starting to crash, or at least I got strange artifacts. I would really suggest everyone skip that driver and wait for a fixed one. If you google it, there are lists of which games that driver will make unplayable.
 
I have no explanation, but all I can assume is that somehow the NVIDIA driver update broke the VM and somehow damaged the configuration of Proxmox. I know that defies the logic of virtualisation, but I'm an IT dunce and I can't think of anything better.
It could be a bit of disk corruption, which can also be caused by a bad memory chip. Did you have an unexpected power loss or force a hard system restart? You could do some (stress) testing of the system's parts and see if anything breaks. Or it might have been a one-time random-chance thing, like cosmic rays.
 
Thanks again all for the suggestions and help it's much appreciated.

I would say that Dunuin's explanation seems most likely, as I would swear the release date of the driver was the day it broke. But I just don't understand how that then broke Proxmox forever - including any as-yet unbuilt VMs. Maybe, by sheer bloody coincidence, it was some sort of corruption event after the driver broke it. Dunno, but I'm keeping that driver frozen at its current version for sure. I'm lucky that I don't play any modern or very graphics-intensive games, so I should be OK.

Thanks again for all your help. I will mark this as solved and start a new thread should it recur.
 
For what it's worth, I'm running the current config with no apparent issues:
  • Z490, i9-10850K, 64GB RAM, 3060TI
  • Latest version of Proxmox with all updates from enterprise repository. (Kernel 5.13.19-6-pve pinned)
  • Single GPU Passthrough to Windows 11 Pro 21H2 VM.
  • NVIDIA Studio Driver 516.59. (Latest Release)
But I'll second what @Dunuin said. Lots of folks are sticking with 511 or 512 NVIDIA driver releases. (Anything past 465.89 enables virtualization support.)
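For reference, the kernel pinning mentioned above can be done with proxmox-boot-tool on reasonably recent PVE 7.x installs (the version string below is taken from this post - check `kernel list` for yours):

```shell
# Show installed kernels and any current pin
proxmox-boot-tool kernel list

# Pin the known-good kernel so host updates don't switch away from it
proxmox-boot-tool kernel pin 5.13.19-6-pve
```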
 
