Windows VMs Constantly Crashing With VIDEO DXGKRNL FATAL ERROR - Nvidia GPU Passthrough

dizzydre21

New Member
Apr 10, 2023
Hello all,

My Hardware:

Motherboard - ASRock Rack ROMED8-2T
CPU - Epyc 7F72
RAM - 128GB 3200 MHz ECC
GPU - RTX-2080 Super - passed through to Windows
OS Drive - Samsung 980 Pro 500GB
VM OS Drives - ZFS Mirror 2x960GB Samsung P9A3
Windows VM Game Storage - ZFS Mirror 2x1TB WD SN850
HBA Card - LSI-9211-8i - passed through to TrueNAS
TrueNAS Drives - 6x6TB WD Ironwolf
NIC - 82599ES 10GbE - passed through to TrueNAS

I am running Proxmox 8.1.3 on an AMD Epyc Gen 2 build with an Asrock ROMED8-2T motherboard. I have a number of Ubuntu server VMs running and one Windows 11 VM, though I had these issues on a Windows 10 VM too. The Windows 11 VM has an RTX-2080 passed through and I am using an X550 NIC with SR-IOV enabled (VFs used on the other VMs too). All VMs are running on a ZFS mirror. The drives are U.2 Samsung drives, but this started happening before I moved my VMs to these drives. I think that it was happening before I set up SR-IOV too.

I constantly get BSODs on the Windows VMs without any dump files. Sometimes there isn't even a BSOD code, but usually it is the DirectX error in the subject line of this post. This likely points to a problem with the GPU or its driver, at least from a Windows perspective, but I have gone through all of the troubleshooting steps that I can find inside the Windows guest. I removed drivers with DDU, installed earlier versions, ran several system commands to fix corrupt files, swapped the GPU for an RTX-2070 Super, disabled fast startup, disabled every power management setting related to PCIe, and probably a few other things that I am forgetting.

Nothing inside Windows shows that there is an issue until the BSOD. I've looked in the Event Viewer and in the Device Manager events for the GPUs themselves, and I ran dxdiag, which reported no problems.

The VM is used for remote gaming and doesn't crash all that often if I am running Minecraft Bedrock edition. However, I can get it to crash almost every time if I open the launcher or the Java edition. It will also crash if I try to uninstall either of those. I had to remove the GPU and install them without it, which worked, but I still got the BSOD when launching them without the GPU. It will also crash if I try to update the Nvidia driver without first removing it with DDU in safe mode. I didn't try to remove it in normal mode.

So, now I am wondering if something is up with my Proxmox configuration. I have gone through the PCI and PCIe passthrough docs several times and I think it is set up just fine. I have forced the PCIe slot to Gen3 x16 in the BIOS, which is what the card is, but the issue happens whether the slot is set to auto or manually configured.

Is there anyone out there with a similar issue, or who might have some input on how to troubleshoot further?

VM Config:
    agent: 1
    balloon: 0
    bios: ovmf
    boot: order=scsi0
    cores: 12
    cpu: host
    efidisk0: tank_nvme:vm-102-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
    hostpci0: 0000:81:00,pcie=1
    hostpci1: 0000:42:10.2,pcie=1
    machine: pc-q35-8.1
    memory: 16384
    meta: creation-qemu=8.1.2,ctime=1702680162
    name: Win11-Pro
    numa: 0
    ostype: win11
    scsi0: tank_nvme:vm-102-disk-0,cache=writeback,discard=on,size=102G,ssd=1
    scsihw: virtio-scsi-pci
    smbios1: uuid=809fb1c2-c1c0-4c32-801e-b8ec688f2d92
    sockets: 1
    startup: order=5,up=10
    tpmstate0: tank_nvme:vm-102-disk-2,size=4M,version=v2.0
    vga: std
    vmgenid: 1ecfccf4-82aa-4ad6-9a2b-da75cb035881
    #qmdump#map:efidisk0:drive-efidisk0:tank_nvme:raw:
    #qmdump#map:scsi0:drive-scsi0:tank_nvme:raw:
    #qmdump#map:tpmstate0:drive-tpmstate0-backup:tank_nvme:raw:
 
Hi,

I faced the same troubles with my 4080 passthrough.

A few weeks ago, I could successfully pass my GPU to W11 VMs, but recently I noticed that I could not get it stable.

The VM would freeze with the exact same BSOD as you described.

It works perfectly fine when I pass the GPU to a Linux VM though...

The interval between crashes would always be 11 minutes (this behaviour is documented on Reddit with no definite solution).

I had tried everything... well I thought I had.

Eventually, I tried this error 43 fix (I'm using libvirt on Manjaro, not Proxmox, but equivalent parameters should exist for Proxmox):

XML:
    <hyperv mode="custom">
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
      <vendor_id state="on" value="1234567890ab"/>
    </hyperv>
    <kvm>
      <hidden state="on"/>
    </kvm>

And I'm happy to report that I haven't had a single crash since I added the vendor_id and hidden state lines.

Hope this helps.
 
Thanks for the reply. I have seen some stuff about the error 43, but didn't think it was relevant to my issue. I will look into it further.

Are those arguments supposed to go into the VM config? I don't see them as part of the web GUI, but I can manually add them if needed.
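
For anyone wondering how the libvirt settings above might map onto Proxmox, here is a minimal sketch. It assumes that the `cpu` option of a Proxmox VM accepts the `hidden` and `hv-vendor-id` sub-options (that is my reading of the `qm` documentation; please check `man qm` on your own host before relying on it). The helper names `proxmox_cpu_option` and `qm_set_command` are made up for illustration, and the script only builds the command string; it does not run anything.

```python
import re

# Assumption: hv-vendor-id accepts 1 to 12 alphanumeric characters.
# Verify against `man qm` on your Proxmox version.
_VENDOR_ID_RE = re.compile(r"[A-Za-z0-9]{1,12}")


def proxmox_cpu_option(cputype="host", hidden=True, hv_vendor_id=None):
    """Build the value for the `cpu:` line of a Proxmox VM config."""
    parts = [cputype]
    if hidden:
        # Rough counterpart of libvirt's <kvm><hidden state="on"/></kvm>
        parts.append("hidden=1")
    if hv_vendor_id is not None:
        if not _VENDOR_ID_RE.fullmatch(hv_vendor_id):
            raise ValueError("hv-vendor-id must be 1-12 alphanumeric characters")
        # Rough counterpart of libvirt's <vendor_id state="on" value="..."/>
        parts.append(f"hv-vendor-id={hv_vendor_id}")
    return ",".join(parts)


def qm_set_command(vmid, cpu_option):
    """Return (not run) the qm command that would apply the option."""
    return f"qm set {vmid} --cpu {cpu_option}"


if __name__ == "__main__":
    option = proxmox_cpu_option("host", hidden=True, hv_vendor_id="1234567890ab")
    print(qm_set_command(102, option))
```

The same value could instead be written by hand on the `cpu:` line of the VM config. Whether it helps with this particular BSOD is untested here.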

BTW, I just checked my Event Viewer and there are definitely some 11-minute intervals between BSODs. I have left the VM running overnight a few times without it crashing, though, and I have definitely played Minecraft on it for longer than that.
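
A quick way to check for that 11-minute pattern is to compute the gaps between crash times copied out of Event Viewer. This is only a sketch: the timestamps below are invented, and the function names and the one-minute tolerance are arbitrary choices of mine.

```python
from datetime import datetime


def crash_intervals_minutes(timestamps):
    """Return the gaps, in minutes, between consecutive crash times."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]


def count_near(intervals, target=11.0, tolerance=1.0):
    """Count the gaps that fall within `tolerance` minutes of `target`."""
    return sum(1 for gap in intervals if abs(gap - target) <= tolerance)


if __name__ == "__main__":
    # Hypothetical crash times, not taken from the real event log
    sample = [
        "2024-01-05T20:00:00",
        "2024-01-05T20:11:10",
        "2024-01-05T20:22:05",
        "2024-01-05T23:40:00",
    ]
    gaps = crash_intervals_minutes(sample)
    print([round(gap, 1) for gap in gaps])
    print(count_near(gaps), "gap(s) close to 11 minutes")
```

If most gaps cluster around one value, that points to something periodic rather than to a particular application.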
 
A bit of an update:

I believe I added the arguments above, and all they really do is set the CPU type to KVM64. This does seem to fix the BSODs, but it cuts CPU performance in the VM by 20-25%, at least according to Cinebench. Obviously, that is just a quick test and not the be-all and end-all. I am not sure this is really the best solution, but it would work as a last resort.

EDIT:

KVM=Off is also set by default. You can confirm which flags are set for everything by running qm showcmd <vmid>
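
To make the `-cpu` part of that output easier to read, something like the sketch below could be used. It assumes the single-line output of `qm showcmd <vmid>` (not the `--pretty` form, whose line continuations this simple parser does not handle). The sample command line is made up and shortened, and `cpu_flags` is a hypothetical helper name.

```python
import shlex


def cpu_flags(cmdline):
    """Return the comma-separated pieces of the -cpu argument, or [] if absent."""
    args = shlex.split(cmdline)
    for index, arg in enumerate(args[:-1]):
        if arg == "-cpu":
            return args[index + 1].split(",")
    return []


if __name__ == "__main__":
    # Invented, shortened command line for illustration only
    sample = ("/usr/bin/kvm -id 102 -name 'Win11-Pro' "
              "-cpu 'host,kvm=off,hv_vendor_id=proxmox' -m 16384")
    for flag in cpu_flags(sample):
        print(flag)
```

On a Proxmox host, the real string could be obtained with `subprocess.run(["qm", "showcmd", "102"], capture_output=True, text=True, check=True).stdout` and passed to `cpu_flags`.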
 
I am updating this because I think I may have resolved the issue. This is a comment from another post I made regarding a different issue, but I think I may have fixed both in one go.

From that post:
I believe disabling C-States in the BIOS has resolved this issue. It does use about 20 more watts at idle now, which I kind of dislike, but oh well.

I am no longer getting unexpected shutdown events after quite a few reboots and shutdowns across two different Windows 11 VMs. I have a sneaking suspicion that this may have resolved the BSODs that I mentioned above, but I need more time to tell. I don't think that Windows was shutting down correctly after driver installs or updates. I was also getting a weird issue where my GPU wouldn't clock down below 1755 MHz despite being on normal power modes in Windows and Nvidia Control Panel. After using DDU to uninstall the drivers in safe mode, I reinstalled them in normal boot mode, successfully rebooted, and now it idles at 210 MHz, as it should.

I will mark this as solved after some further testing.

The link to the post:
 
So some of the above info is not totally correct. My GPU was not clocking down because some game streaming software forces the driver into performance mode.

Also, I had to disable Windows fast boot as well as C-states, and then all of the weird issues went away. The fact that Windows was not shutting down correctly was also interfering with the GPU driver, because the VM wasn't rebooting properly after a clean driver install. It was like pulling the power plug right after installing the driver.
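
To see from the Proxmox host which CPU idle states the kernel currently exposes (for example before and after changing the C-state setting in the BIOS), a small sketch like this could help. It reads the standard Linux cpuidle sysfs directory for cpu0. The function name is hypothetical, and exactly what shows up after disabling C-states in the BIOS depends on the platform and idle driver, so treat the result as a hint rather than proof.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (name, disabled) for each idle state exposed for one CPU."""
    base = Path(cpuidle_dir)
    if not base.is_dir():
        # No cpuidle directory: the kernel exposes no idle states here
        return []
    # Sort numerically so that state10 comes after state2
    state_dirs = sorted(base.glob("state[0-9]*"),
                        key=lambda path: int(path.name[len("state"):]))
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((name, disabled))
    return states


if __name__ == "__main__":
    for name, disabled in list_idle_states():
        print(f"{name}: {'disabled' if disabled else 'enabled'}")
```

Fast startup is a separate, guest-side setting and has to be turned off inside Windows itself.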
 
Hey, I'm facing the same issue with my Threadripper 7975WX on an Asrock WRX90 Evo build.

Does the C-State fix still hold up? At the moment, I'm testing different options, and what seems to work so far is setting the CPU model to an Intel CPU, e.g., Icelake-Server-noTSX. I haven't done any performance testing, but Icelake also supports AVX etc.

What almost always triggers the bluescreen is running GPU-Z, hitting the refresh button, and then either waiting or rebooting.
What's also super weird is the behaviour of GPU-Z. On its first launch it shows the correct PCIe lane speed. After the first refresh it switches to just PCI. This is also fixed when using an Intel CPU model.
 
I also have this issue on Proxmox 8.2 with a Ryzen 1700. It started happening after I swapped my GPU from a GTX 1650 to a GTX 1070 then back to the GTX 1650.

The workaround that stopped the BSODs for me (and also made the GPU-Z PCIe link speed output show x16 as normal) is to change CPU type to x86-64-v3.
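
For comparing CPU types such as `host`, `x86-64-v3`, or `Icelake-Server-noTSX`, the change comes down to the `cpu:` line of the VM config. The sketch below previews that edit on a copy of the config text. `set_cpu_type` is a hypothetical helper, and it assumes the CPU type is the first comma-separated item on the `cpu:` line; on a real host the usual route would be the web GUI or `qm set <vmid> --cpu <type>`.

```python
def set_cpu_type(conf_text, cputype):
    """Return conf_text with the CPU type on the main `cpu:` line replaced."""
    lines = conf_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("["):
            # Snapshot sections start here; leave them untouched
            break
        if line.startswith("cpu:"):
            # Keep any extra sub-options that follow the CPU type
            _, _, extras = line[len("cpu:"):].strip().partition(",")
            lines[index] = f"cpu: {cputype}" + (f",{extras}" if extras else "")
            return "\n".join(lines) + "\n"
    # No cpu line in the main section: add one at the top
    return f"cpu: {cputype}\n" + "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    original = "cores: 12\ncpu: host\nmemory: 16384\n"
    print(set_cpu_type(original, "x86-64-v3"), end="")
```

Running a quick benchmark after each change would show how much performance each CPU type costs compared with `host`.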
 