The whole system crashes when the gaming Windows VM is running

Zireael

New Member
Jul 9, 2024
Hi, my Proxmox host crashes completely and reboots while the gaming Windows VM (with GPU passthrough) is running. There is nothing in the system log prior to the crash; it only says "-- Reboot --".

Here are the changes I made to the system and the problems I encountered while making them:
  • Initial state of the system: The CPU was a Ryzen 5500, the GPU was an RTX 3070, and the PSU was a 500W 80+ Bronze. I used this setup for a year without any crash issues, but the CPU was not sufficient for the 3070 in all games.
  • I upgraded the CPU and PSU two weeks ago. The new CPU is a Ryzen 5700X, and the new PSU is a 750W 80+ Gold. After the upgrade, the Windows VM started hitting BSODs with the error code "VIDEO_DXGKRNL_FATAL_ERROR."
  • I reinstalled the Windows VM as a clean setup, and the BSOD issue was resolved.
  • About five days ago, a new problem began. The Windows VM crashes, and Proxmox reboots immediately (so I assume Proxmox also crashes). This happens randomly.
  • I tried different CPU types, but it didn't help.
  • This problem disappears for at least a few hours if I reinstall the Windows VM.
  • I disabled C-States and other power-saving settings in the BIOS, but that made no difference.
The entire system draws 400W from the wall when both the CPU and GPU are under a stress test. I also swapped my old PSU back in to see if the new one was the problem, but the issue persisted. I also tried different NVIDIA drivers, but that did not resolve the problem.

I don't have a spare SSD to install Windows on bare metal right now to determine whether the issue is with Proxmox or something else. I’m asking to see if anyone else has encountered the same problem or knows how to fix it. I ran Memtest to check whether something was wrong with the RAM, but it reported no errors. Thanks for any help.
 
I found the problem, but I don't know why it's happening. I saw these errors on the monitor for a split second after one crash:
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: faa000000000080b
mce: [Hardware Error]: TSC 0 MISC d012000400000000 SYND 5d000000 IPID 1002e00000500
mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1723119576 SOCKET 0 APIC 0 microcode a20120e

It seems like a CPU problem, but why does it not happen when the Windows VM is not running? I ran Prime95 for about 10 hours directly on Proxmox, and it worked without any issues. I also tried a Fedora VM with GPU passthrough, and Proxmox crashed just like before.
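For anyone who wants to read the status word from the log above, here is a minimal Python sketch that unpacks the flag bits of the value after "Bank 27:". The bit positions follow the Linux kernel's MCI_STATUS_* definitions as far as I know them. The function name `decode_mca_status` and the 6-bit width of the extended error code are my own assumptions, and the bits marked "AMD" only have this meaning on AMD CPUs.

```python
# Sketch only: decode the flag bits of an x86 MCA status word.
# Bit positions are taken from the Linux MCI_STATUS_* definitions;
# the function name is hypothetical, not part of any tool.

FLAG_BITS = {
    63: "VAL",       # record is valid
    62: "OVER",      # a previous error was overwritten
    61: "UC",        # error was not corrected by hardware
    60: "EN",        # error reporting was enabled
    59: "MISCV",     # MISC register holds valid data
    58: "ADDRV",     # ADDR register holds valid data
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt (AMD)
    53: "SYNDV",     # syndrome register valid (AMD)
    44: "DEFERRED",  # deferred error (AMD)
    43: "POISON",    # poisoned data consumed (AMD)
}


def decode_mca_status(status: int) -> dict:
    """Return the set flags and the raw error-code fields of a status word."""
    flags = [
        name
        for bit, name in sorted(FLAG_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,             # bits 15:0
        "ext_error_code": (status >> 16) & 0x3F,   # bits 21:16, width assumed
    }


if __name__ == "__main__":
    result = decode_mca_status(0xFAA000000000080B)
    print(result["flags"])
    print(hex(result["error_code"]), result["ext_error_code"])
```

For the value in my log this reports VAL, OVER, UC, EN, MISCV, PCC, TCC and SYNDV, with error code 0x80b and extended error code 0. If I read that right, UC plus PCC means an uncorrected error that the CPU considers unrecoverable, which would fit the instant reboot with nothing written to the log. It does not say which component behind bank 27 raised it.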
 
