[SOLVED] Windows VM random reboot after Xeon to EPYC migration w/ GPU passthrough

warriorcookie

Apr 2, 2023
Hello,

I've searched for this issue and tried the solutions I've found, but none have resolved it.

Proxmox 7.4-3 running TrueNAS, Debian w/ Docker, and a Windows VM.

I previously had a dual E5-2667 v2 server running Proxmox for a few years. The VM in question is a Windows 10 VM with GPU (RTX 3060 Ti) passthrough, used for Parsec/Steam/gaming.

I recently upgraded to an H11SSL-i with an EPYC 7371.

The migration was relatively uneventful, with the exception of the Windows VM. The VM runs fine until it streams any sort of game. It then runs for a short period of time before the entire VM reboots. I also have another VM with a Quadro P400 for Plex transcoding that works with no issues.

-Edited my systemd boot config, changing intel_iommu=on to amd_iommu=on. Tried both with and without iommu=pt.
-Confirmed there are no other devices in the iommu_group for the GPU.
-Dumped the GPU BIOS and added it to the conf file (romfile). This increased the time between reboots from 5 mins to maybe 20 mins.
-Added args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off' and saw no difference.
-Machine type was already q35; changed it to the latest version, no difference.
-Terminal was getting spammed with dmesg errors. echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf resolved the errors, but the VM still crashes.
-Installed the microcode update.
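
For anyone repeating two of the checks above, here is a minimal sketch, not a definitive tool. It assumes the standard sysfs locations /sys/kernel/iommu_groups and /sys/module/kvm/parameters. The function names and the optional directory argument are my own additions, there only so the functions can be pointed at a test tree.

```shell
#!/bin/sh
# Print one line per PCI device with the IOMMU group it belongs to.
# $1 (hypothetical helper argument): root of the group tree,
# defaulting to the usual sysfs path.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob matched nothing
        group="${dev#"$root"/}"        # strip the root prefix
        group="${group%%/*}"           # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}

# Read back the two kvm module parameters set via /etc/modprobe.d/kvm.conf.
# $1 (hypothetical helper argument): parameter directory.
show_kvm_msr_params() {
    params="${1:-/sys/module/kvm/parameters}"
    for name in ignore_msrs report_ignored_msrs; do
        if [ -r "$params/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$params/$name")"
        else
            printf '%s=unavailable\n' "$name"
        fi
    done
}
```

On the host, run list_iommu_groups, note the group number printed next to the 0000:21:00.x entries, and check that no other device shares that number. show_kvm_msr_params only confirms that the modprobe options were picked up after the module was reloaded or the host rebooted.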

I keep trying the solutions posted to threads I find in my searches but I'm feeling like I'm throwing mud against the wall at this point. I sure would appreciate some help narrowing this down.


dmesg | grep -e DMAR -e IOMMU:
Code:
[    0.960388] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.960412] pci 0000:20:00.2: AMD-Vi: IOMMU performance counters supported
[    0.960430] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    0.960451] pci 0000:60:00.2: AMD-Vi: IOMMU performance counters supported
[    0.977462] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.977472] pci 0000:20:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.977477] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.977481] pci 0000:60:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.978109] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    0.978114] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    0.978119] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    0.978124] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).


conf:
Code:
agent: 1
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'
bios: ovmf
boot: order=scsi0;sata2
cores: 16
cpu: host,hidden=1,flags=+pcid
efidisk0: local-zfs:vm-102-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:21:00,pcie=1,romfile=rtx3060ti.bin
machine: pc-q35-7.2
memory: 16384
meta: creation-qemu=6.1.0,ctime=1643449206
name: Windows
net0: virtio=92:5B:C8:65:C7:BF,bridge=vmbr0,firewall=1
net1: virtio=5E:F1:D8:F5:B9:E5,bridge=vmbr1,firewall=1
numa: 1
onboot: 1
ostype: win10
parent: fresh
sata2: none,media=cdrom
scsi0: local-zfs:vm-102-disk-0,cache=writeback,discard=on,size=128G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=d2f66626-ba8e-4b9c-a676-5994cd4609f2
sockets: 1
startup: order=4
tpmstate0: local-zfs:vm-102-disk-2,size=4M,version=v2.0
usb0: host=19b9:4d10,usb3=1
vga: std
vmgenid: 17483853-331f-47cc-b8ed-96431c49638d
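
The conf above sets the CPU model in two places (the args: -cpu line and the cpu: line), so it may be worth confirming which one QEMU actually ends up with. A small sketch that pulls the CPU-related keys out of the active part of the conf follows. It assumes the VMID is 102 (inferred from the disk names) and the usual /etc/pve/qemu-server/ location; the function name is made up for illustration.

```shell
#!/bin/sh
# Print the CPU-related keys from the active section of a Proxmox VM conf.
# Snapshot sections begin with a "[name]" header, so stop at the first one.
# $1: conf file path (assumed default: VMID 102, inferred from disk names).
show_cpu_settings() {
    conf="${1:-/etc/pve/qemu-server/102.conf}"
    awk '/^\[/ { exit }
         /^(args|cpu|cores|sockets|numa|machine):/ { print }' "$conf"
}
```

Running qm showcmd 102 on the host prints the full generated QEMU command line, which shows the -cpu option or options that are really passed.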
 
Just an update.

syslog shows nothing that coincides with the restart.

Event Viewer gives Event ID 41 with keywords (70368744177664),(2).
 
So changing the CPU type to kvm64 fixes the problem but reduces performance. With it set to host or EPYC, the problem comes back.

I can reproduce the issue every time by running 3DMark Time Spy.


Any suggestions on how to fix this?
 
Just a wild guess: try an earlier version of package pve-edk2-firmware.
It was worth a try, but no dice.

I tried 3.20230228-1 (latest), 3.20221111-2 all, 3.20220526-1. All crashed the VM at roughly the same point in the Time Spy benchmark.

I also tried the previous 2 kernels.


Does anybody know what logs might give more info as to why the crash is happening?
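
One thing that might help narrow it down is to take the timestamp of the Event ID 41 entry from the guest and look at the host journal, kernel messages included, in a narrow window around it. This is only a sketch: it assumes GNU date and journalctl as shipped with Debian/Proxmox, and that the guest and host clocks and time zones agree. The function names and the 120-second default margin are my own.

```shell
#!/bin/sh
# Print the start and end of a window centred on a crash timestamp,
# one per line. Epoch arithmetic avoids relative-date parsing surprises.
# $1: timestamp such as "2023-04-02 14:30:00"; $2: margin in seconds.
crash_window() {
    ts="$1"
    margin="${2:-120}"
    epoch=$(date -d "$ts" +%s) || return 1
    date -d "@$((epoch - margin))" '+%Y-%m-%d %H:%M:%S'
    date -d "@$((epoch + margin))" '+%Y-%m-%d %H:%M:%S'
}

# Show every host journal entry inside that window.
host_logs_around() {
    window=$(crash_window "$1" "${2:-120}") || return 1
    since=$(printf '%s\n' "$window" | sed -n 1p)
    upto=$(printf '%s\n' "$window" | sed -n 2p)
    journalctl --since "$since" --until "$upto" --no-pager
}
```

For example, host_logs_around "2023-04-02 14:30:00" 60 would print the host journal from one minute before to one minute after that (made-up) time.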
 
I created a new Windows 10 VM, and it works with the CPU set to host and the graphics card passed through. I haven't tried EPYC.

Strange, as I used the old VM daily on the Xeon system. It only started showing issues after I swapped the motherboard/processor for the EPYC.

I checked the old VM, and no device errors show in Device Manager. Anyway, thanks for the suggestions.