[SOLVED] Windows 10 VM - memory errors when passing through GPU, after upgrading to Proxmox 8.2

marcosscriven

I'm trying to run a Windows 10 VM cloned from a template I've been using without any problems. When I pass through an Nvidia GPU I get these errors in the logs:


Code:
[Fri May 17 08:31:12 2024] x86/PAT: memtype_reserve failed [mem 0xf800000000-0xf801ffffff], track uncached-minus, req uncached-minus
[Fri May 17 08:31:12 2024] ioremap memtype_reserve failed -16
[Fri May 17 08:31:34 2024] x86/PAT: CPU 7/KVM:2231 conflicting memory types f800000000-f802000000 uncached-minus<->write-combining
[Fri May 17 08:31:34 2024] x86/PAT: memtype_reserve failed [mem 0xf800000000-0xf801ffffff], track uncached-minus, req uncached-minus
[Fri May 17 08:31:34 2024] ioremap memtype_reserve failed -16
[Fri May 17 08:31:34 2024] x86/PAT: CPU 7/KVM:2231 conflicting memory types f800000000-f802000000 uncached-minus<->write-combining
[Fri May 17 08:31:34 2024] x86/PAT: memtype_reserve failed [mem 0xf800000000-0xf801ffffff], track uncached-minus, req uncached-minus
[Fri May 17 08:31:34 2024] ioremap memtype_reserve failed -16
[Fri May 17 08:31:34 2024] x86/PAT: CPU 7/KVM:2231 conflicting memory types f800000000-f802000000 uncached-minus<->write-combining
[Fri May 17 08:31:34 2024] x86/PAT: memtype_reserve failed [mem 0xf800000000-0xf801ffffff], track uncached-minus, req uncached-minus
[Fri May 17 08:31:34 2024] ioremap memtype_reserve failed -16
[Fri May 17 08:31:34 2024] x86/PAT: CPU 7/KVM:2231 conflicting memory types f800000000-f802000000 uncached-minus<->write-combining
[Fri May 17 08:31:34 2024] x86/PAT: memtype_reserve failed [mem 0xf800000000-0xf801ffffff], track uncached-minus, req uncached-minus
[Fri May 17 08:31:34 2024] ioremap memtype_reserve failed -16

I see no output, and the whole machine eventually crashes.

I also tried turning "rombar" on in the config. With that I do get output for a while, but the Nvidia GPU driver gives me error 43. As I say, I've previously passed this GPU through without problems, on the same machine, with the same Windows 10 VM clones from a fresh template.

The VM config is:

Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0
cores: 16
cpu: host
efidisk0: local-lvm:vm-107-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:15:00.0,pcie=1,rombar=0  <-- The GPU. I tried with rombar on as well.
hostpci1: 0000:01:00,pcie=1,x-vga=1,rombar=0  <-- USB.
machine: pc-q35-8.1
memory: 8192
meta: creation-qemu=8.1.5,ctime=1711703824
name: q10-test-1
net0: virtio=BC:24:11:BD:89:EF,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsi0: local-lvm:vm-107-disk-1,cache=writeback,discard=on,iothread=1,size=128G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=e388d747-8e02-4b3c-a44d-2057510e27bb
sockets: 1
vmgenid: 8495e022-7cd7-4791-8f3f-517a928ed362

uname:

Code:
uname -a
Linux pve-maxi 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 GNU/Linux

If I don't pass through the GPU at all, it works fine.
 
what kernel do you boot currently? (can you post the output of 'dmesg'?)

does it work when booting an older kernel?
 

Thanks @dcsapak

I did a complete fresh install straight to 8.2, so I don't have the older kernels. Whatever was latest in 8.1 definitely worked.

Right now I have this:

Code:
Linux pve-maxi 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 GNU/Linux

I also tried the only earlier kernel I have:
Code:
Linux pve-maxi 6.8.4-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) x86_64 GNU/Linux

With the same outcome.

I've attached two dmesg files: one from a clean boot without any VMs/LXCs starting, and one with just the section logged after I start the VM in question.
 


OK, from the dmesg I can see that nouveau is loaded on boot. A comment in an older Reddit thread, https://www.reddit.com/r/VFIO/comments/pwpm2h/comment/hym10e7/, suggested as a workaround not binding the GPU to a real driver before passing it through (that was on an older kernel, but the symptoms were similar).
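
To see which host driver currently owns the card, something like the following may help. It is only a sketch: `driver_in_use` is a hypothetical helper name, and it simply filters the "Kernel driver in use:" line that `lspci -k` prints for a device.

```shell
#!/bin/sh
# Hypothetical helper: reads `lspci -k` output on stdin and prints the
# name of the bound kernel driver. Prints nothing if no driver is bound.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}
```

With the GPU address from the VM config above, usage would be `lspci -nnk -s 15:00.0 | driver_in_use`. If that prints `nouveau`, the host driver has claimed the card before the VM starts.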

could you try to blacklist the nouveau driver to test if that fixes the problem?
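
A minimal sketch of how the blacklist could be set up on the host. The function name `blacklist_nouveau` and the file name `blacklist-nouveau.conf` are arbitrary choices for illustration; modprobe reads any `.conf` file under `/etc/modprobe.d`.

```shell
#!/bin/sh
# Hypothetical helper: writes a modprobe blacklist entry for nouveau.
# The target directory defaults to /etc/modprobe.d (needs root there);
# pass another directory as the first argument to try it out safely.
blacklist_nouveau() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'blacklist nouveau\n' > "$conf_dir/blacklist-nouveau.conf"
}
```

After running `blacklist_nouveau` as root, the initramfs most likely needs to be refreshed with `update-initramfs -u -k all`, followed by a reboot, so that the module is not loaded early at boot.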
 

Thanks @dcsapak! Blacklisting worked. Not sure how I missed that Google result - but I see it was from two years ago.

I wonder what changed that blacklisting is now necessary, when it wasn't (or didn't appear to be) before?
 
Not completely sure; it could also be a problem in the nouveau driver.
If I have time I could try to reproduce/bisect it, but that is very time-intensive (kernel build + reboot for every step).
 