Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

I am trying four extra kernel parameters to see whether they help. I will report back when the issue happens again.

quiet idle=nomwait pci=nocrs pci=realloc processor.max_cstate=5 amd_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 vfio-pci.ids=10de:22e8,10de:2bb1 initcall_blacklist=sysfb_init
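
After a reboot it can help to confirm that the parameters actually reached the running kernel. The following is a minimal sketch, not something from the original post: the helper name missing_params is made up for illustration, and the example picks a few parameters from the line above because the post does not say which four are the new ones.

```shell
# Print every expected parameter that is absent from a kernel command line.
# Usage: missing_params "<cmdline>" param1 param2 ...
# (missing_params is a hypothetical helper name, not an existing tool.)
missing_params() {
    cmdline=" $1 "                       # pad so every parameter is space-delimited
    shift
    for param in "$@"; do
        case "$cmdline" in
            *" $param "*) ;;             # present as a whole word
            *) printf '%s\n' "$param" ;; # absent: report it
        esac
    done
}

# On the host, after rebooting with the new command line:
# missing_params "$(cat /proc/cmdline)" idle=nomwait pci=nocrs pci=realloc processor.max_cstate=5
```

No output means every listed parameter is present in /proc/cmdline.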
 
Have a look at the attachment:
So upgrading the firmware did not help. I am looking for next steps. I have already written to ASRock support and will also write to NVIDIA, but I think it might be a motherboard issue. I am also thinking about what more I can do.
Anyway, I would like to thank you for your suggestion. If you need, for example, an RTX 5090 or RTX 6000 to use in a VM for a while, I have a datacenter where I rent those out, and I can lend you one for free for some time.
I will check out the scripts you sent later, once I have this stable.
 
Those commands did not help.
I am currently talking to NVIDIA Enterprise Support Tier II to get this fixed. They asked me to contact the Proxmox support team, so I have also sent them a support ticket.
 
Hi!
I have basically the same problem and opened another thread:
my thread

I think I tried the firmware update as well, but I am not quite sure about it.

Is there any help at the moment?

Regards

Christof
 
Today I got an answer from Proxmox. It is not my Proxmox installation, and it was installed from a clean Debian. But here it is:

pve-edk2-firmware: not correctly installed
According to the report, the package is not installed correctly, which may affect EFI VMs. Please reinstall:

Code:
apt update
apt install --reinstall pve-edk2-firmware
# Check if apt is ok:
apt -f install

# You could also try the opt-in kernel 6.14. This has fixed a GPU issue in the past.

apt install proxmox-kernel-6.14
 
I answered your thread on Level1Techs buddy - don't know if you saw it:

https://forum.level1techs.com/t/do-...ies-has-reset-bug-in-vm-passthrough/228549/35

What fixed it on my system (debian) was to disable the nvidia-drm modeset option on the VM - that's all. No changes needed on the host.
Oh, I actually did not see it. Thanks.
Hmm, that is interesting. I can try this setting in the VM: options nvidia-drm modeset=0
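
For anyone who wants to try the same thing, here is a rough sketch of how that option could be applied inside a Linux guest. It assumes a Debian-style guest with the NVIDIA driver loaded as the nvidia_drm module. The helper name write_modeset_conf and the file name nvidia-drm-modeset.conf are my own choices for illustration, not something from the Level1Techs post.

```shell
# Write the modprobe option file. The directory argument defaults to
# /etc/modprobe.d, so on a real guest this must run as root.
# (write_modeset_conf and the file name are hypothetical, chosen for this sketch.)
write_modeset_conf() {
    dir="${1:-/etc/modprobe.d}"
    printf 'options nvidia-drm modeset=0\n' > "$dir/nvidia-drm-modeset.conf"
}

# Inside the guest, as root:
# write_modeset_conf
# update-initramfs -u    # Debian/Ubuntu: so the option also applies if the module loads early
# reboot
#
# After the reboot, check what the loaded module reports (readable by root only);
# it should show N when modesetting is disabled:
# cat /sys/module/nvidia_drm/parameters/modeset
```

Keep in mind this lives entirely inside the guest, so whoever controls the VM can revert it.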

Still, your Windows is rock solid, while we also get this issue after a Windows guest shuts down.
I am wondering whether you set something special in Windows, or whether the drivers have something to do with it, or whether you set something special for VGA in Proxmox for the VM?
 
No, nothing special, just a typical Windows 11 installation with the latest NVIDIA driver. I'm not on Proxmox though, just a Debian 12 host running libvirt KVM guests. I've attached my VM definitions in case it helps.
 


One of my clients confirmed that his RTX 6000 Blackwell crashed all the time while he was training with Unsloth. After adding that fix in the VM, it no longer crashes!
Can anyone else confirm?
I have asked NVIDIA support whether they can fix this on the host side or in the GPU BIOS, as changing anything in my clients' VMs would be impossible lol, and they could still change that config back and crash again.
 
Same issue after guest shutdown on Linux 6.14.8-2-pve, X670E PG Lightning, Proxmox VE 9.0.3 x86_64, NVIDIA GeForce RTX 5070, AMD Ryzen 9 7900

Code:
Aug 11 21:11:48 proxmox kernel: vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
Aug 11 21:11:48 proxmox kernel: vfio-pci 0000:01:00.0: resetting
Aug 11 21:11:48 proxmox kernel: vfio-pci 0000:01:00.1: resetting
Aug 11 21:11:48 proxmox kernel: vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
Aug 11 21:11:50 proxmox kernel: pcieport 0000:00:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
Aug 11 21:11:50 proxmox kernel: vfio-pci 0000:01:00.0: reset done
Aug 11 21:11:50 proxmox kernel: vfio-pci 0000:01:00.1: reset done
Aug 11 21:11:50 proxmox kernel: vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
Aug 11 21:11:50 proxmox kernel: vfio-pci 0000:01:00.0: Unable to change power state from D0 to D3hot, device inaccessible
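
To compare hosts, it may be handy to pull out just the devices that failed to wake up. This is a small sketch that assumes the log lines look like the ones above. The function name d3cold_stuck_devices is made up for illustration.

```shell
# Read kernel log lines on stdin and print each PCI address for which vfio-pci
# logged "Unable to change power state from D3cold to D0" (duplicates removed).
# (d3cold_stuck_devices is a hypothetical helper name.)
d3cold_stuck_devices() {
    sed -n 's/.*vfio-pci \([0-9a-f:.]*\): Unable to change power state from D3cold to D0.*/\1/p' | sort -u
}

# On the host, against the kernel messages of the current boot:
# journalctl -k -b | d3cold_stuck_devices
```

Fed the excerpt above, it would list only 0000:01:00.1, because the 0000:01:00.0 line reports a different transition (D0 to D3hot).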
 
I got a response from NVIDIA: they were able to reproduce this issue and are thinking about a fix.
I have also installed the 6.14 kernel (apt install proxmox-kernel-6.14.8-2-bpo12-pve/stable), and the RTX 6000 now boots super fast, versus very slowly on the older 6.8 and 6.11 kernels. Some Blackwell support was added in 6.14, so it is worth trying out.
https://www.phoronix.com/news/Linux-6.14-VFIO
Anyway, the crash on shutdown is caused by the specific training workload itself and/or by some NVIDIA module options.
After applying options nvidia-drm modeset=0 and the /etc/X11/xorg.conf.d configuration, the training that caused the issue no longer crashes the GPU.
But since a client can do anything inside the VM, this is not a good solution.
 
Yeah, it seems like it!
I have upgraded a few servers with RTX 4090, RTX 5090, and also RTX 6000 Blackwell cards to that kernel, proxmox-kernel-6.14.8-2-bpo12-pve/stable.
So far it works fine, plus those crazy fast startups.

So only one thing still remains: the GPU crashes when the VM guest uses modeset=1 or does other unusual things and then shuts down. The GPU then goes to sleep forever xD
We will see whether NVIDIA fixes it or not.

I also suspect the RTX 4090 has the same issue, but I am not certain, as it might be a different issue. I will check soon.
 