Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

Which device is reporting the D3cold to D0 message?

If you have different hardware, you can find the IDs using: lspci -nn
e.g.
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GB202 [GeForce RTX 5090] [10de:2b85] (rev a1)
02:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)

The appropriate vendor and device IDs for your system should be used when binding to vfio-pci and setting pm rules.
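If it helps, here is a rough sketch for pulling the vendor:device pairs out of lspci -nn output so they can be pasted into a vfio-pci ids list. The function name nvidia_ids is made up for illustration, and it assumes the IDs show up in the usual [10de:xxxx] bracket form seen in the example above.

```shell
# Hypothetical helper (not part of any tool): read `lspci -nn` output on
# stdin and print the NVIDIA vendor:device pairs as one comma-separated list.
nvidia_ids() {
    grep -oE '\[10de:[0-9a-f]{4}\]' | tr -d '[]' | sort -u | paste -sd, -
}

# Usage on a real host:
# lspci -nn | nvidia_ids
```

The resulting list is what you would put after ids= in an "options vfio-pci ids=..." line in a file under /etc/modprobe.d/. Check the output against your own lspci listing before relying on it.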

For the interrupt, in the guest check dmesg and lspci to determine which device is having the issue, then follow through to the host to see what's next.

dmesg -T
lspci -vv |grep -i 'interrupt:'
I have implemented this as well and I still see the issue.

But here is the thing.
My RTX PRO 6000 Blackwell 96GB 600W is working fine now, I think.
But the RTX PRO 6000 Blackwell 96GB 300W Max-Q has this issue regardless.

I also upgraded to the 6.17 kernel and flashed the UEFI firmware fix, but still no luck with the Max-Q version.

https://www.nvidia.com/content/Driv...ter_2.0-x64.exe&firmware=1&lang=us&type=Other
 
My RTX 6000 Max-Q cards are disappearing on the host when I just power on a VM 2-3 times... and a random GPU is gone. This is madness.
Nvidia's only answer is still "passed to developers" with no other info. I'm sitting on 48 GPUs that are broken...

And yes, I confirmed that the ATTR was applied:
for d in /sys/bus/pci/devices/*; do [ -f "$d/vendor" ] && [ "$(cat $d/vendor)" = "0x10de" ] || continue; b=$(basename "$d"); drv=$(basename "$(readlink -f $d/driver 2>/dev/null)" 2>/dev/null || echo none); echo "$b dev=$(cat $d/device) driver=$drv d3cold=$(cat $d/d3cold_allowed 2>/dev/null) override=$(cat $d/driver_override 2>/dev/null)"; done
0000:01:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:21:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:41:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:61:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:81:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:a1:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:c1:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:e1:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
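With this many cards it may be easier to print only the ones where the rule did not stick. A minimal sketch, assuming the same sysfs layout as the one-liner above; the function name d3cold_offenders and its optional directory argument are hypothetical (the argument only exists so it can be pointed at a test directory).

```shell
# Hypothetical helper: list NVIDIA PCI functions whose d3cold_allowed is not 0.
# $1 is the sysfs device directory; it defaults to the real one.
d3cold_offenders() {
    root=${1:-/sys/bus/pci/devices}
    for d in "$root"/*; do
        # Skip anything that is not an NVIDIA (vendor 0x10de) function.
        [ -f "$d/vendor" ] && [ "$(cat "$d/vendor")" = "0x10de" ] || continue
        [ -f "$d/d3cold_allowed" ] || continue
        val=$(cat "$d/d3cold_allowed")
        # Print only the functions where D3cold is still allowed.
        [ "$val" = "0" ] || echo "$(basename "$d") d3cold_allowed=$val"
    done
}

# Usage on a real host (no output means every NVIDIA function reads 0):
# d3cold_offenders
```

Empty output matches the listing above, where every function shows d3cold=0.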
 
I am on the RTX PRO 6000 Max-Q 300W. If you actually need the lspci output, let me know and I can grab the device IDs for you.
 
The RTX PRO 6000 600W version and the RTX 5090 seem to be fixed when we add the udev rule.

But I also just got the RTX PRO 6000 Max-Q 300W version, which has an even stranger issue.
The crash happens instantly on a random GPU when I turn on a VM.
Every time it is a different GPU.
And without passthrough it works fine.
Every time the issue happens, I see this error in the guest:

IMG_0883.png

VFIO is correctly bound to the GPUs. I even switched the GPUs to compute mode, but still nothing helps for those.
Any idea whether this is the same issue or a new one, or am I missing something?
 
hey guys, any updates on the situation with the RTX PRO 6000 Max-Q?
I have two of these cards in a server and after VM shut down, they kinda lock the host.
I'm running the latest Proxmox beta kernel 6.17 and I also made the power management changes with the correct IDs for the Max-Q cards.
I also read a comment about passing through the correct CCD, so now I'm just using the whole CPU in the guest.

But the problem persists. The last thing I see is

Code:
vfio-pci [vga-id]: not ready 32767ms after FLR; waiting
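Since earlier posts mention that a different GPU drops out each time, it might be worth pulling the affected addresses out of the host log. A minimal sketch: the function name flr_stuck_devices is made up, and it assumes the kernel prints the PCI address right after "vfio-pci" in the spot where the line above shows [vga-id].

```shell
# Hypothetical helper: read kernel log lines on stdin and print the unique
# PCI addresses that reported "not ready ...ms after FLR".
flr_stuck_devices() {
    sed -nE 's/.*vfio-pci ([0-9a-f:.]+): not ready [0-9]+ms after FLR.*/\1/p' | sort -u
}

# Usage on the host (dmesg may need root):
# dmesg | flr_stuck_devices
```

If the same address keeps coming back it points at one card or slot; if it moves around, that matches the "random GPU" behaviour described earlier in the thread.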
 
I have returned mine to Nvidia and replaced them with 600W ones.
Anyway, I did not solve the problem with the Max-Q ones.

Make sure you also do this:
echo 'SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{d3cold_allowed}="0"' | sudo tee /etc/udev/rules.d/99-nvidia-d3cold.rules

sudo udevadm control --reload
sudo udevadm trigger --subsystem-match=pci

And the rest of the things, including the UEFI fix. But it still probably won't solve the issue.

The other solution might be using an NVIDIA enterprise license to get MIG support and mediated devices for passthrough, but I had no success there, so I just put all my effort into replacing the Max-Q cards, as they are just plain shi..