Passthrough RTX 6000/5090 CPU Soft BUG lockup, D3cold to D0, after guest shutdown

Which device is reporting the D3cold to D0 message?

If you have different hardware, you can find the IDs using: lspci -nn
e.g.
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GB202 [GeForce RTX 5090] [10de:2b85] (rev a1)
02:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)

The appropriate vendor and device IDs for your system should be used when binding to vfio-pci and setting PM rules.
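As a rough sketch of pulling those IDs out of the lspci -nn output: the helper below (vfio_ids is a made-up name, not an existing tool) reads lspci -nn text on stdin and prints the unique vendor:device pairs for one vendor as a comma-separated list, which is the form the vfio-pci "ids=" module option takes. It assumes the IDs appear as lowercase hex in square brackets, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: collect unique vendor:device IDs for one vendor
# from `lspci -nn` output supplied on stdin.
# $1 = PCI vendor ID in lowercase hex (defaults to 10de, NVIDIA).
vfio_ids() {
    vendor="${1:-10de}"
    # Match only "[vvvv:dddd]" tokens; class codes like "[0300]" have no colon.
    grep -oE "\[${vendor}:[0-9a-f]{4}\]" \
        | tr -d '[]' \
        | sort -u \
        | paste -sd, -
}
```

Something like lspci -nn | vfio_ids 10de should then give a list you can paste after "options vfio-pci ids=" in a modprobe.d file. Check the result against your own lspci output before using it.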

For the interrupt, in the guest check dmesg and lspci to determine which device is having the issue, then follow through to the host to see what's next.

dmesg -T
lspci -vv |grep -i 'interrupt:'
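If the dmesg output is long, a filter along these lines may help narrow it down. This is only a sketch: gpu_kmsgs is a made-up name, and the keyword list (vfio, D3cold, soft lockup, plus the NVRM and Xid prefixes the NVIDIA driver uses) is my guess at what is relevant to this thread, so adjust it to whatever you actually see.

```shell
#!/bin/sh
# Hypothetical filter: keep only kernel log lines that mention the
# passthrough-related keywords discussed in this thread.
# Reads log text on stdin, writes matching lines to stdout.
gpu_kmsgs() {
    # -i: case-insensitive, -E: extended regex with alternation.
    # The keyword list is an assumption; extend it as needed.
    grep -iE 'vfio|d3cold|soft lockup|NVRM|Xid'
}
```

Usage would be dmesg -T | gpu_kmsgs, on the host or in the guest.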
I have implemented this as well, and I still see the issue.

But here is the thing.
My RTX PRO 6000 Blackwell 96GB 600W is, I think, working fine now.
But the RTX PRO 6000 Blackwell 96GB 300W Max-Q is having this issue regardless.

I also upgraded to the 6.17 kernel and flashed the UEFI firmware fix, but still no luck with the Max-Q version.

https://www.nvidia.com/content/Driv...ter_2.0-x64.exe&firmware=1&lang=us&type=Other
 
My RTX 6000 Max-Q cards are disappearing on the host when I just power on a VM 2-3 times... and a random GPU is gone. This is madness.
Nvidia's status is still "passed to developers" with no other info. I'm sitting on 48 GPUs that are broken...

And yes, I confirmed that the ATTR was applied:
for d in /sys/bus/pci/devices/*; do [ -f "$d/vendor" ] && [ "$(cat $d/vendor)" = "0x10de" ] || continue; b=$(basename "$d"); drv=$(basename "$(readlink -f $d/driver 2>/dev/null)" 2>/dev/null || echo none); echo "$b dev=$(cat $d/device) driver=$drv d3cold=$(cat $d/d3cold_allowed 2>/dev/null) override=$(cat $d/driver_override 2>/dev/null)"; done
0000:01:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:21:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:41:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:61:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:81:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:a1:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:c1:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:e1:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
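For catching the moment a card drops off the host, a small counting check might be useful. A minimal sketch, assuming the GPUs show up in sysfs with vendor 0x10de and a display class code (0x03xxxx); count_nvidia_gpus is a made-up name, and the optional argument exists only so the function can be pointed at a different directory for testing.

```shell
#!/bin/sh
# Hypothetical check: count NVIDIA display-class PCI functions that are
# currently visible under a sysfs PCI device directory.
# $1 = directory to scan (defaults to /sys/bus/pci/devices).
count_nvidia_gpus() {
    root="${1:-/sys/bus/pci/devices}"
    n=0
    for d in "$root"/*; do
        # Skip entries that lack the attributes we need.
        [ -f "$d/vendor" ] && [ -f "$d/class" ] || continue
        [ "$(cat "$d/vendor")" = "0x10de" ] || continue
        case "$(cat "$d/class")" in
            0x03*) n=$((n + 1)) ;;  # 0x03xxxx = display controller class
        esac
    done
    echo "$n"
}
```

On a host like the one in the listing above you would expect it to print 8. Running it before and after each VM start and comparing the number would show when a GPU goes missing.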
 
I am on the RTX PRO 6000 Max-Q 300W. If you actually need the lspci output, let me know and I can grab the device IDs for you.
 
The RTX PRO 6000 600W version and the RTX 5090 seem to be fixed when we add the udev rule.

But I also just got the RTX PRO 6000 Max-Q 300W version, which has an even stranger issue.
The crash happens instantly on a random GPU when I turn on the VM.
It is a different GPU every time.
And without passthrough it works fine.
Every time the issue happens, I see this error in the guest:

IMG_0883.png

VFIO is correctly bound to the GPUs. I even switched the GPUs to compute mode, but still nothing helps for those.
Any idea if this is the same issue or a new one, or am I missing something?