Passthrough RTX 6000/5090: CPU soft lockup BUG, D3cold to D0, after guest shutdown

Which device is reporting the D3cold to D0 message?

If you have different hardware, you can find the IDs using: lspci -nn
e.g.:
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GB202 [GeForce RTX 5090] [10de:2b85] (rev a1)
02:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)

The appropriate vendor and device IDs for your system should be used when binding to vfio-pci and setting the PM rules.
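
For example, with the 5090 IDs shown above, the vfio-pci binding and the power-management rule would look roughly like this (a sketch only; substitute your own vendor:device IDs from lspci -nn):

Bash:
# /etc/modprobe.d/vfio.conf - bind the GPU and its audio function to vfio-pci
options vfio-pci ids=10de:2b85,10de:22e8 disable_vga=1 disable_idle_d3=1

# /etc/udev/rules.d/99-nvidia-d3cold.rules - keep NVIDIA devices out of D3cold
SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{d3cold_allowed}="0"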

For the interrupt, check dmesg and lspci in the guest to determine which device is having the issue, then follow through on the host to see what happens next.

dmesg -T
lspci -vv |grep -i 'interrupt:'
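
On the host side, a quick way to see which interrupt vectors vfio is holding for the device (a sketch; replace 02:00.0 with your own address from lspci -nn):

Bash:
# vfio MSI/MSI-X vectors appear in /proc/interrupts tagged with the PCI address
grep -i vfio /proc/interrupts
# and the interrupt state of the device as the host sees it
lspci -vv -s 02:00.0 | grep -iE 'interrupt|msi'
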
I have implemented this as well and I still see the issue.

But here is the thing.
My RTX PRO 6000 Blackwell 96GB 600W is, I think, working fine now.
But the RTX PRO 6000 Blackwell 96GB 300W Max-Q is having this issue regardless.

I also upgraded to the 6.17 kernel and flashed that UEFI firmware fix, but still no luck with the Max-Q version.

https://www.nvidia.com/content/Driv...ter_2.0-x64.exe&firmware=1&lang=us&type=Other
 
My RTX 6000 Max-Q cards are disappearing on the host after powering on the VM just 2-3 times ... and a random GPU is gone. This is madness.
NVIDIA's status is still "passed to developers" with no other info. Sitting on 48 GPUs that are broken .....

And yes, I confirmed that the ATTR was applied:
for d in /sys/bus/pci/devices/*; do
  [ -f "$d/vendor" ] && [ "$(cat "$d/vendor")" = "0x10de" ] || continue
  b=$(basename "$d")
  drv=$(basename "$(readlink -f "$d/driver" 2>/dev/null)" 2>/dev/null || echo none)
  echo "$b dev=$(cat "$d/device") driver=$drv d3cold=$(cat "$d/d3cold_allowed" 2>/dev/null) override=$(cat "$d/driver_override" 2>/dev/null)"
done
0000:01:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:21:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:41:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:61:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:81:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:a1:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:c1:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
0000:e1:00.0 dev=0x2bb4 driver=vfio-pci d3cold=0 override=(null)
 
I am on the RTX PRO 6000 Max-Q 300W. If you actually need the lspci output, let me know and I can grab the device IDs for you.
 
The RTX PRO 6000 600W version and the RTX 5090 seem to be fixed once we add the udev rule.

But I also just got the RTX PRO 6000 Max-Q 300W version, which has an even stranger issue.
The crash happens instantly on a random GPU when I power on the VM.
Every time it is a different GPU.
And without passthrough it works fine.
Every time the issue happens, I see this error in the guest:

[screenshot: IMG_0883.png]

vfio is correctly bound to the GPUs. I even disabled the GPUs' display by switching them to compute mode, but nothing helps for those.
Any idea if this is the same issue, a new issue, or am I missing something?
 
hey guys, any updates on the situation with the RTX PRO 6000 Max-Q?
I have two of these cards in a server and after the VM shuts down, they kinda lock the host.
I'm running the latest Proxmox beta kernel 6.17 and I also made the power-management changes with the correct IDs for the Max-Q cards.
I also read a comment about pinning the VM to the correct CCD, so now I'm just passing the whole CPU to the guest.

But the problem persists. The last thing I see is

Code:
vfio-pci [vga-id]: not ready 32767ms after FLR; waiting
 
I have returned mine to NVIDIA and replaced them with 600W ones.
Anyway, I did not solve the issue on the Max-Q ones.

Make sure you also do this:
echo 'SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{d3cold_allowed}="0"' | sudo tee /etc/udev/rules.d/99-nvidia-d3cold.rules

sudo udevadm control --reload
sudo udevadm trigger --subsystem-match=pci
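
To confirm the rule actually applied, a quick check (similar to the loop posted earlier in the thread; expect 0 for every NVIDIA function):

Bash:
for d in /sys/bus/pci/devices/*; do
  [ "$(cat "$d/vendor" 2>/dev/null)" = "0x10de" ] || continue
  echo "$(basename "$d"): d3cold_allowed=$(cat "$d/d3cold_allowed")"
done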

And the rest of the things, including the UEFI fix. But it still probably won't solve the issue.

The other solution might be using an NVIDIA enterprise license to get MIG support and mediated devices for passthrough, but I had no success there, so I just put all my effort into replacing the Max-Q cards, as they are just plain bad.
 
So I wanted to post my experience and feedback on this bug and what I've seen so far.
I've had three workstations:
(2x) HP Z840s running two different Intel Xeon processors (E5-2690 v4 and E5-2699 v4) with the stock motherboard configuration on BIOS 2.62 (the latest publicly available). I tested both with the Max-Q and hit the host crash bug over and over and over.

I have an i9-9900K with an Asus Z390 Pro 4 motherboard: zero issues with the Max-Q. I transferred the card over and it ran flawlessly for weeks.

My third machine is a 9985WX Threadripper on an ASUS WRX90 motherboard, and the Max-Q crashes that system.

I added the RTX 5090 as a temporary replacement (on the Threadripper) for lighter AI workloads and, without adding anything extra (including the UDEV configuration, only enabling IOMMU passthrough), it works fine, no issues. Running the latest Proxmox 9.0.11 with the latest kernel from the update repo.

It almost seems like the combination of Proxmox/VFIO, HEDT systems and Blackwell triggers this problem.

All three systems had days of testing, and the outcome on each of the three different platforms was consistent.
 
Having some issues with passing through 600W RTX 6000 Blackwell Server Edition cards.

Q35 guest machine with OVMF BIOS. I am using an AMD EPYC platform with the GPUs that I want to pass through, so I can swap between Windows and Linux guests depending on what I am doing.

The cards are using the vfio driver on the host, Above 4G decoding and all that seems to be working fine, but in the guest the BARs are never mapped. Does anyone have pointers?

[screenshot: 1761697462671.png]

Confirming vfio-pci is in use on host:

Bash:
lspci -nnk | grep -A2 2bb5
06:00.0 3D controller [0302]: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition] [10de:2bb5] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:204e]
    Kernel driver in use: vfio-pci

Various host config:

Bash:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt vfio-pci.ids=10de:2bb5"

# /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:2bb5 disable_vga=1

# /etc/modprobe.d/vfio.conf
options kvm ignore_msrs=1 report_ignored_msrs=0
options vfio-pci ids=10de:2bb5 disable_vga=1 disable_idle_d3=1

# load vfio modules
sudo tee /etc/modules-load.d/vfio.conf >/dev/null <<'EOF'
vfio
vfio_iommu_type1
vfio_pci
EOF

qm start 100
qm monitor 100
info pci, then scrolling to the device shows:

Bash:
  Bus  1, device   0, function 0:
    3D controller: PCI device 10de:2bb5
      PCI subsystem 10de:204e
      IRQ 0, pin A
      BAR0: 64 bit prefetchable memory (not mapped)
      BAR2: 64 bit prefetchable memory (not mapped)
      BAR4: 64 bit prefetchable memory (not mapped)
      id "hostpci0"

Edit: I think I solved it. I'll verify and post my process.
 
I'll continue finishing off my testing of the default pve kernel (Linux 6.14.11-4-pve x86_64) and then I'll move one of my nodes over to testing those additional kernels - but that will take a bit. I have a pretty comprehensive test suite for GPU workloads which normally takes around 48h.

So far though, this is looking very promising, I am seeing negligible loss in performance, no weird power states after guest shutdown.

I'll post my notes below, but will follow up with something more reasonable, this is for a 4x rtx 6000 passthrough vm:

Bash:
# guest setup

# disable secure boot in guest BIOS # todo: see if we can get secure boot enabled, but I doubt it; not even the Azure ND series has it

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=realloc"

# end of guest setup

# pve side guest setup

# command to set with qm
qm set <vmid> -args '-global q35-pcihost.pci-hole64-size=512G'

# host cpu
# q35
# uefi bios but no secureboot

# pve config (4x rtx 6000)

args: -global q35-pcihost.pci-hole64-size=512G  # todo: verify that this is required
balloon: 0                                      # todo: verify that this is required, unsure of impact if ballooning permitted
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 48
cpu: host                                       # absolutely required; went from 2 to 8 visible GPUs with the host CPU flag
efidisk0: local-zfs:vm-102-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:06:00.0,pcie=1
hostpci1: 0000:07:00.0,pcie=1
hostpci2: 0000:75:00.0,pcie=1
hostpci3: 0000:76:00.0,pcie=1
ide2: local:iso/ubuntu-22.04.5-live-server-amd64.iso,media=cdrom,size=2086842K
machine: q35
memory: 262144
meta: creation-qemu=10.0.2,ctime=1761713353
name: gpu-test-1
net0: virtio=BC:24:11:F2:EF:A0,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-zfs:vm-102-disk-1,iothread=1,size=256G
scsihw: virtio-scsi-single
smbios1: uuid=8685f01d-6ed9-46fa-9782-b615ad18a8b4
sockets: 1
tpmstate0: local-zfs:vm-102-disk-2,size=4M,version=v2.0

# end of pve side guest setup

# host setup

# verify Resizable BAR, Above 4G decoding, SR-IOV and IOMMU in BIOS - todo: add host side verification script (rough sketch appended at the end of these notes)

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt vfio-pci.ids=10de:2bb5" # todo: verify if vfio-pci line required, has worked without

cat /etc/modprobe.d/blacklist-gpu.conf # cut down from exhaustive list
blacklist radeon
blacklist nouveau
blacklist nvidia

cat /etc/modprobe.d/vfio.conf
options kvm ignore_msrs=1 report_ignored_msrs=0 # todo: verify if 100% required
options vfio-pci ids=10de:2bb5 disable_vga=1 disable_idle_d3=1 # todo: verified, disable_idle_d3 required or we enter a weird power state

cat /etc/modules-load.d/vfio.conf # required
vfio
vfio_iommu_type1
vfio_pci

# end of host setup
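
# host side verification sketch (for the todo above) - assumptions: vendor 10de,
# device ID 10de:2bb5 as used in these notes; adjust IDs for your cards

# IOMMU active? (should list at least one group)
ls /sys/kernel/iommu_groups | wc -l

# GPUs bound to vfio-pci, with the kernel driver in use shown
lspci -nnk -d 10de:

# BAR / region sizes assigned by the host (expect large 64-bit prefetchable regions with Above 4G decoding)
lspci -vv -d 10de:2bb5 | grep -i region

# d3cold_allowed for every PCI function (expect 0 on the NVIDIA ones)
grep -H . /sys/bus/pci/devices/*/d3cold_allowed 2>/dev/null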
 
Can you share your test suite? It might be interesting. We are testing using some mining programs, but I am wondering if we could use something that exercises many different workloads.

Also, if you have 2 CPUs, then numa should be set to 1.
And if you are allocating more than 300GB of RAM, consider using hugepages, as allocation is much, much faster; allocating 700GB, for example, might even fail without hugepages, or take something like 5 minutes.
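
If you go the hugepages route, a minimal sketch of the PVE host side (assuming 1 GiB pages and VM ID 102 from the config above; the page count must match the guest memory size):

Bash:
# host kernel cmdline (append in /etc/default/grub, then run update-grub and reboot):
#   default_hugepagesz=1G hugepagesz=1G hugepages=256
# tell the VM to back its memory with 1 GiB hugepages
qm set 102 --hugepages 1024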

In the guest, make sure you don't use modeset in the NVIDIA driver, as this was doing some bad stuff.

Also, you are using the udev rule, right?

I did not use pci=realloc in the guest, but on the host it was causing strange behaviour, such as a missing network card.
Also use the UEFI firmware upgrade from NVIDIA, as this fixed some of the crashes, but not all of them.
I am still getting a crash from time to time.
 
hey guys, I have a confirmed fix for the RTX PRO 6000 Max-Q and the CPU soft lockup bug / GPU reset failure with error messages like "not ready 32767ms after FLR; waiting".

I managed to successfully shut down and restart the VM and the GPUs still work.

I found the fix here:
https://www.reddit.com/r/VFIO/comments/1mjoren/any_solutions_for_reset_bug_on_nvidia_gpus/
Which references this discussion:
https://forum.level1techs.com/t/do-...ies-has-reset-bug-in-vm-passthrough/228549/35

In the VM itself, you have to edit /etc/modprobe.d/nvidia-modeset.conf:
options nvidia-drm modeset=0

Then do update-initramfs -u
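
A minimal way to apply and verify this inside the guest (a sketch; the file path is the one given above, and the sysfs check assumes the nvidia-drm module is loaded):

Bash:
echo 'options nvidia-drm modeset=0' | sudo tee /etc/modprobe.d/nvidia-modeset.conf
sudo update-initramfs -u
# after a reboot, this should report N (modeset disabled)
cat /sys/module/nvidia_drm/parameters/modeset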

This change alone did the trick for me; all the other stuff did not help. I have not tested long-term stability yet.
 
Hmm, running into a new issue now. I wiped the node to start fresh and confirm all the config steps, and I can only pass through 4 GPUs before NIC BAR allocation fails. Any ideas on that?

I've tested interleaving GPUs based on PCIe lanes and there's no issue as long as it stays at 4 or fewer - it seems like a capacity issue rather than NUMA limits.

Edit: Lol, whoops, I forgot numa=1 when installing the guest.

Edit: adding a screenshot of the config:

[screenshot: 1761786992581.png]

With the above, and the steps outlined in my previous post, I'm able to install a guest OS with 8 GPUs passed through. One important step is to attach all GPUs before installing the guest OS and to ensure numa is set to 1.

Verified that without this, the guest OS reports the NUMA node of all GPU PCI devices as -1, even with numa=1 present.
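
A quick way to check this from inside the guest (a sketch; -1 means the device has not been assigned a NUMA node):

Bash:
# NUMA node reported for each NVIDIA PCI function in the guest
for d in /sys/bus/pci/devices/*; do
  [ "$(cat "$d/vendor" 2>/dev/null)" = "0x10de" ] || continue
  echo "$(basename "$d"): numa_node=$(cat "$d/numa_node")"
done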

I'm actually not using hugepages or udev rules, and I have yet to see slow VM starts or any power-state weirdness.
 