can't change power state from D3cold to D0 - invalid power transition (from D3cold to D3hot)

Feb 28, 2023
Hi Mates!
We are facing a problem with GPU passthrough: when starting a new VM it gives up, not changing the power state of any of the PCI devices at boot.

Mar 28 17:55:50 ODIN kernel: vfio-pci 0000:89:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Mar 28 17:55:50 ODIN kernel: vfio-pci 0000:89:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Mar 28 17:55:50 ODIN kernel: vfio-pci 0000:89:00.1: can't change power state from D3cold to D0 (config space inaccessible)
Mar 28 17:55:50 ODIN kernel: vfio-pci 0000:89:00.2: can't change power state from D3cold to D0 (config space inaccessible)
Mar 28 17:55:50 ODIN kernel: vfio-pci 0000:89:00.3: can't change power state from D3cold to D0 (config space inaccessible)

Mar 28 17:55:52 ODIN kernel: pcieport 0000:89:00.0: not ready 1023ms after bus reset; waiting
Mar 28 17:55:53 ODIN kernel: pcieport 0000:89:00.0: not ready 2047ms after bus reset; waiting
Mar 28 17:55:55 ODIN kernel: pcieport 0000:89:00.0: not ready 4095ms after bus reset; waiting
Mar 28 17:56:00 ODIN kernel: pcieport 0000:89:00.0: not ready 8191ms after bus reset; waiting
Mar 28 17:56:08 ODIN kernel: pcieport 0000:89:00.0: not ready 16383ms after bus reset; waiting
Mar 28 17:56:25 ODIN kernel: pcieport 0000:89:00.0: not ready 32767ms after bus reset; waiting
Mar 28 17:56:49 ODIN pmxcfs[2832]: [status] notice: received log
Mar 28 17:56:59 ODIN kernel: pcieport 0000:89:00.0: not ready 65535ms after bus reset; giving up
Mar 28 17:56:59 ODIN kernel: vfio-pci 0000:89:00.0: invalid power transition (from D3cold to D3hot)

Any advice?
Thanks in advance,
angel
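A minimal way to check, from the host, which power state the kernel currently reports for such a device (a sketch, using the 0000:89:00.0 address from the log above; the power_state attribute needs a reasonably recent kernel):

# current PCI power state as seen by the kernel: D0, D3hot, D3cold or unknown
cat /sys/bus/pci/devices/0000:89:00.0/power_state
# power-management capability read from config space (run as root)
lspci -vvs 0000:89:00.0 | grep -A2 'Power Management'

While the card is stuck with "config space inaccessible", the second command may not return anything useful.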
 
What kind of GPU is it and what is the output of cat /proc/cmdline?
If the GPU is used during boot of the Proxmox host (like showing the BIOS screen), then you need this work-around but you won't have a console or boot messages anymore.
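For reference, the command line (and whether the IOMMU actually came up) can be checked from the host shell like this; nothing here is specific to a particular GPU:

# kernel command line the host actually booted with
cat /proc/cmdline
# confirm the IOMMU really initialised (Intel logs DMAR, AMD logs AMD-Vi)
dmesg | grep -i -e DMAR -e IOMMU -e AMD-Vi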
 
hi,
The GPUs are all Nvidia RTX 2080Ti.
We use systemd-boot on this system (ZFS RAID 1), so the kernel command line is in /etc/kernel/cmdline (not /proc/cmdline).

We have now updated it with your workaround:
root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_iommu=on nomodeset textonly initcall_blacklist=sysfb_init

Before the workaround we had:
root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_iommu=on nomodeset textonly video=vesafb:off video=efifb:off

But it still does not boot with either of the two configurations!
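For context, this is roughly how such a cmdline change gets applied on a systemd-boot/ZFS-on-root Proxmox install (a sketch, assuming proxmox-boot-tool manages the ESPs, which is the default for ZFS-on-root):

# edit the single-line command line
nano /etc/kernel/cmdline
# write the updated command line into all registered boot entries
proxmox-boot-tool refresh
# after a reboot, verify the running kernel really picked it up
cat /proc/cmdline

If /proc/cmdline still shows the old parameters after a reboot, the edit never made it into the boot entries.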
 
Other people here had success with 2080Ti's (at least up to 7). Maybe you can compare notes? I did not see any special settings in their VMs.
nomodeset textonly video=vesafb:off video=efifb:off is useless since kernel 5.15 (or maybe 5.13) and now you need initcall_blacklist=sysfb_init, but only for the boot GPU.
root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_iommu=on nomodeset textonly initcall_blacklist=sysfb_init
Are you early binding the GPUs (all functions) to vfio-pci so no driver touches them?
With Ryzen motherboards, passthrough would sometimes fail with D3cold on some BIOS versions. Updating, or sometimes downgrading to an older version, would fix such issues.
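For reference, "early binding" here usually means something along these lines (a sketch; the 10de:xxxx values are placeholders, the real vendor:device IDs for all four functions of each card come from lspci -nn):

# /etc/modprobe.d/vfio.conf -- placeholder IDs, replace with your own from lspci -nn
options vfio-pci ids=10de:xxxx,10de:yyyy,10de:zzzz,10de:wwww
# make sure vfio-pci wins against the GPU and HDMI-audio drivers
softdep nouveau pre: vfio-pci
softdep nvidia pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci

and afterwards rebuild the initramfs so it applies at boot:

update-initramfs -u -k all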
 
Yes, all GPUs are bound.
We had previously booted a VM with 4x GPUs on this server and we didn't face this problem.
All we have done is replace some of the GPUs and import a VM template, and now they don't change their power state ...
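A quick way to double-check that, per function, using the address from the log above:

# shows vendor/device IDs and the bound driver; expect "Kernel driver in use: vfio-pci"
# and no nvidia/nouveau/snd_hda_intel line
lspci -nnk -s 0000:89:00.0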
 
Then it's probably not an issue of the motherboard or your Proxmox configuration, which is good to know. Were those also NVidia GPUs?
Maybe the GPUs (this particular brand and model) you got just don't work with passthrough? It is not uncommon that the device (or the drivers) cannot handle passthrough because they don't reset properly. Sometimes there's a work-around but I have no experience with your GPU.
Maybe dumping the ROM and passing it, or even patching the ROM is sometimes needed for NVidia GPUs but I don't know the details. Maybe someone else here knows.
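For completeness, dumping a ROM from the host usually looks roughly like this (untested on this particular card, it only works while the device is still reachable, i.e. not stuck in D3cold, and the file name 2080ti.rom is arbitrary):

cd /sys/bus/pci/devices/0000:89:00.0
echo 1 > rom                          # enable reading the ROM through sysfs
cat rom > /usr/share/kvm/2080ti.rom   # copy it to where Proxmox looks for romfiles
echo 0 > rom                          # disable it again

The VM config can then reference it with something like hostpci0: 0000:89:00,pcie=1,romfile=2080ti.rom (romfile= is resolved relative to /usr/share/kvm/).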
 
Yes, the replacements are all 2080Ti, same model and vendor, just installed in different PCIe slots. I hope someone can give us a hint; we have no experience with dumping the ROM.
 
You had multiple 2080Ti's passed through working fine and replaced them with 2080Ti's that are the same model from the same vendor, but those don't work and nothing else changed? Maybe there is something wrong with the new GPUs.
 
All we actually did was change the PCIe lanes/slots on the server where the GPUs are... ?
 
Sort of, I find this much more readable: cat /proc/cmdline; for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done. You probably don't have problems with the groups because otherwise you would experience entirely different issues.

Can't you put the GPUs back into the working PCIe lanes? It is not uncommon that not all PCIe slots work well with passthrough. Do you know the PCIe layout of your motherboard (including PCIe multiplexers)?
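The same one-liner spread over several lines, purely for readability (identical behaviour):

cat /proc/cmdline
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU group %s ' "$n"
    lspci -nns "${d##*/}"
done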
 
Like this?

Attachments: screenshot of the command's output (IOMMU groups)

Please note that nomodeset textonly video=vesafb:off video=efifb:off does not work since Proxmox 7.2 and you need this work-around instead. Your IOMMU groups 93, 97 and 98 look fine. I can't explain your passthrough problems.
The IOMMU groups don't show the physical PCIe lane layout of the motherboard, so this might still be an issue with passthrough and PCIe lanes/multiplexers.
Maybe you could open a Proxmox support ticket for this particular issue (as it worked before in a very similar configuration, even though passthrough cannot be guaranteed)?
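For the PCIe-layout question, a rough way to see how the slots hang off the root ports (and any switches/multiplexers in between) without digging through the motherboard manual:

# tree view of the PCI topology, bridges and switches included
lspci -tv
# negotiated link speed/width for one of the GPUs
lspci -vvs 0000:89:00.0 | grep -i -e LnkCap -e LnkSta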
 
