NVMe Issue: Unable to change power state from D3cold to D0, device inaccessible

In my opinion there is a bug in kernels later than 6.8.4-2.

I can't start my Win11 VM with NVMe passthrough.

Code:
Jun 22 11:37:51 proxmox kernel: vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway
Jun 22 11:37:51 proxmox kernel: vfio-pci 0000:02:00.0: Unable to change power state from D0 to D3hot, device inaccessible

Maybe someone could help?
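Not a fix, but since the log shows vfio-pci falling back to a function level reset after a timed-out transaction, it may be worth checking which reset methods that controller actually supports (the 0000:02:00.0 address is taken from the log above, adjust as needed):
Code:
# does the device advertise FLR in its PCIe capabilities (FLReset+)?
lspci -vvv -s 0000:02:00.0 | grep -i flreset
# which reset methods the kernel will try for it (newer kernels only)
cat /sys/bus/pci/devices/0000:02:00.0/reset_method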
 
Did anyone get to the bottom of this?

I have experienced this on both kernels:
6.5.13-5-pve
6.8.8-2-pve

I thought this was due to undervolting my CPU (and it could be), but I'm pretty sure I reset those settings and still got the error. Currently I have the voltage settings applied, and no amount of rebooting seems to clear the error; I have just reset my BIOS.

Other times I can reboot over and over and continue to get it.

It only appears to have started following an upgrade, and I had not upgraded in roughly four months.
 
Code:
Jul 28 17:00:45 pve kernel: pcieport 0000:02:0b.0: Unable to change power state from D3hot to D0, device inaccessible
Jul 28 17:00:45 pve kernel: pcieport 0000:02:0a.0: Unable to change power state from D3hot to D0, device inaccessible
Jul 28 17:00:45 pve kernel: pcieport 0000:02:09.0: Unable to change power state from D3hot to D0, device inaccessible
Jul 28 17:00:45 pve kernel: pcieport 0000:02:08.0: Unable to change power state from D3hot to D0, device inaccessible
Jul 28 17:01:46 pve kernel: nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 28 17:01:46 pve kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Jul 28 17:01:46 pve kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jul 28 17:01:46 pve kernel: nvme 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible

I have the same problem. I have 4 Intel P4510 4T drives on a PCIe switch card, and they suddenly lose power shortly after being powered on. I have configured "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" in /etc/default/grub, and it still doesn't work.

I have tried 6.8.8-3-pve and 6.5.13-5-pve; my filesystem is ZFS with raidz1.
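For what it's worth, on a Proxmox host that boots from ZFS the kernel command line is often taken from /etc/kernel/cmdline (systemd-boot) rather than /etc/default/grub, so parameters added only to GRUB may never be applied. A rough sketch of both paths plus a check that they actually took effect after a reboot:
Code:
# GRUB-booted hosts: append the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
update-grub

# ZFS / systemd-boot hosts: append them to the single line in /etc/kernel/cmdline, then:
proxmox-boot-tool refresh

# after rebooting, confirm they are active
cat /proc/cmdline
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us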
 
Same here:

Setup:
B650D4U
Ryzen 9750X3D
Memory 128GB
2* Samsung 990 Pro (with Heatsink)

Lost the NVMe after 24 hours. I think I had upgraded to 6.8.8-3-pve shortly before.
Changed the NVMe 3 times.

Tried everything in this thread. Any other ideas?
 
So far I think this is due to a power supply issue, as I noticed all of my drive power lights were off before the error appeared. So I swapped the DIY power cable running to the hard drive bay back to the original one, and no problem has occurred yet. I guess this is also why the developers pay so little attention to this: the syslog messages aren't really the problem, just the aftermath.
 
I was sick of the weird behavior, so I have had to change my setup dramatically: got rid of my Hyper card, consolidated drives, and got rid of my ZFS SLOG. No problem on the new kernel yet. Such a pain.
 
My setup has the 2* Samsung 990 Pro (with heatsink) as M.2 drives directly on the board (B650D4U), no PCIe riser, so it can't be a cable problem. Maybe I'll try changing the PSU in case there is a problem with any cable.

Also tried downgrading the kernel, BIOS, etc. No luck.
 
What is your error log? Did you try
Code:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
first?
 
I feel the same way about having to rework the whole setup. I spent two months working on these painful hardware problems. Why is human technology so backward?
 
Hi.
Following logs, from two different NVMe drives in the same slot:
Code:
Jul 26 09:31:38 proxmox kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 26 09:31:38 proxmox kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jul 26 09:31:38 proxmox kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jul 26 09:31:38 proxmox kernel: nvme 0000:0a:00.0: Unable to change power state from D3cold to D0, device inaccessible
Jul 26 09:31:38 proxmox kernel: nvme nvme0: Disabling device after reset failure: -19

Jul 27 13:09:24 proxmox kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 27 13:09:24 proxmox kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jul 27 13:09:24 proxmox kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jul 27 13:09:24 proxmox kernel: nvme 0000:0a:00.0: Unable to change power state from D3cold to D0, device inaccessible
Jul 27 13:09:24 proxmox kernel: nvme nvme0: Disabling device after reset failure: -19

Yes, tried nvme_core.default_ps_max_latency_us=0 pcie_aspm=off first.
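Since the parameter was already set, one extra thing that can be checked is whether APST is actually disabled on the controller afterwards; a small sketch with nvme-cli, assuming the drive shows up as /dev/nvme0:
Code:
# feature 0x0c = Autonomous Power State Transition; with
# nvme_core.default_ps_max_latency_us=0 the APST Enable bit should be 0
nvme get-feature /dev/nvme0 -f 0x0c -H
# list the power states and APST attributes the drive itself advertises
nvme id-ctrl /dev/nvme0 | grep -E 'apsta|^ps '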
 
A couple of things I tried with varied success (it looks like I was experiencing multiple issues, though):

BIOS reset
PCIe power saving disabled in the BIOS (ASPM?)
kernel pinning (I thought selecting a kernel manually was pinning it - it isn't; see the sketch below)
disabling any power saving, powertop etc.
990 firmware update - I had to install Windows to check this
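Two of those points, sketched for anyone following the thread (the kernel version is only an example of a previously working one):
Code:
# actually pin a known-good kernel; having it listed in the boot menu is not pinning
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.5.13-5-pve
# check the currently installed 990 Pro firmware revision without booting Windows
nvme list

Updating the firmware itself from Linux depends on whether fwupd carries an image for the model, so Windows or Samsung's own tools may still be needed for that step.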
 
Aaaand I'm back to
Code:
kvm: ../hw/pci/pci.c:1633: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.

On the latest kernel
 

I think you may have missed some logs or some operation before those entries. As I said, those logs are just the result; the device had already been disconnected before they appeared.
 
Honestly, I don't know what is going on. It's a mix of errors.

At one point I was unable to boot with my boot NVMe in the Hyper card, no matter the kernel (an I/O error or something).
 
Sadly, the painful errors have appeared again :(
So far, I have tried the following methods:
1. `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off`
2. enable VMD support in the BIOS
3. disable ASPM in the BIOS
4. change the PCIe cable
5. update the PVE kernel

They work occasionally, but fail again after the next restart, so I am constantly confused.
I found more reports in various places, such as OpenZFS: https://github.com/openzfs/zfs/discussions/14793 , and other Linux kernel threads.
I have no idea now. I will now try `pcie_port_pm=off pcie_aspm=off` in GRUB; no error so far.

Looking forward to help from the professionals.
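In case it helps the next person, a rough way to check whether ASPM is actually off on the links once those parameters are set, and to try to recover a dropped device without a full reboot (no guarantee this works when the device is stuck in D3cold):
Code:
# every LnkCtl line should read "ASPM Disabled" with pcie_aspm=off
lspci -vv 2>/dev/null | grep -i 'aspm'
# if a drive has fallen off the bus, a rescan sometimes brings it back
echo 1 > /sys/bus/pci/rescan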
 
Hi, I have a B650D4U with the latest BIOS and BMC. I downgraded to the previous version and have had no failure/error so far.
Maybe it helps you investigate the error.