NVMe Issue: Unable to change power state from D3cold to D0, device inaccessible

Today it happened again. I'm also looking forward to help from the professionals.
Hi everyone, for now I think disabling `PCI Express Clock Gating` may be the solution. As I understand it, PCI Express Clock Gating turns off the clock signal when the device is idle to save power, and maybe some NVMe devices do not support this.
It currently works for me based on a simple A/B test; the system ran normally for over a month until I rebooted.
You can give it a try.
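For anyone who wants to check first whether their NVMe controller even advertises clock power management before toggling that BIOS option, lspci can show it. A quick sketch (the 01:00.0 address is just an example, substitute your own controller's address):

Code:
lspci | grep -i 'non-volatile'                  # find the NVMe controller's PCI address
lspci -vv -s 01:00.0 | grep -Ei 'ClockPM|ASPM'  # ClockPM+ = clock power management advertised, ClockPM- = not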
 
This is just awful. Is there a solution in sight? My M.2 drives (both 990 Pro) seem to be fine, but the U.2 drives will both drop off after some time (both of them within ~30-60 seconds of each other). What seems to make it more stable (less unstable) is making sure this is set to "on":

/sys/class/nvme/nvme#/device/power/control

If it is set to auto (which I saw in several other installs, but not my latest for some reason), the drives drop off very quickly....

Just like some others' experience, these drives worked perfectly under ESXi.
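For anyone wanting to apply the same setting, here is a minimal sketch that forces runtime power management to "on" (i.e. never suspend) for every NVMe controller. Note it does not persist across reboots, so a udev rule or startup script would be needed to make it stick:

Code:
# Keep every NVMe controller's parent PCI device fully powered ("on" = no runtime suspend).
for ctrl in /sys/class/nvme/nvme*; do
    echo on > "$ctrl/device/power/control"
    echo "$ctrl -> $(cat "$ctrl/device/power/control")"
done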
 
I found a solution that worked for me:

  • Install nvme-cli: "apt-get install nvme-cli"
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" // scroll up; the first line shows "Autonomous Power State Transition Enable (APSTE): Enabled"
  • nano /etc/kernel/cmdline
  • Append "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" so the line reads:
  • "root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
  • proxmox-boot-tool refresh
  • Reboot
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" again // Autonomous Power State Transition Enable (APSTE): Disabled

No problems so far, for weeks now.
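Worth noting: /etc/kernel/cmdline plus proxmox-boot-tool applies to systemd-boot installs. On a GRUB-booted Proxmox host (a GRUB file is mentioned further down in the thread) the same two parameters would go into GRUB's default command line instead; roughly:

Code:
# Add the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
nano /etc/default/grub
update-grub          # regenerate the GRUB config
reboot
cat /proc/cmdline    # verify the parameters are active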
 
Are you sure this has fixed it and it's not just a long time since you last had the error? I had it stable like this for long stretches too.
I also definitely got the error while ASPM was disabled (disabled at the BIOS level).

Despite all that, I'm now sitting 100% stable with ASPM enabled on kernel 6.5.13-6-pve.
 
Are you sure this has fixed it and it's not just a long time since you last had the error? I had it stable like this for long stretches too.
I also definitely got the error while ASPM was disabled (disabled at the BIOS level).

Despite all that, I'm now sitting 100% stable with ASPM enabled on kernel 6.5.13-6-pve.
I'm pretty sure this was the problem. Normally we saw the error on one of our 10 servers 1-2 times per week (at random). Now there have been no issues for weeks, with no other changes.
 
I found a solution that worked for me:

  • Install nvme-cli: "apt-get install nvme-cli"
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" // scroll up; the first line shows "Autonomous Power State Transition Enable (APSTE): Enabled"
  • nano /etc/kernel/cmdline
  • Append "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" so the line reads:
  • "root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
  • proxmox-boot-tool refresh
  • Reboot
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" again // Autonomous Power State Transition Enable (APSTE): Disabled

No problems so far, for weeks now.
Thanks! But... I already have those in my GRUB file, and can confirm APSTE is Disabled for the two 990 Pro drives, but I get the following for my two U.2 drives:

Code:
# nvme get-feature /dev/nvme2 -f 0xc -H
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2)

Maybe I can find out what will return that info, if available, on them.
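One quick way to check, with the same nvme-cli tool, whether those U.2 drives support APST at all (which would explain the Invalid Field error on feature 0xc) is to look at the apsta field in the Identify Controller data:

Code:
# apsta : 0 means the controller does not support Autonomous Power State Transitions at all.
nvme id-ctrl /dev/nvme2 | grep -i apsta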

I've been hunting for the opposite kind of solution too: I can't figure out yet how to get ASPM to show that it's ENABLED.

I have:

  • Enabled ASPM in the BIOS
  • Removed 'pcie_aspm=off' from GRUB
  • Rebooted
  • Ran 'cat /proc/cmdline' to confirm it was as expected after the change
  • Ran 'lspci -vvv' and observed all PCIe devices still show LnkCtl of 'ASPM Disabled'
  • Added 'pcie_aspm=on' to GRUB
  • Ran 'cat /proc/cmdline' to confirm it was as expected after the change
  • Ran 'lspci -vvv' and observed all PCIe devices still show LnkCtl of 'ASPM Disabled'

I just want to prove out that this ASPM setting is actually doing something, and needed, not just anecdotal.
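One more thing that might help prove it out (not from the steps above, just a suggestion): the kernel's ASPM policy can be inspected and switched at runtime, which makes it easier to see whether the links ever change state:

Code:
# The bracketed entry is the policy currently in use.
cat /sys/module/pcie_aspm/parameters/policy
# Ask the kernel to enable ASPM where the link supports it (may fail if the
# firmware or the kernel command line has disabled ASPM entirely).
echo powersave > /sys/module/pcie_aspm/parameters/policy
# Re-check the link control/capability lines.
lspci -vvv | grep -i aspm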
 
I had installed a version above 8, but not the latest one, and I ran into this problem about a week ago. I'm running the Proxmox VE (PVE) system from an NVMe hard disk enclosure. Yesterday I updated to the latest version, 8.3.2, and also added the parameters above to GRUB. It's been a day and a night without any problem so far; before that, it happened several times a day. I hope there won't be any more error messages after the update.

To try to solve this problem, I also specifically bought a hard disk enclosure with the RTL9210 chip; before, it was the ASM2362 chip. Now it seems there may be some improvement after updating the Proxmox VE (PVE) version and the kernel. The current kernel is 6.8.12-5-pve. I just checked and it has been running for 18 hours. I'll continue to observe it.
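If anyone wants to confirm whether the error has come back without waiting for a drive to drop off, the kernel message from the thread title is easy to watch for:

Code:
# Search the current boot's kernel log for the power-state error.
journalctl -k -b | grep -i "unable to change power state"
# Or watch live as messages arrive:
dmesg -wT | grep -i "unable to change power state"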
 
