NVMe Issue: Unable to change power state from D3cold to D0, device inaccessible

Today it happened again. I'm also looking forward to help from the professionals.
Hi everyone, for now I think disabling `PCI Express Clock Gating` may be the solution. As I understand it, PCI Express Clock Gating turns off the clock signal when the device is idle to save power, and maybe some NVMe devices do not support this.
It currently works for me based on a simple A/B test; the system ran normally for over a month until I rebooted.
You can give it a try.
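For anyone who wants to check first whether their NVMe controller even advertises clock power management before toggling that BIOS option, lspci can show it. A quick sketch (the 01:00.0 address is just an example, substitute your own controller's address):

Code:
lspci | grep -i 'non-volatile'                  # find the NVMe controller's PCI address
lspci -vv -s 01:00.0 | grep -Ei 'ClockPM|ASPM'  # ClockPM+ = clock power management advertised, ClockPM- = not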
 
This is just awful. Is there a solution in sight? My M.2 drives (both 990 Pro) seem to be fine, but the U.2 drives will both drop off after some time (both of them within ~30-60 seconds of each other). What seems to make it more stable (less unstable) is making sure this is set to "on":

/sys/class/nvme/nvme#/device/power/control

If it is set to auto (which I saw in several other installs, but not my latest for some reason), the drives drop off very quickly....

Just like some others' experience, these drives worked perfectly under ESXi.
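For anyone wanting to apply the same setting, here is a minimal sketch that forces runtime power management to "on" (i.e. never suspend) for every NVMe controller. Note it does not persist across reboots, so a udev rule or startup script would be needed to make it stick:

Code:
# Keep every NVMe controller's parent PCI device fully powered ("on" = no runtime suspend).
for ctrl in /sys/class/nvme/nvme*; do
    echo on > "$ctrl/device/power/control"
    echo "$ctrl -> $(cat "$ctrl/device/power/control")"
done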
 
I found a solution that worked for me:

  • Install nvme-cli: "apt-get install nvme-cli"
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" // scroll up; the first line shows "Autonomous Power State Transition Enable (APSTE): Enabled"
  • nano /etc/kernel/cmdline
  • Append "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" so the line reads:
  • "root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
  • proxmox-boot-tool refresh
  • Reboot
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" again // Autonomous Power State Transition Enable (APSTE): Disabled

No problems so far, for weeks now.
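Worth noting: /etc/kernel/cmdline plus proxmox-boot-tool applies to systemd-boot installs. On a GRUB-booted Proxmox host (a GRUB file is mentioned further down in the thread) the same two parameters would go into GRUB's default command line instead; roughly:

Code:
# Add the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
nano /etc/default/grub
update-grub          # regenerate the GRUB config
reboot
cat /proc/cmdline    # verify the parameters are active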
 
Are you sure this has fixed it and it's not just a long time since you last had the error? I had it stable like this for long stretches too.
I also definitely got the error while ASPM was disabled (disabled at the BIOS level).

Despite all that, I'm now sitting 100% stable with ASPM enabled on kernel 6.5.13-6-pve.
 
Are you sure this has fixed it and it's not just a long time since you last had the error? I had it stable like this for long stretches too.
I also definitely got the error while ASPM was disabled (disabled at the BIOS level).

Despite all that, I'm now sitting 100% stable with ASPM enabled on kernel 6.5.13-6-pve.
I'm pretty sure this was the problem. Normally we saw the error on one of our 10 servers 1-2 times per week (at random). Now there have been no issues for weeks, with no other changes.
 
I found a solution that worked for me:

  • Install nvme-cli: "apt-get install nvme-cli"
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" // scroll up; the first line shows "Autonomous Power State Transition Enable (APSTE): Enabled"
  • nano /etc/kernel/cmdline
  • Append "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" so the line reads:
  • "root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
  • proxmox-boot-tool refresh
  • Reboot
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" again // Autonomous Power State Transition Enable (APSTE): Disabled

No problems so far, for weeks now.
Thanks! But... I already have those in my GRUB file, and can confirm APSTE is Disabled for the two 990 Pro drives, but I get the following for my two U.2 drives:

Code:
# nvme get-feature /dev/nvme2 -f 0xc -H
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2)

Maybe I can find out what will return that info, if available, on them.
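One quick way to check, with the same nvme-cli tool, whether those U.2 drives support APST at all (which would explain the Invalid Field error on feature 0xc) is to look at the apsta field in the Identify Controller data:

Code:
# apsta : 0 means the controller does not support Autonomous Power State Transitions at all.
nvme id-ctrl /dev/nvme2 | grep -i apsta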

I've been hunting for the opposite kind of solution too: I can't figure out yet how to get ASPM to show that it's ENABLED.

I have:

  • Enabled ASPM in the BIOS
  • Removed 'pcie_aspm=off' from GRUB
  • Rebooted
  • Ran 'cat /proc/cmdline' to confirm it was as expected after the change
  • Ran 'lspci -vvv' and observed all PCIe devices still show LnkCtl of 'ASPM Disabled'
  • Added 'pcie_aspm=on' to GRUB
  • Ran 'cat /proc/cmdline' to confirm it was as expected after the change
  • Ran 'lspci -vvv' and observed all PCIe devices still show LnkCtl of 'ASPM Disabled'

I just want to prove out that this ASPM setting is actually doing something, and needed, not just anecdotal.
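One more thing that might help prove it out (not from the steps above, just a suggestion): the kernel's ASPM policy can be inspected and switched at runtime, which makes it easier to see whether the links ever change state:

Code:
# The bracketed entry is the policy currently in use.
cat /sys/module/pcie_aspm/parameters/policy
# Ask the kernel to enable ASPM where the link supports it (may fail if the
# firmware or the kernel command line has disabled ASPM entirely).
echo powersave > /sys/module/pcie_aspm/parameters/policy
# Re-check the link control/capability lines.
lspci -vvv | grep -i aspm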
 
I had installed a version above 8, but not the latest one, and I ran into this problem about a week ago. I'm running the Proxmox VE (PVE) system from an NVMe hard disk enclosure. Yesterday I updated to the latest version, 8.3.2, and also added the parameters above to GRUB. It's been a day and a night without any problem so far; before that, it happened several times a day. I hope there won't be any more error messages after the update.

To try to solve this problem, I also specifically bought a hard disk enclosure with the RTL9210 chip; before, it was the ASM2362 chip. Now it seems there may be some improvement after updating the Proxmox VE (PVE) version and the kernel. The current kernel is 6.8.12-5-pve. I just checked and it has been running for 18 hours. I'll continue to observe it.
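If anyone wants to confirm whether the error has come back without waiting for a drive to drop off, the kernel message from the thread title is easy to watch for:

Code:
# Search the current boot's kernel log for the power-state error.
journalctl -k -b | grep -i "unable to change power state"
# Or watch live as messages arrive:
dmesg -wT | grep -i "unable to change power state"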
 
