NVMe Issue: Unable to change power state from D3cold to D0, device inaccessible

Today it happened again. Looking forward to help from the professionals too.
Hi everyone, for now I think disabling `PCI Express Clock Gating` may be the solution. As I understand it, PCI Express Clock Gating turns off the clock signal when a device is idle to save power. Maybe some NVMe devices do not support this.
It currently works for me based on a simple A/B test; I had it running normally for over a month until I rebooted.
You can give it a try.
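For what it's worth, here is a read-only way to see whether the NVMe controller's link even advertises Clock Power Management and whether it is currently enabled. This is only a quick sketch with lspci; the 01:00.0 address below is just an example, find yours with the first command:

Code:
# find the PCI address of the NVMe controller(s)
lspci | grep -i 'non-volatile memory'
# inspect one of them (run as root so the capability blocks are shown)
lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkCtl|ClockPM'
# "ClockPM+" under LnkCap  -> the device advertises Clock Power Management
# "ClockPM-" under LnkCtl  -> the OS currently has it switched off on that link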
 
This is just awful. Is there a solution in sight? My M.2 drives (both 990 Pro) seem to be fine, but the U.2 drives will both drop off after some time (both of them within ~30-60 seconds of each other). What seems to make it more stable (less unstable) is making sure this is set to on:

/sys/class/nvme/nvme#/device/power/control

If it is set to auto (which I saw in several other installs, but not my latest for some reason), the drives drop off very quickly....
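In case it helps anyone reading along, this is roughly how to check and set it. A rough sketch only; nvme2 is just an example device, and the echo does not survive a reboot, so it would need a udev rule or startup script to make it stick:

Code:
# current setting for every NVMe controller ("on" = never runtime-suspend the PCI device, "auto" = allow it)
grep . /sys/class/nvme/nvme*/device/power/control
# force it to "on" for one controller (nvme2 is only an example; not persistent across reboots)
echo on > /sys/class/nvme/nvme2/device/power/control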

Just like some others' experience, these drives worked perfectly under ESXi.
 
I found a solution that worked for me:

  • Install nvme-cli: apt-get install nvme-cli
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" // scroll up, the first line shows "Autonomous Power State Transition Enable (APSTE): Enabled"
  • nano /etc/kernel/cmdline
  • Append "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" so the line reads:
  • "root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
  • proxmox-boot-tool refresh
  • Reboot
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" again // Autonomous Power State Transition Enable (APSTE): Disabled

No problems so far, and it's been weeks.
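Note: /etc/kernel/cmdline is only read on hosts that boot via proxmox-boot-tool/systemd-boot (typically ZFS-on-root with UEFI). If your host boots through GRUB instead, the same parameters belong in /etc/default/grub. Roughly like this sketch, keeping whatever options are already on that line:

Code:
# /etc/default/grub  ("quiet" stands in for whatever is already there)
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

# then apply and reboot
update-grub
reboot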
 
Are you sure this has fixed it, and it's not just been a long time since you had the error? I had it like this for long stretches.
I also definitely got the error while ASPM was disabled (disabled at BIOS level).

All that said - I'm now sitting 100% stable with ASPM enabled on kernel 6.5.13-6-pve.
 
I'm pretty sure that this was the problem. Normally we had the error on one of our 10 servers 1-2 times per week (random). Now no issues for weeks. No other changes.
 
I found a solution that worked for me:

  • Install nvme-cli: apt-get install nvme-cli
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" // scroll up, the first line shows "Autonomous Power State Transition Enable (APSTE): Enabled"
  • nano /etc/kernel/cmdline
  • Append "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" so the line reads:
  • "root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
  • proxmox-boot-tool refresh
  • Reboot
  • Check "nvme get-feature /dev/nvme0 -f 0xc -H" again // Autonomous Power State Transition Enable (APSTE): Disabled

No problems so far, and it's been weeks.
Thanks! But... I already have those in my GRUB file, and can confirm APSTE is Disabled for the two 990 Pro drives, but I get the following for my two U.2 drives:

Code:
# nvme get-feature /dev/nvme2 -f 0xc -H
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2)

Maybe I can find out what will return that info, if available, on them.
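One place that should at least say whether a drive implements APST at all is the Identify Controller data; if the apsta field is 0, the feature simply isn't there, which would also explain the Invalid Field error. A quick sketch, with nvme2 as the example device:

Code:
nvme id-ctrl /dev/nvme2 | grep -E '^apsta'
# apsta : 0  -> controller does not implement Autonomous Power State Transitions,
#               so get-feature 0x0c gets rejected
# apsta : 1  -> APST is supported and get-feature 0x0c should work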

I've been hunting for the opposite solution too. I can't figure out yet how to get ASPM to show that it's ENABLED.

I have:

Enabled ASPM in BIOS
Removed 'pcie_aspm=off' from GRUB
Rebooted
cat /proc/cmdline to confirm it was as expected after the change
'lspci -vvv' and observed all PCIe devices still show LnkCtl of 'ASPM Disabled'
Added 'pcie_aspm=on' to GRUB
cat /proc/cmdline to confirm it was as expected after the change
'lspci -vvv' and observed all PCIe devices still show LnkCtl of 'ASPM Disabled'

I just want to prove that this ASPM setting is actually doing something and is needed, not just anecdotal.
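For anyone else trying to verify this, these are the read-only places I know of to check. This only inspects the current state, it doesn't force ASPM on; if the BIOS/ACPI tables tell the kernel not to touch ASPM, lspci can keep showing it disabled no matter what is on the cmdline:

Code:
# active kernel ASPM policy (the value in [brackets] is the one in use)
cat /sys/module/pcie_aspm/parameters/policy
# whether the firmware told the kernel to leave ASPM alone
dmesg | grep -i aspm
# per link: LnkCap = what the hardware supports, LnkCtl = what is actually enabled
lspci -vvv | grep -E 'LnkCap:|LnkCtl:'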
 
I installed a version above 8, but it wasn't the latest one either. I encountered this problem about a week ago. I used an NVMe hard disk enclosure to install the Proxmox VE (PVE) system. Yesterday I updated it to the latest version, 8.3.2, and also added the above parameters in GRUB. It's been a day and a night without any problem so far. Before that, it happened several times a day. I hope there won't be any more error messages after the update.

To solve this problem, I specifically bought a hard disk enclosure with the RTL9210 chip this time; before, it was the ASM2362 chip. Now it seems that there may be some improvement after updating the Proxmox VE (PVE) version and the kernel. The current kernel is 6.8.12-5-pve. I just checked and it has been running for 18 hours. I'll continue to observe it.
 
We didn't find any solution for the Samsung 990 Pro problem - we tried everything. They just seem to get dropped every ~30 days (the longest runtime we got). We ended up replacing all of them in our servers and using a different vendor.
Interestingly, the issue itself seems to be firmware-related on the Samsung 990 Pro. For example: we had multiple servers with 990 Pros running and updated them (patch day). All of them dropped at least one SSD around ~30 days after that - and all servers within almost the same time frame (max. 1 hour difference, which could be because the uptimes were not 100% identical). So I guess it's some counter or similar overflowing and causing this. It can probably only be fixed by Samsung... but as it's not a server/datacenter product, who knows if they will do anything...
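If it really is something firmware-related, the installed firmware revision is at least easy to compare across the affected drives. A small sketch; this only reads what is currently running, and nvme0 is just an example device:

Code:
# firmware revision per NVMe device (the "FW Rev" column)
nvme list
# or for a single drive
nvme id-ctrl /dev/nvme0 | grep -E '^fr '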
 
Sadly enough that's the one drive I'm running the OS from. I will have to replace and rebuild.
 
So, is that it? I keep seeing a high load average, and the system becomes unresponsive until a reboot. What can I do about it? Any help is greatly appreciated.
 
As an FSA, I noticed today that there was a KVM process taking all my CPU resources. Once I killed it, everything became normal again. Why am I seeing a KVM issue? Any thoughts?
 
System crashed overnight again.
This thread was/is about the Samsung 990 Pro dropping out of the system (with the error message from the subject of the thread). If your system crashes, then you probably have different logs/effects and should probably open a different thread.
 
My bad, this thread came up when I searched for system hangs and high load average numbers. I'll start a new thread.
 
