NVMe Issue: Unable to change power state from D3cold to D0, device inaccessible

I have the same issue with one of my two Samsung PM9A3 U.2 NVMe drives.
I tried applying the fix, but it doesn't seem to work:


Code:
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
GRUB_CMDLINE_LINUX=""
I ran update-grub afterwards and rebooted the system.
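For anyone double-checking this kind of change: after update-grub and a reboot, you can verify that the parameters actually reached the kernel and the nvme_core module. This is only a quick sanity check, not a fix:

Code:
# the running kernel's command line should contain both parameters
cat /proc/cmdline

# the nvme_core module should report the latency override (expect 0 here)
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us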
 

Attachment: Bildschirmfoto 2024-02-03 um 00.00.43.png (screenshot)
That seemed to work for some time, but when I cloned a VM template it triggered the issue again.
 
Thank you for the suggestions! In my case, the issue was actually caused by a faulty SFF-8643 to SFF-8639 (U.2) cable. I replaced it yesterday, and the system has been stable since.
 
I just registered here to post the same issue as well.

For completeness my setup:
Supermicro X10SRi-F
Debian Bookworm
2x Samsung 990 Pro 4TB (with heatsink), like some of the others reporting this issue.

Both NVMe drives are on the (currently) latest firmware, 4B2QJXD7, sit on passive PCIe adapter cards, and run in a mirror.

The same SSD always fails after about one and a half days; I have tested this several times.
I then tried the kernel parameters suggested in the error message, and the SSDs were stable for about 14 days.
After that, just out of curiosity, I removed the kernel parameters again and swapped the PCIe slot placement of the two cards.
Now the other of the two NVMe drives disappears...

So it seems to depend on the order of the two cards on the PCIe bus, and on the kernel (module) triggering certain events, which suggests a timing issue?
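One way to put numbers on the slot/ordering suspicion would be to compare the PCIe link and ASPM state of the two drives in whichever slots they currently sit in. A rough sketch (the 01:00.0 address is just a placeholder for whatever lspci reports for your NVMe controllers):

Code:
# find the PCIe addresses of the NVMe controllers
lspci -nn | grep -i 'non-volatile'

# for each address, dump link capability, control (ASPM) and status
lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkCtl|LnkSta'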
 
I experienced the same issue and also added the kernel parameter now.
Not sure if this is relevant or somehow related, but I am using four NVMe drives on a PCIe extension card with PCIe bifurcation enabled (x4x4x4x4), and one of them "disappeared".
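When one of the four drops out, it can help to note exactly which controller is missing, so you can tell whether it is always the same lane group on the bifurcation card. A minimal check, assuming nvme-cli is installed:

Code:
# namespaces the kernel currently sees (compare against the expected four)
nvme list

# recent kernel messages about NVMe resets/removals
dmesg | grep -iE 'nvme.*(reset|remov|down)'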
 
(quoting the Supermicro X10SRi-F / Samsung 990 Pro post above)
The Samsung PRO series is consumer flash and not suitable for a server/RAID setup; it will wear out quickly.
The Samsung PM series is meant for server/RAID use.

Code:
Example: SAMSUNG PM9A3 NVME M2
3,84TB NVME M2 - SKU: MZ1L23T8HBLA-00A07
1,92TB NVME M2 - SKU: MZ1L21T9HCLS-00A07
960GB NVME M2 - SKU: MZ1L2960HCJR-00A07

https://semiconductor.samsung.com/ssd/datacenter-ssd/pm9a3/


A possible solution for your problem: use an M.2 carrier board with a built-in PCIe switch, which doesn't need BIOS bifurcation support.
Code:
Example:
Supermicro AOC-SHG3-4M2P card
QNAP QM2-4P-384 card
Synology M2D20 card
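On the wear point above: whether or not consumer flash wear is the cause of these dropouts, it is easy to keep an eye on with smartmontools. A minimal check, assuming the drive shows up as /dev/nvme0:

Code:
# NVMe health summary; watch "Percentage Used" and "Data Units Written"
smartctl -a /dev/nvme0

# NVMe error information log
smartctl -l error /dev/nvme0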
 
I am new to PVE. Just installed PVE 8.1.4 and a newly bought Kingston NV2 250GB as LVM.
It shows this error periodically:
Code:
Connection failed (Error 500: unable to open file '/var/tmp/pve-reserved-ports.tmp.5189' - Read-only file system)

Later I found out that the root cause is the issue described in this thread.
I tried adding the following, but no luck; it fails again within minutes to no more than an hour:
Code:
nvme_core.default_ps_max_latency_us=0
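A quick way to confirm that the read-only root filesystem really comes from the NVMe dropping out (and not, say, filesystem corruption) is to look at the kernel log right after the Error 500 appears. This is only a diagnostic sketch:

Code:
# kernel messages for the current boot mentioning nvme; look for
# "controller is down" / "Removing after probe failure"
journalctl -k -b | grep -i nvme

# check whether the root filesystem has been remounted read-only
findmnt -no OPTIONS /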
 

Can you please send the output of pvecm status
 
Can you please send the output of pvecm status
Code:
root@pve:~# pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?

No. It is not part of a cluster.

I have found an old Kingston A2000; so far it has been running normally for an hour.
 
I'm glad I found this thread, which was active as recently as a few weeks ago. I'm having this exact same issue in TrueNAS SCALE, which is similar to Proxmox in that it's a hypervisor built on top of Debian.

I have a ZFS pool that's a three-way NVMe SSD mirror on which I run a number of VMs. One of the SSDs in the pool periodically goes offline, putting the pool into a degraded state (but still functional, since it's a mirror). When this happens, I get the same error messages in /var/log/messages as all of you did:

Code:
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Apr 15 20:04:06 patrick-server1 kernel: nvme 0000:10:00.0: enabling device (0000 -> 0002)
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: Removing after probe failure status: -19

After a failure such as this, if I reboot the server, the SSD typically comes back online, and after a short resilvering operation, it's OK for a while. That could be a couple of hours until it happens again, a couple of days, or if I'm lucky it will be a couple of weeks before the failure occurs again.

This problem started for me a couple of months ago, in late February 2024.

Here is some more detail. First, this only happens with the one SSD installed on a PCIe card; it does not happen with the two SSDs installed in M.2 slots directly on the motherboard. Second, I have tried switching SSDs, including adding a brand-new SSD, and that doesn't fix the problem. Third, I have tried swapping out the PCIe card that holds the SSD, and that doesn't fix it either. Fourth, the kernel version is 6.1.74.

It was only yesterday, when it happened again, that I thought to look in /var/log/messages, so I have not yet tried the recommended fix of disabling the power saving mode with nvme_core.default_ps_max_latency_us=0 pcie_aspm=off.

I'm going to try that now, but I'm not optimistic that it will solve the problem since some in this thread have tried it and still had SSD failure. Also, it doesn't seem like an ideal solution because it's bound to increase power consumption overall on the server. Still, since I can't think of anything else to try, I will try it.
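If the parameters do get applied, it may be worth verifying afterwards that ASPM and the drive's autonomous power state transitions are actually off, rather than assuming the boot change took effect. A rough sketch, assuming nvme-cli is available and the drive is /dev/nvme0 (feature 0x0c is the NVMe APST feature):

Code:
# the kernel usually logs that ASPM is disabled when pcie_aspm=off is in effect
dmesg | grep -i aspm

# current APST (autonomous power state transition) setting on the drive;
# with default_ps_max_latency_us=0 it should end up disabled
nvme get-feature /dev/nvme0 -f 0x0c -H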

If anyone has anything more to report, I would appreciate hearing it.
 
