I'm glad I found this thread, which was active as recently as a few weeks ago. I'm having the exact same issue on TrueNAS SCALE, which is similar to Proxmox in that it's a hypervisor built on top of Debian.
I have a ZFS pool that's a three-way NVMe SSD mirror, on which I run a number of VMs. One of the SSDs in the pool periodically goes offline, putting the pool into a degraded state (still functional, since it's a mirror). When this happens, I get the exact same error messages in /var/log/messages as all of you did:
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Apr 15 20:04:06 patrick-server1 kernel: nvme 0000:10:00.0: enabling device (0000 -> 0002)
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: Removing after probe failure status: -19
After a failure like this, if I reboot the server, the SSD typically comes back online, and after a short resilver it's OK for a while. That could be a couple of hours until it happens again, a couple of days, or, if I'm lucky, a couple of weeks before the next failure.
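For reference, this is roughly how I keep an eye on the pool when it degrades and nudge the drive back in after a reboot. The pool and device names below are just placeholders for mine, so adjust them for your setup.

# Show pool health, which disk faulted/offlined, and resilver progress
zpool status -v tank

# If the disk reappears after the reboot but ZFS hasn't picked it up yet,
# ask ZFS to bring it back into the mirror (this kicks off the resilver)
zpool online tank nvme0n1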
This problem started for me a couple of months ago, in late February 2024.
Here is some more detail. First, this only happens with the one SSD installed on a PCIe adapter card; it does not happen with the two SSDs installed in M.2 slots directly on the motherboard. Second, I have tried swapping SSDs, including adding a brand-new SSD, and that doesn't fix the problem. Third, I have tried swapping out the PCIe card that holds the SSD, and that doesn't fix it either. Fourth, the kernel version is 6.1.74.
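In case anyone wants to compare notes, these are the checks I've been running against the drive between drop-outs. I'm assuming the drive shows up as /dev/nvme0 and that nvme-cli and smartmontools are available on your install, so adjust as needed.

# Drive's own health log: critical warnings, media errors, unsafe shutdowns
nvme smart-log /dev/nvme0

# Same information via smartmontools
smartctl -a /dev/nvme0

# Power states and APST support the controller advertises
nvme id-ctrl /dev/nvme0 | grep -iE '^ps |apsta'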
It was only yesterday, when it happened again, that I thought to look in /var/log/messages, so I have not yet tried the recommended fix of disabling the power-saving mode with nvme_core.default_ps_max_latency_us=0 pcie_aspm=off.
I'm going to try that now, but I'm not optimistic that it will solve the problem, since some in this thread have tried it and still had the SSD fail. It also doesn't seem like an ideal solution, because it's bound to increase the server's overall power consumption. Still, since I can't think of anything else, I'll give it a try, as sketched below.
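For anyone else on SCALE wanting to try the same thing, this is how I plan to apply it. On plain Debian or Proxmox you'd edit GRUB directly; SCALE manages its own bootloader, so I intend to go through its advanced settings instead. I'm only assuming the midclt call below is the right mechanism on recent SCALE versions, so double-check it before copying.

# See what the kernel is currently running with before changing anything
cat /proc/cmdline
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

# Plain Debian/Proxmox: add the options to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
# then rebuild the config and reboot
update-grub

# TrueNAS SCALE (my assumption: recent releases expose extra kernel options
# through the middleware rather than hand-edited GRUB files)
midclt call system.advanced.update '{"kernel_extra_options": "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"}'

# After rebooting, confirm the options actually took effect
cat /proc/cmdline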
If anyone has anything more to report, I would appreciate hearing it.