NVMe Issue: Unable to change power state from D3cold to D0, device inaccessible

I have the same issue with one of my two Samsung PM9A3 U.2 NVMe drives.
I tried applying the fix, but it doesn't seem to work:


Code:
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
GRUB_CMDLINE_LINUX=""
I ran update-grub afterwards and rebooted the system.
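For anyone double-checking this kind of change: after update-grub and a reboot, you can verify that the parameters actually reached the kernel and the nvme_core module. This is only a quick sanity check, not a fix:

Code:
# the running kernel's command line should contain both parameters
cat /proc/cmdline

# the nvme_core module should report the latency override (expect 0 here)
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us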
 

Attachment: Bildschirmfoto 2024-02-03 um 00.00.43.png (screenshot)
That seemed to work for some time, but when I cloned a VM template it triggered the issue again.
 
Thank you for the suggestions! In my case, the issue was actually caused by a faulty SFF-8643 to SFF-8639 (U.2) cable. I replaced it yesterday, and the system has been stable since.
 
I just registered here to post the same issue as well.

For completeness my setup:
Supermicro X10SRi-F
Debian Bookworm
2x Samsung 990 Pro 4TB (with heatsink), like some of the others reporting this issue.

Both NVMe drives are on the (currently) latest firmware, 4B2QJXD7, sit on passive PCIe adapter cards, and run in a mirror.

The same SSD always fails after about one and a half days; I have tested this several times.
I then tried the kernel parameters suggested in the error message, and the SSDs were stable for about 14 days.
After that, just out of curiosity, I removed the kernel parameters again and swapped the PCIe slot placement of the two cards.
Now the other of the two NVMe drives disappears...

So it seems to depend on the order of the two cards on the PCIe bus, and on the kernel (module) triggering certain events, which suggests a timing issue?
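One way to put numbers on the slot/ordering suspicion would be to compare the PCIe link and ASPM state of the two drives in whichever slots they currently sit in. A rough sketch (the 01:00.0 address is just a placeholder for whatever lspci reports for your NVMe controllers):

Code:
# find the PCIe addresses of the NVMe controllers
lspci -nn | grep -i 'non-volatile'

# for each address, dump link capability, control (ASPM) and status
lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkCtl|LnkSta'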
 
I experienced the same issue and also added the kernel parameter now.
Not sure if this is relevant or somehow related, but I am using four NVMe drives on a PCIe extension card with PCIe bifurcation enabled (x4x4x4x4), and one of them "disappeared".
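When one of the four drops out, it can help to note exactly which controller is missing, so you can tell whether it is always the same lane group on the bifurcation card. A minimal check, assuming nvme-cli is installed:

Code:
# namespaces the kernel currently sees (compare against the expected four)
nvme list

# recent kernel messages about NVMe resets/removals
dmesg | grep -iE 'nvme.*(reset|remov|down)'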
 
(quoting the Supermicro X10SRi-F / Samsung 990 Pro post above)
The Samsung PRO series is consumer flash and not suitable for a server/RAID setup; it will wear out quickly.
The Samsung PM series is meant for server/RAID use.

Code:
Example: SAMSUNG PM9A3 NVME M2
3,84TB NVME M2 - SKU: MZ1L23T8HBLA-00A07
1,92TB NVME M2 - SKU: MZ1L21T9HCLS-00A07
960GB NVME M2 - SKU: MZ1L2960HCJR-00A07

https://semiconductor.samsung.com/ssd/datacenter-ssd/pm9a3/


A possible solution for your problem: use an M.2 carrier board with a built-in PCIe switch, which doesn't need BIOS bifurcation support.
Code:
Example:
Supermicro AOC-SHG3-4M2P card
QNAP QM2-4P-384 card
Synology M2D20 card
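On the wear point above: whether or not consumer flash wear is the cause of these dropouts, it is easy to keep an eye on with smartmontools. A minimal check, assuming the drive shows up as /dev/nvme0:

Code:
# NVMe health summary; watch "Percentage Used" and "Data Units Written"
smartctl -a /dev/nvme0

# NVMe error information log
smartctl -l error /dev/nvme0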
 
I am new to PVE. Just installed PVE 8.1.4 and a newly bought Kingston NV2 250GB as LVM.
It shows this error periodically:
Code:
Connection failed (Error 500: unable to open file '/var/tmp/pve-reserved-ports.tmp.5189' - Read-only file system)

Later I found out that the root cause is the issue described in this thread.
I tried adding the following, but no luck; it fails again within minutes to no more than an hour:
Code:
nvme_core.default_ps_max_latency_us=0
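A quick way to confirm that the read-only root filesystem really comes from the NVMe dropping out (and not, say, filesystem corruption) is to look at the kernel log right after the Error 500 appears. This is only a diagnostic sketch:

Code:
# kernel messages for the current boot mentioning nvme; look for
# "controller is down" / "Removing after probe failure"
journalctl -k -b | grep -i nvme

# check whether the root filesystem has been remounted read-only
findmnt -no OPTIONS /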
 

Can you please send the output of pvecm status
 
Can you please send the output of pvecm status
Code:
root@pve:~# pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?

No. It is not part of a cluster.

I have found an old Kingston A2000; so far it has been running normally for an hour.
 
I'm glad I found this thread, which was active as recently as a few weeks ago. I'm having this exact same issue in TrueNAS SCALE, which is similar to Proxmox in that it's a hypervisor built on top of Debian.

I have a ZFS pool that's a three-way NVMe SSD mirror on which I run a number of VMs. One of the SSDs in the pool periodically goes offline, putting the pool into a degraded state (but still functional, since it's a mirror). When this happens, I get the same error messages in /var/log/messages as all of you did:

Code:
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Apr 15 20:04:06 patrick-server1 kernel: nvme 0000:10:00.0: enabling device (0000 -> 0002)
Apr 15 20:04:06 patrick-server1 kernel: nvme nvme0: Removing after probe failure status: -19

After a failure such as this, if I reboot the server, the SSD typically comes back online, and after a short resilvering operation, it's OK for a while. That could be a couple of hours until it happens again, a couple of days, or if I'm lucky it will be a couple of weeks before the failure occurs again.

This problem started for me a couple of months ago, in late February 2024.

Here is some more detail. First, this only happens with the one SSD installed on a PCIe card; it does not happen with the two SSDs installed in M.2 slots directly on the motherboard. Second, I have tried switching SSDs, including adding a brand-new SSD, and that doesn't fix the problem. Third, I have tried swapping out the PCIe card that holds the SSD, and that doesn't fix it either. Fourth, the kernel version is 6.1.74.

It was only yesterday, when it happened again, that I thought to look in /var/log/messages, so I have not yet tried the recommended fix of disabling the power saving mode with nvme_core.default_ps_max_latency_us=0 pcie_aspm=off.

I'm going to try that now, but I'm not optimistic that it will solve the problem since some in this thread have tried it and still had SSD failure. Also, it doesn't seem like an ideal solution because it's bound to increase power consumption overall on the server. Still, since I can't think of anything else to try, I will try it.
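If the parameters do get applied, it may be worth verifying afterwards that ASPM and the drive's autonomous power state transitions are actually off, rather than assuming the boot change took effect. A rough sketch, assuming nvme-cli is available and the drive is /dev/nvme0 (feature 0x0c is the NVMe APST feature):

Code:
# the kernel usually logs that ASPM is disabled when pcie_aspm=off is in effect
dmesg | grep -i aspm

# current APST (autonomous power state transition) setting on the drive;
# with default_ps_max_latency_us=0 it should end up disabled
nvme get-feature /dev/nvme0 -f 0x0c -H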

If anyone has anything more to report, I would appreciate hearing it.
 
