Proxmox just died with: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

Looking into this further, I think it might just be an issue with power states:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1705748
https://bugzilla.kernel.org/show_bug.cgi?id=195039
https://docs.microsoft.com/en-us/wi...-management-for-storage-hardware-devices-nvme

On the WD SN850 I have:

Code:
ps    0 : mp:9.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:9.00W
ps    1 : mp:4.10W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:4.10W
ps    2 : mp:3.50W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:3.50W
ps    3 : mp:0.0250W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:0.0250W active_power:-
ps    4 : mp:0.0050W non-operational enlat:5000 exlat:45000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:0.0050W active_power:-

Code:
nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
    Autonomous Power State Transition Enable (APSTE): Enabled
    Auto PST Entries    .................
    Entry[ 0]
    .................
    Idle Time Prior to Transition (ITPT): 750 ms
    Idle Transition Power State   (ITPS): 3
    .................
    Entry[ 1]
    .................
    Idle Time Prior to Transition (ITPT): 750 ms
    Idle Transition Power State   (ITPS): 3
    .................
    Entry[ 2]
    .................
    Idle Time Prior to Transition (ITPT): 750 ms
    Idle Transition Power State   (ITPS): 3
    .................
    Entry[ 3]
    .................
    Idle Time Prior to Transition (ITPT): 2500 ms
    Idle Transition Power State   (ITPS): 4
    .................
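To make the listings above easier to compare against a latency setting, here is a minimal sketch that condenses the power-state table into one line per state. It assumes the table comes from `nvme id-ctrl` and that `enlat` and `exlat` are in microseconds; the function name `summarize_power_states` is made up for illustration.

```shell
# Summarize NVMe power states read from stdin (format as in the listing above).
# Prints: ps<N> <operational|non-operational> total_latency_us=<enlat+exlat>
summarize_power_states() {
    awk '
        # Only the first line of each state starts with "ps <N> :";
        # the indented continuation lines are skipped.
        $1 == "ps" && $3 == ":" {
            enlat = 0; exlat = 0
            for (i = 4; i <= NF; i++) {
                if ($i ~ /^enlat:/) { split($i, a, ":"); enlat = a[2] }
                if ($i ~ /^exlat:/) { split($i, a, ":"); exlat = a[2] }
            }
            kind = ($5 == "non-operational") ? "non-operational" : "operational"
            printf "ps%s %s total_latency_us=%d\n", $2, kind, enlat + exlat
        }
    '
}
```

Possible usage, assuming the drive is `/dev/nvme0`: `nvme id-ctrl /dev/nvme0 | summarize_power_states`. For the SN850 table above this reports ps3 and ps4 as the only non-operational states, which are the two targets of the APST entries.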
 
I think I may be near the end of the journey https://git.launchpad.net/~ubuntu-k.../?id=47add9f75714fabd3702dca0e5899a56d2f3ee2f

Essentially, it seems some deep power states are not working on some SSDs on Linux, and there's a quirk patch.

That said, the annoying thing is how to reliably reproduce this error in order to be confident in the solution.
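Since the failure is hard to trigger on demand, one rough way to judge a fix is to count how often the reset message shows up in the kernel log before and after the change. This is only a sketch: the function name `count_controller_resets` is hypothetical, and it matches the exact "controller is down; will reset" wording from the log line quoted at the top of the thread.

```shell
# Count "controller is down" events per NVMe controller in kernel log text on stdin.
# Prints one line per controller: <controller> <count>
count_controller_resets() {
    awk '
        /controller is down; will reset/ {
            # The controller name follows the driver prefix, e.g. "nvme nvme0:".
            for (i = 1; i < NF; i++) {
                if ($i == "nvme" && $(i + 1) ~ /^nvme[0-9]+:$/) {
                    name = $(i + 1)
                    sub(/:$/, "", name)
                    count[name]++
                    break
                }
            }
        }
        END { for (n in count) printf "%s %d\n", n, count[n] }
    '
}
```

Possible usage: `dmesg | count_controller_resets` for the current boot, or `journalctl -k -b -1 | count_controller_resets` for the previous one if the journal is persistent.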

I've sent an email to the linux-nvme mailing list offering to try the quirk: http://lists.infradead.org/pipermail/linux-nvme/2022-May/thread.html

I see there's a fairly recent one as well on the same topic, but for a Seagate Firecuda 530: http://lists.infradead.org/pipermail/linux-nvme/2022-May/031923.html

I just worked out how to search the linux-nvme mailing list. You can see there are loads of reports on this: https://lore.kernel.org/linux-nvme/?q="controller+is+down"
 
Hi all,

I'm not a Proxmox user, but I just registered because this thread led me to the solution after my WD SN850 started "failing" under Ubuntu 20.04 with the 5.18 Liquorix kernel. I figured it was a power management issue, which I solved by deactivating PCIe APM in the BIOS and using the following kernel parameters at boot:

Code:
acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Not all of those may be necessary, but I could not be bothered yet to narrow it down any further - I'm happy I can access my drive again even 5min after boot! :)
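When narrowing the parameters down, it helps to confirm what the running kernel actually picked up. A minimal sketch follows, assuming the `nvme_core` parameter is exposed at its usual sysfs path; the function name `apst_latency_status` is hypothetical, and the optional path argument exists only so the check can be tried against a test file.

```shell
# Report the effective nvme_core.default_ps_max_latency_us value.
# $1 (optional): path to the parameter file; defaults to the usual sysfs location.
apst_latency_status() {
    param_file=${1:-/sys/module/nvme_core/parameters/default_ps_max_latency_us}
    if [ ! -r "$param_file" ]; then
        echo "unknown: cannot read $param_file"
        return 1
    fi
    value=$(cat "$param_file")
    if [ "$value" = "0" ]; then
        # A value of 0 is the setting used in this thread to turn APST off.
        echo "APST disabled (default_ps_max_latency_us=0)"
    else
        echo "APST latency budget: ${value} us"
    fi
}
```

Calling `apst_latency_status` with no argument after a reboot shows whether the boot parameter took effect; `nvme get-feature -f 0x0c -H /dev/nvme0` (as used earlier in the thread) shows the drive's side of it.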

Thanks for the pointers especially @marcosscriven and I hope you did get the issue sorted yourself!


Cheers,
r.
 
Hey guys, I have a Samsung 980 Pro SSD with the same problem. What fixed it for me was updating the BIOS. I'm on a Ryzen AM4 platform.
 
In a recent experiment with a 4x NVMe NAS setup I faced the same issue, and it turned out to be tied to insufficient power.
The system was stable under any workload with 3 drives, stable with 4 drives when only reading lightly, and failed reliably with 4 drives when reading intensively (for example ZFS scrubbing) or writing (ZFS rebuild).
Fast NVMe drives and old drives can draw more than 2.5 A on the 3.3 V rail.
When I installed slower, less power-hungry drives, the instability was gone for good, and the difference from the outside was non-existent.
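To put the figure above into perspective, here is a back-of-the-envelope sketch of the worst-case load on the 3.3 V rail. The 2.5 A and 3.3 V values are the ones quoted in this post, not measurements, and the function name `rail_power_watts` is made up for illustration.

```shell
# Worst-case power on one rail: amps per drive * volts * number of drives.
# Usage: rail_power_watts <amps_per_drive> <volts> <drive_count>
rail_power_watts() {
    awk -v amps="$1" -v volts="$2" -v drives="$3" \
        'BEGIN { printf "%.2f\n", amps * volts * drives }'
}
```

For example, `rail_power_watts 2.5 3.3 1` gives the per-drive peak and `rail_power_watts 2.5 3.3 4` the total for four drives, which can then be compared with what the adapter card or power supply is rated to deliver on 3.3 V.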
 
Hello everyone,

I'm facing the same problem with Proxmox with an NVMe KC3000 disk. The Proxmox version is 8.4.0 and kernel "Linux bius 6.14.0-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.0-2 (2025-04-10T17:57Z) x86_64 GNU/Linux"

I installed a virtual machine with AlmaLinux. The installation goes well, but when I connect to the virtual machine and try to update with dnf update, it displays the following error:

systemd x86_64 252-46.el9_5.3.alma.1 baseos 4.0 M
systemd-libs x86_64 252-46.el9_5.3.alma.1 baseos 672 k
systemd-pam x86_64 252-46.el9_5.3.alma.1 baseos 278 k
systemd-rpm-macros noarch 252-46.el9_5.3.alma.1 baseos 66 k
systemd-udev x86_64 252-46.el9_5.3.alma.1 baseos 1.9 M
tuned noarch 2.24.0-2.el9_5.alma.1 baseos 340 k
tzdata noarch 2025b-1.el9 baseos 430 k
unzip x86_64 6.0-58.el9_5 baseos 180 k
webkit2gtk3-jsc x86_64 2.48.1-1.el9_5 appstream 4.7 M
Installing dependencies:
grub2-tools-efi x86_64 1:2.06-94.el9_5.alma.1 baseos 539 k
grub2-tools-extra x86_64 1:2.06-94.el9_5.alma.1 baseos 840 k

Transaction Summary
============================================================================================================================================================================
Install 6 Packages
Upgrade 135 Packages

mapping pread: Input/output error
mapping pread: Input/output error
mapping pread: Input/output error
mapping pread: Input/output error
mapping pread: Input/output error
mapping pread: Input/output error
mapping pread: Input/output error
mapping pread: Input/output error
mapping pread: Input/output error
mapping pread: Input/output error
mapping pread: Input/output error

and in the dmesg of the Proxmox host I have:

root@bius:~# dmesg
[ 2834.725144] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 2834.725149] nvme nvme1: Does your device have a faulty power saving mode enabled?
[ 2834.725152] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[ 2834.761172] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ 2834.761286] nvme nvme1: Disabling device after reset failure: -19
[ 2834.768565] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=34817, sector=37660680, nr_sectors = 8 limit=0
[ 2834.768589] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=34817, sector=36452904, nr_sectors = 16 limit=0
[ 2834.768608] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=34817, sector=36452848, nr_sectors = 16 limit=0
[ 2834.768624] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=34817, sector=36402352, nr_sectors = 8 limit=0
[ 2834.768642] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=34817, sector=33502149, nr_sectors = 22 limit=0
[ 2834.768659] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=0, sector=33570912, nr_sectors = 8 limit=0
[ 2834.776422] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=34817, sector=37660680, nr_sectors = 8 limit=0
[ 2834.776437] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=34817, sector=36452904, nr_sectors = 16 limit=0
[ 2834.776448] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=34817, sector=36452848, nr_sectors = 16 limit=0
[ 2834.776459] iou-wrk-8981: attempt to access beyond end of device
nvme1n1: rw=34817, sector=36402352, nr_sectors = 8 limit=0
root@bius:~# uname -a
Linux bius 6.14.0-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.0-2 (2025-04-10T17:57Z) x86_64 GNU/Linux
root@bius:~#

I have already set the options in the grub file /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

But the error persists. I have also updated the motherboard BIOS (Asus PRIME B550M-A) and disabled APM in the BIOS.
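One thing worth ruling out is that the options never reached the running kernel. Edits to /etc/default/grub only apply after `update-grub` and a reboot, and, as far as I know, Proxmox hosts that boot through systemd-boot (typically ZFS-on-root with UEFI) read their command line from /etc/kernel/cmdline and need `proxmox-boot-tool refresh` instead. A minimal sketch to check, with a hypothetical helper name `missing_kernel_params`:

```shell
# Print every expected parameter that is NOT present in the given command line.
# Usage: missing_kernel_params "<cmdline string>" <param> [<param> ...]
missing_kernel_params() {
    cmdline=$1
    shift
    for param in "$@"; do
        # Pad with spaces so only whole, exact parameters match.
        case " $cmdline " in
            *" $param "*) ;;
            *) echo "$param" ;;
        esac
    done
}
```

Possible usage on the host: `missing_kernel_params "$(cat /proc/cmdline)" nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`. No output means all three options are active; any line printed is an option the running kernel did not receive.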


Any ideas on what I can do?

Thanks in advance.