Proxmox just died with: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

Looking into this further, I think it might just be an issue with power states:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1705748
https://bugzilla.kernel.org/show_bug.cgi?id=195039
https://docs.microsoft.com/en-us/wi...-management-for-storage-hardware-devices-nvme

On the WD SN850 I have:

Code:
ps    0 : mp:9.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:9.00W
ps    1 : mp:4.10W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:4.10W
ps    2 : mp:3.50W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:3.50W
ps    3 : mp:0.0250W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:0.0250W active_power:-
ps    4 : mp:0.0050W non-operational enlat:5000 exlat:45000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:0.0050W active_power:-

Code:
nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
    Autonomous Power State Transition Enable (APSTE): Enabled
    Auto PST Entries    .................
    Entry[ 0]
    .................
    Idle Time Prior to Transition (ITPT): 750 ms
    Idle Transition Power State   (ITPS): 3
    .................
    Entry[ 1]
    .................
    Idle Time Prior to Transition (ITPT): 750 ms
    Idle Transition Power State   (ITPS): 3
    .................
    Entry[ 2]
    .................
    Idle Time Prior to Transition (ITPT): 750 ms
    Idle Transition Power State   (ITPS): 3
    .................
    Entry[ 3]
    .................
    Idle Time Prior to Transition (ITPT): 2500 ms
    Idle Transition Power State   (ITPS): 4
    .................
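For reference, the non-operational states are the ones APST drops into when the drive is idle (PS3 and PS4 in the table above). Here is a minimal sketch for pulling them out of `nvme id-ctrl` output, assuming the line format shown above. The `nonop_states` helper name is made up for illustration.

```shell
# List the non-operational power states from `nvme id-ctrl` style text on stdin.
# enlat/exlat are the entry/exit latencies, reported in microseconds.
nonop_states() {
    awk '$1 == "ps" && /non-operational/ {
        ps = $2; enlat = "?"; exlat = "?"
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^enlat:/) { enlat = substr($i, 7) }
            if ($i ~ /^exlat:/) { exlat = substr($i, 7) }
        }
        printf "ps%s enlat=%sus exlat=%sus\n", ps, enlat, exlat
    }'
}

# Usage on a live system (needs root and nvme-cli):
#   nvme id-ctrl /dev/nvme0 | nonop_states
```

Fed the SN850 table above, this picks out ps3 (exit latency 10000 us) and ps4 (exit latency 45000 us), which are the two states the APST entries point at.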
 
I think I may be near the end of the journey https://git.launchpad.net/~ubuntu-k.../?id=47add9f75714fabd3702dca0e5899a56d2f3ee2f

Essentially, it seems the deepest power states don't work reliably on some SSDs under Linux, and there's a quirk patch to work around it.

That said, the annoying part is working out how to reliably reproduce the error, so that I can be confident in the fix.
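Short of a reliable reproducer, one way to at least compare before and after is to count how often the reset message shows up in the kernel log per boot. This is a rough sketch only: the `count_nvme_resets` name is hypothetical, and it assumes the message text matches the one at the top of this thread.

```shell
# Count kernel log lines on stdin that report an NVMe controller reset.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean while still printing "0".
count_nvme_resets() {
    grep -c 'controller is down; will reset' || true
}

# Usage (current boot, then previous boot; the latter needs a persistent journal):
#   journalctl -k -b 0  | count_nvme_resets
#   journalctl -k -b -1 | count_nvme_resets
```

A zero count after a change is not proof, since the failure is intermittent, but a non-zero count shows the change did not help.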

I've sent an email to the linux-nvme mailing list offering to try the quirk: http://lists.infradead.org/pipermail/linux-nvme/2022-May/thread.html

I see there's a fairly recent one as well on the same topic, but for a Seagate Firecuda 530: http://lists.infradead.org/pipermail/linux-nvme/2022-May/031923.html

I just worked out how to search the linux-nvme mailing list. You can see there are loads of reports on this: https://lore.kernel.org/linux-nvme/?q="controller+is+down"
 
Hi all,

I'm not a Proxmox user, but I just registered because this thread led me to the solution after my WD SN850 started "failing" under Ubuntu 20.04 with the 5.18 Liquorix kernel. I figured it was a power management issue, which I solved by deactivating PCIe APM in the BIOS and using the following kernel parameters at boot:

Code:
acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Not all of those may be necessary, but I haven't bothered to narrow it down any further yet. I'm just happy I can still access my drive 5 minutes after boot! :)
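To check that the parameters actually took effect after a reboot, something like the sketch below may help. The `cmdline_has` helper is hypothetical, and its second argument exists only so it can be pointed at a file other than `/proc/cmdline`. As far as I understand, `nvme_core.default_ps_max_latency_us=0` is the parameter that turns APST off, so it is probably the first one to test on its own. On GRUB-based installs the parameters usually go into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `update-grub`.

```shell
# Report (via exit status) whether a kernel parameter is on the command line.
#   $1: exact parameter, e.g. nvme_core.default_ps_max_latency_us=0
#   $2: cmdline file, defaults to /proc/cmdline
cmdline_has() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Usage on the affected machine:
#   cmdline_has nvme_core.default_ps_max_latency_us=0 && echo "parameter active"
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us   # should print 0
#   nvme get-feature -f 0x0c -H /dev/nvme0                           # APSTE should show Disabled
```

Dropping one parameter at a time and re-running these checks is one way to narrow down which of the three is really needed.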

Thanks for the pointers, especially @marcosscriven, and I hope you got the issue sorted yourself!


Cheers,
r.
 
