Proxmox just died with: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

Looking into this further, I think it might just be an issue with power states:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1705748
https://bugzilla.kernel.org/show_bug.cgi?id=195039
https://docs.microsoft.com/en-us/wi...-management-for-storage-hardware-devices-nvme

On the WD SN850 I have:

Code:
ps    0 : mp:9.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:9.00W
ps    1 : mp:4.10W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:4.10W
ps    2 : mp:3.50W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:3.50W
ps    3 : mp:0.0250W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:0.0250W active_power:-
ps    4 : mp:0.0050W non-operational enlat:5000 exlat:45000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:0.0050W active_power:-

Code:
nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
    Autonomous Power State Transition Enable (APSTE): Enabled
    Auto PST Entries    .................
    Entry[ 0]
    .................
    Idle Time Prior to Transition (ITPT): 750 ms
    Idle Transition Power State   (ITPS): 3
    .................
    Entry[ 1]
    .................
    Idle Time Prior to Transition (ITPT): 750 ms
    Idle Transition Power State   (ITPS): 3
    .................
    Entry[ 2]
    .................
    Idle Time Prior to Transition (ITPT): 750 ms
    Idle Transition Power State   (ITPS): 3
    .................
    Entry[ 3]
    .................
    Idle Time Prior to Transition (ITPT): 2500 ms
    Idle Transition Power State   (ITPS): 4
    .................
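
For what it's worth, the ITPT values above look consistent with the kernel idling for roughly 50 times the combined entry and exit latency before dropping into a state (15000 us gives 750 ms for ps 3, 50000 us gives 2500 ms for ps 4). As far as I understand it, the kernel only programs a state into the APST table when enlat + exlat fits within `nvme_core.default_ps_max_latency_us`. A rough sketch of that check, assuming the `ps` lines keep the layout shown above (`ps_latency_report` is a made-up helper name, and the 100000 us default is my assumption about the kernel default):

```shell
# Hypothetical helper: read `nvme id-ctrl` power state lines on stdin and
# report, for each non-operational state, enlat+exlat and whether that total
# fits within a latency budget in microseconds (first argument).
ps_latency_report() {
    budget_us="${1:-100000}"   # assumed kernel default; check your own system
    awk -v budget="$budget_us" '
        /^ *ps +[0-9]+ *:/ && /non-operational/ {
            state = $2
            enlat = 0
            exlat = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^enlat:/) { split($i, a, ":"); enlat = a[2] }
                if ($i ~ /^exlat:/) { split($i, a, ":"); exlat = a[2] }
            }
            total = enlat + exlat
            printf "ps%s total=%dus %s\n", state, total, (total <= budget ? "allowed" : "blocked")
        }'
}
```

Something like `nvme id-ctrl /dev/nvme0 | ps_latency_report 20000` should then show which states a given budget would still allow. A budget of 0 rules out every state with a non-zero latency.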
 
I think I may be near the end of the journey: https://git.launchpad.net/~ubuntu-k.../?id=47add9f75714fabd3702dca0e5899a56d2f3ee2f

Essentially, it seems some of the deeper power states don't work reliably on some SSDs under Linux, and there's a quirk patch to work around it.

That said, the annoying part is working out how to reproduce this error reliably, so that I can be confident in the solution.
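
Short of a reliable reproducer, one thing that can at least be tracked is how often the reset message shows up in the kernel log. A minimal sketch (`count_nvme_resets` is just a name I made up, and it only matches the exact message text from the first post):

```shell
# Hypothetical helper: count kernel log lines on stdin that contain the
# NVMe "controller is down" reset message.
count_nvme_resets() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so keep the exit status clean.
    grep -c 'controller is down; will reset' || true
}
```

Usage would be along the lines of `dmesg | count_nvme_resets`, or `journalctl -k -b -1 | count_nvme_resets` for the previous boot. The latter assumes the journal is persistent and was still writable when the drive dropped out, which may not hold if the NVMe is the root disk.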

I've sent an email to the linux-nvme mailing list offering to test the quirk: http://lists.infradead.org/pipermail/linux-nvme/2022-May/thread.html

There's also a fairly recent thread on the same topic, but for a Seagate FireCuda 530: http://lists.infradead.org/pipermail/linux-nvme/2022-May/031923.html

I just worked out how to search the linux-nvme mailing list. You can see there are loads of reports on this: https://lore.kernel.org/linux-nvme/?q="controller+is+down"
 
Hi all,

I'm not a Proxmox user, but I just registered because this thread led me to the solution after my WD SN850 started "failing" under Ubuntu 20.04 with the 5.18 Liquorix kernel. I figured it was a power management issue, which I solved by deactivating PCIe APM in the BIOS and using the following kernel parameters at boot:

Code:
acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Not all of those may be necessary, but I haven't bothered to narrow it down any further yet. I'm just happy I can still access my drive 5 min after boot! :)

Thanks for the pointers, especially @marcosscriven, and I hope you got the issue sorted yourself!
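
For anyone narrowing this down, it helps to confirm which of the three parameters actually reached the running kernel. A small sketch that reads the kernel command line (`check_nvme_power_params` is a hypothetical helper name; the optional file argument is only there so it can be tried against a saved copy):

```shell
# Hypothetical helper: show which of the three parameters above appear on the
# kernel command line. Reads /proc/cmdline unless another file is given.
check_nvme_power_params() {
    cmdline_file="${1:-/proc/cmdline}"
    for param in nvme_core.default_ps_max_latency_us pcie_aspm acpi_enforce_resources; do
        # One argument per line, then keep the value of the last match.
        value="$(tr ' ' '\n' < "$cmdline_file" | sed -n "s/^${param}=//p" | tail -n 1)"
        echo "${param}=${value:-<not set>}"
    done
}
```

The value the nvme driver is actually using can also be read from `/sys/module/nvme_core/parameters/default_ps_max_latency_us`. On Ubuntu the parameters would normally go into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `update-grub`; other boot setups will differ.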


Cheers,
r.
 