[SOLVED] Are SSD NVMe issues resurfacing with the latest PVE kernel upgrade?

Unfortunately, I cannot confirm that this issue has been resolved. I updated the kernel and added the command line, but the problem returned after a few days. Currently, 13 of 88 OSDs are down because of it. I would greatly appreciate any further assistance.
Have you tried asking Proxmox directly via a support subscription?
 
Hi,

Unfortunately, I cannot confirm that this issue has been resolved. [...]
The only thing that helped was putting the drives in a USB-M.2 adapter. In an adapter, we had no problems.

Besides that, nothing helped and the drives seemed to fail more often as time went on. We ended up replacing the problem drives with another manufacturer's drives and have had no problems since.
 
[...] We ended up replacing the problem drives with another manufacturer's drives and have had no problems since.
What drives did you switch to?

Also, is there any form of HCL for Proxmox?
 
I hope it is OK to chime in with a problem I am currently experiencing with Kioxia enterprise NVMe drives in an AMD AM5 B650 system (ASRock B650D4U) with the newest kernel on Proxmox 9.

root@pve0:~# uname --all
Linux pve0 6.14.11-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-2 (2025-09-12T09:46Z) x86_64 GNU/Linux

After only a few hours (sometimes just minutes) of use, one or both of the drives in the ZFS mirrors are reported as "controller is down; will reset":

Sep 26 03:24:49 pve0 kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Sep 26 03:24:49 pve0 kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
Sep 26 03:24:49 pve0 kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
Sep 26 03:24:49 pve0 kernel: nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Sep 26 03:24:49 pve0 kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Sep 26 03:24:49 pve0 kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
Sep 26 03:24:49 pve0 kernel: nvme 0000:04:00.0: enabling device (0000 -> 0002)
Sep 26 03:24:49 pve0 kernel: nvme 0000:07:00.0: enabling device (0000 -> 0002)
Sep 26 03:24:49 pve0 kernel: nvme nvme1: Disabling device after reset failure: -19
Sep 26 03:24:49 pve0 kernel: nvme nvme2: Disabling device after reset failure: -19
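
For anyone seeing the same messages: before blaming the hardware, it may be worth checking whether APST (the autonomous power saving the kernel is hinting at) is actually advertised and active on the controller. A minimal sketch with nvme-cli, assuming the package is installed and the device names match yours:

# Does the controller advertise APST, and which power states does it offer?
nvme id-ctrl /dev/nvme1 | grep -E 'apsta|^ps '
# Show the APST table the kernel programmed (feature 0x0c); with
# nvme_core.default_ps_max_latency_us=0 this should come back disabled
nvme get-feature /dev/nvme1 -f 0x0c -H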

After a hard power cycle, the drives show up fine until the next error. The drives affected are these:

/dev/nvme1n1 /dev/ng1n1 22P0A0xxxxxx KCD61VUL3T20 0x1 254.86 GB / 3.20 TB 512 B + 0 B 0106
/dev/nvme2n1 /dev/ng2n1 22P0A0xxxxxx KCD61VUL3T20 0x1 252.50 GB / 3.20 TB 512 B + 0 B 0106

I have followed all the recommended countermeasures, to no avail:

cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs module_blacklist=amdgpu nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
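
One thing worth double-checking on a ZFS-on-root install like this: /etc/kernel/cmdline is only read when the boot entries are regenerated, so the parameters must be written out and the host rebooted before they take effect. A minimal sketch, assuming a systemd-boot/proxmox-boot-tool managed setup:

# Rewrite the boot entries with the updated command line
proxmox-boot-tool refresh
# After rebooting, confirm the running kernel actually received the parameters
cat /proc/cmdline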

The boot ZFS mirror (a pair of Intel Optane drives) is not affected, even though one of the Optanes is attached to the same type of cable.

Is there anything I can do? Getting new drives is not an option I am happy with, and I don't understand why the same type of drive works fine in two other systems with the same CPU but a different motherboard (ASRock DeskMeet X600).
 
I think no one has found the root cause so far.

I only found that it is not a Proxmox problem but a general thing...
(it also happens with Ceph, ZFS, Btrfs... so it has nothing to do with the filesystem)

problem on Unraid

problem on Fedora

problem on Ubuntu

Some people can fix it with the kernel parameters, for some a BIOS update fixes it, and for others only a mainboard or SSD swap helps (somewhere I read that a heatsink for the M.2 drive fixed it, or moving the drive to an M.2 adapter card).
 
I hope it is OK to chime in with a problem I am currently experiencing with Kioxia enterprise NVMe drives in an AMD AM5 B650 system (ASRock B650D4U) with the newest kernel on Proxmox 9.

[...]

Is there anything I can do?
Hi,

With my Samsung NVMe drives and Ceph, a drive firmware update helped to get rid of that issue, and they have been running stable for around two months now. Maybe this could help here as well.
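
For anyone wanting to try the same route, nvme-cli can show the installed firmware revisions and perform the update itself. The sketch below is only illustrative: the image file name is hypothetical, and the correct slot and commit action are vendor-specific, so check the drive's documentation first.

# Which firmware revisions occupy which slots?
nvme fw-log /dev/nvme0
# Transfer a vendor-supplied image to the controller (file name is made up here)
nvme fw-download /dev/nvme0 --fw=drive_firmware.bin
# Activate the image (slot 1, action 1 are examples; values depend on the drive)
nvme fw-commit /dev/nvme0 --slot=1 --action=1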

best regards
Steffen
 
Hi Team,

After conducting intensive tests with several 4TB 990 Pro SSDs with heatsinks, the issue now seems to be resolved. Here is my functional configuration:
  • Kernel Linux 6.14.5-1-bpo12-pve
  • GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
Can others confirm so that we can flag this thread as "[solved]"?

I've been trying this too, but turning off power management for all of PCIe is of course more of a workaround than a real solution.
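
For anyone replicating the workaround on a GRUB-booted install (the quoted post uses GRUB_CMDLINE_LINUX_DEFAULT), a minimal sketch of how it is applied; on systemd-boot/ZFS setups the parameters go into /etc/kernel/cmdline instead:

# /etc/default/grub, edited by hand:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
update-grub
# After the reboot, verify the parameters actually reached the kernel
cat /proc/cmdline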

Interestingly, I just discovered there is a new firmware version for the Samsung 990 PRO out since September, and while the release notes are Windows-centric and lack details, it does sound promising:

*(7B2QJXD7) To address the intermittent non-recognition and blue screen issue. (Release: September 2025)

https://semiconductor.samsung.com/consumer-storage/support/tools/

I updated one drive today and the update seems fine so far, but in my case it will take at least several weeks of stability before I dare begin to hope.
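
A quick way to confirm which revision a drive is actually running after the update, sketched with nvme-cli (device name is an example):

# The FW Rev column should show 7B2QJXD7 once the new image is active
nvme list
# Or query a single controller directly
nvme id-ctrl /dev/nvme0 | grep '^fr '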
 
My configuration:

pve-manager/8.4.14/b502d23c55afcba1 (running kernel: 6.8.12-15-pve)

4x Minisforum 890 PRO: 96 GB RAM + 2x 1 TB SSD for Proxmox + 4 TB for OSD on each machine.

Samsung 990 PRO 4TB randomly drops out. If one OSD drops out, another one drops out shortly after.

The disk disappears from the system, and only a restart helps. I'm not sure whether it is a hardware problem, but probably yes.

I haven't found any pattern. The disks are brand new, with 3% wear. The drive temperature is about 32°C, and the room they sit in is cold, about 16-17°C. The fans on the computers are running at minimum speed.
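
Since wear and temperature are being ruled out here, it might help to capture the controller's own health view right after a drop-out, while it still answers. A small sketch with nvme-cli (device name is an example):

# Composite temperature, percentage used (wear) and media/error counters
nvme smart-log /dev/nvme0
# Any error log entries the controller recorded before dropping out
nvme error-log /dev/nvme0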

I'm going to update the kernel and will let you know the result.

Thanks Tom
 
[...]

Samsung 990 PRO 4TB randomly drops out. If one OSD drops out, another one drops out shortly after.

The disk in the computer disappears, only a restart helps. I'm not sure if it has to be hardware, but probably yes.

[...]

Did you see the two solutions posted previously in my comment (https://forum.proxmox.com/threads/a...-latest-pve-kernel-upgrade.163503/post-808000)? One is the workaround using kernel parameters (which I just quoted from earlier commenters), and more recently there is also the firmware upgrade, which seems to be intended to fix the issue.

So far, firmware 7B2QJXD7 seems to be running stable for me, but it has only been a week and a half.