[SOLVED] Are SSD NVMe issues resurfacing with the latest PVE kernel upgrade?

Unfortunately, I cannot confirm that this issue has been resolved. I updated the kernel and added the command line, but the problem returned after a few days. Currently, 13 of 88 OSDs are down due to this issue. I would greatly appreciate any further assistance.
Have you tried asking Proxmox directly via a support subscription?
 
Hi,

The only thing that helped was putting the drives in a USB-M.2 adapter. In an adapter, we had no problems.

Besides that, nothing helped and the drives seemed to fail more often as time went on. We ended up replacing the problem drives with another manufacturer's drives and have had no problems since.
 
What drives did you switch to?

Also is there any form of HCL for proxmox?
 
I hope it is ok to chime in with a problem I am currently experiencing with Kioxia enterprise NVMe drives in an AMD AM5 B650 system (Asrock B650D4U) with the newest kernel on Proxmox 9

root@pve0:~# uname --all
Linux pve0 6.14.11-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-2 (2025-09-12T09:46Z) x86_64 GNU/Linux

After only a few hours (sometimes just minutes) of use, one or both of the drives in the ZFS mirrors are reported as "controller is down; will reset":

Sep 26 03:24:49 pve0 kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Sep 26 03:24:49 pve0 kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
Sep 26 03:24:49 pve0 kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
Sep 26 03:24:49 pve0 kernel: nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Sep 26 03:24:49 pve0 kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Sep 26 03:24:49 pve0 kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
Sep 26 03:24:49 pve0 kernel: nvme 0000:04:00.0: enabling device (0000 -> 0002)
Sep 26 03:24:49 pve0 kernel: nvme 0000:07:00.0: enabling device (0000 -> 0002)
Sep 26 03:24:49 pve0 kernel: nvme nvme1: Disabling device after reset failure: -19
Sep 26 03:24:49 pve0 kernel: nvme nvme2: Disabling device after reset failure: -19

After a hard power cycle, the drives show up fine until the next error. The drives affected are these:

/dev/nvme1n1 /dev/ng1n1 22P0A0xxxxxx KCD61VUL3T20 0x1 254.86 GB / 3.20 TB 512 B + 0 B 0106
/dev/nvme2n1 /dev/ng2n1 22P0A0xxxxxx KCD61VUL3T20 0x1 252.50 GB / 3.20 TB 512 B + 0 B 0106

I have followed all the recommended countermeasures - to no avail:

cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs module_blacklist=amdgpu nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

The boot ZFS mirror (a pair of Intel Optane drives) is not affected, even though one of those drives is attached with the same type of cable.

Is there anything I can do? "Get new" drives is not an option that I am very happy with and I don't understand why the same type of drive is working fine in two other systems on the same CPU but a different motherboard (Asrock DeskMeet X600).
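
A quick way to rule out one failure mode is to confirm that the parameters from /etc/kernel/cmdline actually reached the running kernel. As far as I know, on Proxmox installs that boot through proxmox-boot-tool the file is only applied after `proxmox-boot-tool refresh` and a reboot, while GRUB installs need `update-grub`. The sketch below is a minimal check; the function name `missing_params` and the variable `WANTED` are made up for illustration.

```shell
#!/bin/sh
# Print every expected parameter that is absent from a kernel command line.
# Usage: missing_params "<cmdline string>" param1 param2 ...
missing_params() {
    cmdline=$1
    shift
    for want in "$@"; do
        found=no
        # Split the command line on whitespace and compare whole tokens,
        # so a partial match does not count as present.
        for have in $cmdline; do
            if [ "$have" = "$want" ]; then
                found=yes
                break
            fi
        done
        [ "$found" = yes ] || printf '%s\n' "$want"
    done
    return 0
}

# The three parameters suggested by the kernel message in the log above.
WANTED="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

# Check the command line the running kernel was actually booted with.
if [ -r /proc/cmdline ]; then
    # $WANTED is left unquoted on purpose so it splits into three arguments.
    missing_params "$(cat /proc/cmdline)" $WANTED
fi
```

No output means all three parameters are active. Any line printed names a parameter that is set in the config file but did not make it into the booted kernel.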
 
I think no one has found the root cause so far.

I only found that it is not a Proxmox problem but a general one...
(it also happens with Ceph, ZFS, Btrfs... so it has nothing to do with the filesystem)

problem on unraid

problem on fedora

problem on ubuntu

Some people can fix it with the kernel parameters, for some a BIOS update fixes it, and for others only a mainboard or SSD swap helps (somewhere I read that a heat sink for the M.2 drive fixed it, or moving the drive from the M.2 slot to an adapter card).
 
Hi,

With my Samsung NVMe drives and Ceph, a drive firmware update helped to get rid of that issue, and they have been running stably for around 2 months now. Maybe this could help here as well.

best regards
Steffen
 
Hi Team,

After conducting intensive tests with several 4TB 990 Pro SSDs with heatsinks, the issue now seems to be resolved. Here is my functional configuration:
  • Kernel Linux 6.14.5-1-bpo12-pve
  • GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
Can others confirm so that we can flag this thread as "[solved]"?

I've been trying this too, but turning off power management for all of PCIe is of course more of a workaround, not a real solution.

Interestingly, I just discovered that a new firmware version for the Samsung 990 PRO has been out since September, and while the release notes are Windows-centric and lack details, it does sound promising:

*(7B2QJXD7) To address the intermittent non-recognition and blue screen issue. (Release: September 2025)

https://semiconductor.samsung.com/consumer-storage/support/tools/

Updated one drive today and the update seems fine so far but in my case it will take at least several weeks of stability before I dare begin to hope.
 
My configuration:

pve-manager/8.4.14/b502d23c55afcba1 (running kernel: 6.8.12-15-pve)

4x Minisforum 890 PRO: 96 GB RAM + 2x SSD 1TB for proxmox + 4TB for OSD on each machine.

Samsung 990 PRO 4TB randomly drops out. If one OSD drops out, another one drops out shortly after.

The disk in the computer disappears, only a restart helps. I'm not sure if it has to be hardware, but probably yes.

I haven't found any connection. The disks are brand new, with 3% wear. The temperature is about 32°C. It's cold, they are in a room where the temperature is about 16-17°C. The fans on the computers are running at minimum speed.

I'm going to update the kernel, so I'll let you know what the result is.

Thanks Tom
 
[...]

Samsung 990 PRO 4TB randomly drops out. If one OSD drops out, another one drops out shortly after.

The disk in the computer disappears, only a restart helps. I'm not sure if it has to be hardware, but probably yes.

[...]

Did you see the two solutions posted previously in my comment (https://forum.proxmox.com/threads/a...-latest-pve-kernel-upgrade.163503/post-808000)? One is the workaround using kernel parameters (which I just quoted from earlier commenters), and more recently there is also a firmware upgrade which seems intended to fix the issue.

So far firmware 7B2QJXD7 seems to be running stably for me, but it has only been a week and a half.
 
I still saw a similar error after upgrading to firmware 7B2QJXD7. However, since I applied the options mentioned by YAGA, I have not seen any related errors in the last 10 days.
 
As I couldn't boot from the ISO (virtual media on the IPMI would require an additional license), I copied the initrd from the ISO and extracted the files from it:
zcat ../initrd | cpio -idmv
I then copied the fumagician folder to /root/ and ran it from there. After the firmware has been uploaded, the server needs a reboot (or, maybe even better, a shutdown for a cold reset) to load the new firmware.
Let's see if the system will stay stable now.
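
To confirm after the reboot that the drive really runs the new firmware, the revision can be read back from the controller. This is a minimal sketch assuming nvme-cli is installed and that the drive is /dev/nvme0 (a placeholder, adjust to your system). The helper name `firmware_rev` is made up for illustration.

```shell
#!/bin/sh
# Extract the firmware revision ("fr" field) from `nvme id-ctrl` text output
# read on stdin. Matches a line such as "fr        : 7B2QJXD7".
firmware_rev() {
    sed -n 's/^fr[[:space:]]*:[[:space:]]*\([^[:space:]]*\).*$/\1/p'
}

# Alternative without nvme-cli: the kernel exposes model and firmware
# revision per controller in sysfs. Nothing is printed if no NVMe
# controller is present.
for ctrl in /sys/class/nvme/nvme*; do
    [ -r "$ctrl/firmware_rev" ] || continue
    printf '%s %s %s\n' "${ctrl##*/}" \
        "$(cat "$ctrl/model")" "$(cat "$ctrl/firmware_rev")"
done
```

With nvme-cli, the helper would be used as `nvme id-ctrl /dev/nvme0 | firmware_rev`. The firmware column of `nvme list` shows the same value.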
 
I am using the following hardware and am having trouble with the NVMes that are attached via the PCIe to NVMe adapters:

MINISFORUM MS-01-S1260 Mini PC with Intel Core i5-12600H
NVMe to PCIe Adapter: ICY DOCK M.2 NVMe SSD to PCIe 3.0/4.0 x4
NVMe OSD: MZ1LB1T9HALS Samsung PM983 1.92TB NVMe PCIe M.2 22110 SSD MZ-1LB1T90


PVE 9.1.1
Kernel: 6.17.2-1-pve

I tried the following GRUB settings, but these NVMe drives are still dropping out:


Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

Are there any other suggestions?

P.S. I cannot find the firmware to download anywhere online, even though I have the toolkit, so I can't do that.
 
Hi,
seems I've run into the same issue.

I've got 3 Proxmox nodes, all the same configuration:
  • AMD Ryzen 7 5700X 8-Core Processor
  • Mainboard Asrock Rack X570DVU
  • vm pool: mirror of two WD Red SN700 4000GB, firmware version 11C120WD
The three nodes have been running for the last three years without problems.

Now two nodes have been upgraded to kernel 7.0.2-7, while the last one is still running 6.17.13-9 (I forgot to reboot it). Last week, the nodes with kernel 7.0.2 showed "Fatal or unknown test error" in the SMART extended test on three disks. As nothing problematic could be found in the SMART data/log, I wanted to look into this further this week (disks are expensive nowadays).
During the weekend: boom, the three disks completely stopped working. One node went down; the second node kept running with a degraded pool.

Log from node1:
Code:
2026-06-07T00:24:31.732060+02:00 node1 kernel: nvme nvme0: I/O tag 443 (41bb) opcode 0x9 (I/O Cmd) QID 5 timeout, aborting req_op:DISCARD(3) size:102400
2026-06-07T00:24:31.732072+02:00 node1 kernel: nvme nvme0: I/O tag 287 (911f) opcode 0x9 (I/O Cmd) QID 7 timeout, aborting req_op:DISCARD(3) size:40960
2026-06-07T00:24:31.732074+02:00 node1 kernel: nvme nvme1: I/O tag 587 (624b) opcode 0x9 (I/O Cmd) QID 7 timeout, aborting req_op:DISCARD(3) size:40960
2026-06-07T00:24:31.732075+02:00 node1 kernel: nvme nvme1: I/O tag 131 (f083) opcode 0x9 (I/O Cmd) QID 9 timeout, aborting req_op:DISCARD(3) size:102400
2026-06-07T00:25:01.940062+02:00 node1 kernel: nvme nvme1: I/O tag 587 (624b) opcode 0x9 (I/O Cmd) QID 7 timeout, reset controller
2026-06-07T00:25:01.944053+02:00 node1 kernel: nvme nvme0: I/O tag 443 (41bb) opcode 0x9 (I/O Cmd) QID 5 timeout, reset controller
2026-06-07T00:26:14.914058+02:00 node1 kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
2026-06-07T00:26:14.914069+02:00 node1 kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
2026-06-07T00:26:14.922053+02:00 node1 kernel: nvme nvme0: Abort status: 0x371
2026-06-07T00:26:14.922058+02:00 node1 kernel: nvme nvme0: Abort status: 0x371
2026-06-07T00:26:14.930051+02:00 node1 kernel: nvme nvme1: Abort status: 0x371
2026-06-07T00:26:14.930056+02:00 node1 kernel: nvme nvme1: Abort status: 0x371
2026-06-07T00:26:24.929069+02:00 node1 kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
2026-06-07T00:26:24.938157+02:00 node1 kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
2026-06-07T00:34:58.661057+02:00 node1 kernel: nvme nvme0: Disabling device after reset failure: -19
2026-06-07T00:34:58.957106+02:00 node1 kernel: I/O error, dev nvme0n1, sector 25008 op 0x3:(DISCARD) flags 0x800800 phys_seg 1 prio class 2
2026-06-07T00:34:59.211058+02:00 node1 kernel: nvme nvme1: Disabling device after reset failure: -19
2026-06-07T00:52:59.540116+02:00 node1 smartd[1565]: Device: /dev/nvme0, open() of NVMe device failed: Resource temporarily unavailable
2026-06-07T00:52:59.567516+02:00 node1 smartd[1565]: Device: /dev/nvme1, open() of NVMe device failed: Resource temporarily unavailable
2026-06-07T01:22:59.618617+02:00 node1 smartd[1565]: Device: /dev/nvme0, open() of NVMe device failed: Resource temporarily unavailable
2026-06-07T01:22:59.618706+02:00 node1 smartd[1565]: Device: /dev/nvme1, open() of NVMe device failed: Resource temporarily unavailable
2026-06-07T01:52:59.652925+02:00 node1 smartd[1565]: Device: /dev/nvme0, open() of NVMe device failed: Resource temporarily unavailable
2026-06-07T01:52:59.653088+02:00 node1 smartd[1565]: Device: /dev/nvme1, open() of NVMe device failed: Resource temporarily unavailable
2026-06-07T02:22:59.708403+02:00 node1 smartd[1565]: Device: /dev/nvme0, open() of NVMe device failed: Resource temporarily unavailable

After power off and reboot the pools are running again (for now).
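
To see how often each controller drops out, the kernel log can be summarised per device. The sketch below counts the "Disabling device after reset failure" lines shown in the log above. The function name `count_reset_failures` is made up for illustration, and reading a previous boot assumes the journal is persistent.

```shell
#!/bin/sh
# Count "Disabling device after reset failure" events per NVMe controller.
# Reads kernel log lines on stdin and prints "<controller> <count>",
# sorted by controller name.
count_reset_failures() {
    # Keep only the controller name from matching lines.
    sed -n 's/.*nvme \(nvme[0-9][0-9]*\): Disabling device after reset failure.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

A possible invocation is `journalctl -k -b -1 | count_reset_failures` for the previous boot, or `journalctl -k | count_reset_failures` for the current one.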
 
One other thing I see here in common:
- Samsung/WD "consumer" drives
- Adapters (M2 to NVMe; SFF to U2)

Also, just setting the kernel parameters to disable power management is often not enough: the EFI firmware also handles this and could be overriding your kernel settings, so go into your motherboard settings and make sure that any power management options for the PCIe buses are turned off there as well.
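
Whether ASPM is really off on a given link can be checked from the running system. As far as I understand the kernel documentation, `pcie_aspm=off` tells the kernel not to touch the ASPM configuration at all, so whatever the firmware set up stays in place. The sketch below reads the state from `lspci -vv` output. The helper name `aspm_states` is made up for illustration, and the address 04:00.0 in the usage note is only a placeholder.

```shell
#!/bin/sh
# Print the ASPM state from each "LnkCtl:" line of `lspci -vv` output
# read on stdin, e.g. "Disabled" or "L1 Enabled".
aspm_states() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Show the kernel's ASPM policy, if the kernel exposes it.
if [ -r /sys/module/pcie_aspm/parameters/policy ]; then
    cat /sys/module/pcie_aspm/parameters/policy
fi
```

Run as root, `lspci -vv -s 04:00.0 | aspm_states` prints the state for one device. If it still reports a state such as "L1 Enabled" with the kernel parameters set, the firmware setting is the next place to look.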
 
The Samsung 990 Pro SSDs are fixed by a firmware upgrade:
NVMe SSD-990 PRO Series Firmware
*(8B2QJXD7) To improve read-operation stability. (Release : December 2025)
*(7B2QJXD7) To address the intermittent non-recognition and blue screen issue. (Release: September 2025)

If you can't run Magician Windows software, you can download bootable ISOs to upgrade your Samsung consumer SSDs here:
https://semiconductor.samsung.com/consumer-storage/support/tools/
 
I've booted my node1 with kernel 6.17.13 and SMART extended test still fails with "Fatal or unknown test error" on both WD Red SN700.

So I replaced one of the WD Red SN700 drives with a Lexar NM1090 PRO and booted the 7.0.2 kernel again. The pool resilvered, and the SMART extended test now completes without error on both SSDs (WD and Lexar).

I put the WD Red in a Windows PC to run the Sandisk Dashboard tool. The tool says the firmware on my WD Red is the most recent version (11C120WD). It also ran a SMART extended test, which completed without error.

My node3 still runs without issues with two WD Red drives, but I will replace one of them with an alternative (Lexar or Crucial) on all nodes.
 
With the usual workaround "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off", things were stable for me for the last half year on the 6.x kernels.

But with the recent (last two months?) 7.x kernels, my SSDs were dropped roughly every other day.

There was even a kernel where "pcie_aspm=off" made the system take forever to boot, with timeouts and errors.

So I think there are still some unhappy combinations of PCI devices and the kernel's power management.

At the same time, I don't see the same issues on the Windows side, which makes me assume that either the Windows drivers have workarounds or PCI power management works differently there.
Maybe it boils down to UEFI implementations that are more optimized for Windows, and Linux trips over a not-so-clean UEFI.