[SOLVED] Are SSD NVMe issues resurfacing with the latest PVE kernel upgrade?

Unfortunately, I cannot confirm that this issue has been resolved. I updated the kernel and added the command line, but the problem returned after a few days. Currently, 13 of 88 OSDs are down due to this issue. I would greatly appreciate any further assistance.
Have you tried asking Proxmox directly via a support subscription?
 
Hi,

Unfortunately, I cannot confirm that this issue has been resolved. I updated the kernel and added the command line, but the problem returned after a few days. Currently, 13 of 88 OSDs are down due to this issue. I would greatly appreciate any further assistance.
The only thing that helped was putting the drives in a USB-M.2 adapter. In an adapter, we had no problems.

Besides that, nothing helped and the drives seemed to fail more often as time went on. We ended up replacing the problem drives with another manufacturer's drives and have had no problems since.
 
What drives did you switch to?

Also, is there any form of HCL (hardware compatibility list) for Proxmox?
 
I hope it is OK to chime in with a problem I am currently experiencing with Kioxia enterprise NVMe drives in an AMD AM5 B650 system (Asrock B650D4U) with the newest kernel on Proxmox 9.

root@pve0:~# uname --all
Linux pve0 6.14.11-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-2 (2025-09-12T09:46Z) x86_64 GNU/Linux

After only a few hours (sometimes just minutes) of use, one or both of the drives in the ZFS mirrors are reported as "controller is down; will reset":

Sep 26 03:24:49 pve0 kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Sep 26 03:24:49 pve0 kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
Sep 26 03:24:49 pve0 kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
Sep 26 03:24:49 pve0 kernel: nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Sep 26 03:24:49 pve0 kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Sep 26 03:24:49 pve0 kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
Sep 26 03:24:49 pve0 kernel: nvme 0000:04:00.0: enabling device (0000 -> 0002)
Sep 26 03:24:49 pve0 kernel: nvme 0000:07:00.0: enabling device (0000 -> 0002)
Sep 26 03:24:49 pve0 kernel: nvme nvme1: Disabling device after reset failure: -19
Sep 26 03:24:49 pve0 kernel: nvme nvme2: Disabling device after reset failure: -19

After a hard power cycle, the drives show up fine until the next error. The drives affected are these:

/dev/nvme1n1 /dev/ng1n1 22P0A0xxxxxx KCD61VUL3T20 0x1 254.86 GB / 3.20 TB 512 B + 0 B 0106
/dev/nvme2n1 /dev/ng2n1 22P0A0xxxxxx KCD61VUL3T20 0x1 252.50 GB / 3.20 TB 512 B + 0 B 0106
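
For what it's worth, the drives' own logs might add some detail once they come back after a power cycle. A rough sketch, assuming nvme-cli and smartmontools are installed and that nvme1 is one of the affected controllers as shown above (with a persistent journal, the kernel messages from before the power cycle can also be recovered):

# kernel messages from the previous boot (requires a persistent journal)
journalctl -k -b -1 | grep -i nvme

# controller-side error and health logs for one of the affected drives
nvme error-log /dev/nvme1
nvme smart-log /dev/nvme1
smartctl -a /dev/nvme1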

I have followed all the recommended countermeasures - to no avail:

cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs module_blacklist=amdgpu nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
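
One additional sanity check, assuming the host boots through proxmox-boot-tool (which the use of /etc/kernel/cmdline suggests) and that nvme1 is one of the affected controllers, would be to confirm the parameters actually reached the running kernel and that APST is reported as disabled (if the drive supports APST at all):

# rewrite the boot entries so the edited /etc/kernel/cmdline is picked up, then reboot
proxmox-boot-tool refresh

# after the reboot, the added parameters should show up here
cat /proc/cmdline

# the APST feature (0x0c) should be reported as disabled
nvme get-feature /dev/nvme1 -f 0x0c -H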

The boot ZFS mirror (a pair of Intel Optane drives) is not affected, even though one of them is attached to the same type of cable.

Is there anything I can do? "Get new drives" is not an option I am very happy with, and I don't understand why the same type of drive works fine in two other systems with the same CPU but a different motherboard (Asrock DeskMeet X600).
 
I think no one has found the root cause so far.

I only found that it is not a Proxmox problem but a general thing...
(it also happens with Ceph, ZFS, Btrfs... so it has nothing to do with the filesystem)

problem on Unraid

problem on Fedora

problem on Ubuntu

Some people can fix it with the kernel parameters, for some a BIOS update fixes it, and for others only a mainboard or SSD swap helps (somewhere I read that a heat sink for the M.2 drive fixed it, or moving the drive from the M.2 slot to an adapter card).
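
For anyone trying to narrow it down on their own box, a small diagnostic sketch (assuming nvme-cli is installed, run as root, and reusing the 0000:04:00.0 address and nvme1 device from the log earlier in the thread):

# check whether ASPM is actually disabled on the NVMe link
lspci -vv -s 04:00.0 | grep -i aspm

# check whether the drive runs hot (the heat-sink reports point at a thermal angle)
nvme smart-log /dev/nvme1 | grep -i temperature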
 
I hope it is OK to chime in with a problem I am currently experiencing with Kioxia enterprise NVMe drives in an AMD AM5 B650 system (Asrock B650D4U) with the newest kernel on Proxmox 9. [...]
Hi,

With my Samsung NVMe drives and Ceph, a drive firmware update helped to get rid of that issue, and they have been running stable for around 2 months now. Maybe this could help here as well.
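
In case it helps, a rough sketch of how a firmware check and update can be done with nvme-cli; the file name below is only a placeholder, and the vendor's own update tool or release notes should take precedence for the correct slot and commit action:

# current firmware revision and available firmware slots
nvme id-ctrl /dev/nvme1 | grep '^fr '
nvme fw-log /dev/nvme1

# download a vendor-provided image and commit it (activated on the next reset)
nvme fw-download /dev/nvme1 --fw=/path/to/vendor_firmware.bin
nvme fw-commit /dev/nvme1 --slot=1 --action=1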

best regards
Steffen
 