[SOLVED] Are SSD NVMe issues resurfacing with the latest PVE kernel upgrade?

YAGA

Hi Team,

At the end of 2023, several users reported issues with unexplained loss of access to NVMe SSDs, particularly Samsung 990 Pro NVMe SSDs, which I have. One or more NVMe SSDs suddenly disconnected and were no longer detected by Linux.

The server had to be powered off and then powered back on to detect the NVMe SSD; a simple reboot was insufficient.

The solution was to add `nvme_core.default_ps_max_latency_us=0` in GRUB as follows:

GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0"

Then, update GRUB with `update-grub` before rebooting.
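Note that on hosts booted via proxmox-boot-tool / systemd-boot (e.g. ZFS on root in UEFI mode) rather than GRUB, the same parameter goes into /etc/kernel/cmdline instead; a minimal sketch, where the root= portion is only a typical placeholder and should be left as it is found on your system:

# /etc/kernel/cmdline -- a single line; append the parameter to the existing options
root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet nvme_core.default_ps_max_latency_us=0

# write the updated command line to all configured boot entries, then reboot
proxmox-boot-tool refresh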

Subsequently, regular Linux kernel updates in 2024 completely resolved the issue, with no defects reported for a year.

However, in early 2025, the problem suddenly reappeared, likely due to the latest updates of PVE Community Edition.

This is not a hardware failure, as the issue occurs randomly on different servers with various NVMe SSDs. The more the NVMe SSDs are used (e.g., for backups), the more frequently the failure occurs.

I verified that the GRUB parameter was still in effect:

cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
0
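For completeness, /proc/cmdline can also be checked, which confirms the option was actually passed at boot:

cat /proc/cmdline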

Here is the kernel version used:

Linux mars 6.8.12-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-8 (2025-01-24T12:32Z) x86_64 GNU/Linux

All these NVMe SSDs are configured as Ceph BlueStore OSDs. When a fault occurs, Ceph reports that 'daemons have recently crashed'; I believe this is a consequence rather than the cause.

The first fault occurred at the end of February, a few days after the kernel update.

Am I the only one experiencing this issue again?

Although I'm not entirely sure it's solely a kernel issue, what would be the most prudent method to roll back the kernel?

Which kernel version would be the most reliable?

Any suggestions are welcome.
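
For reference, one possible way to go back to an older kernel on Proxmox is to pin it with proxmox-boot-tool; this is only a sketch, and the exact package name below is an assumption (older kernels ship as pve-kernel-* or proxmox-kernel-* depending on the release, so check with apt search first):

# list the kernels the boot loader knows about
proxmox-boot-tool kernel list

# reinstall an older kernel if it has already been removed (package name is an example)
apt install proxmox-kernel-6.8.8-2-pve-signed

# pin it so the node keeps booting it across updates, then reboot
proxmox-boot-tool kernel pin 6.8.8-2-pve
reboot

# later, remove the pin to return to the newest installed kernel
proxmox-boot-tool kernel unpin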

Kind regards,
 
I've had nothing but trouble with certain datacenter-class NVMe U.2 drives (they drop out, usually within 24 hours):
  • Intel P4600 6.4TB
  • Micron 9200 MAX 6.4TB
Both of them work great under ESXi, but randomly drop out in Proxmox. Kernel revisions during the trouble were floating around 6.8.12-0 to 6.8.12-3 (I think). I tried everything I could find. Two posts in this semi-related thread here, around October 2024 (maybe one of the threads you already found):


It's very, very annoying, since I have these wonderful drives just sitting in a drawer waiting for evidence that I can use them again.

Now, what I am currently running with great success *ARE* Samsung 990 Pro drives (1x 2TB and 1x 4TB). Not a lick of trouble with them. Uptime in excess of 30 days... I rarely run longer than that for various reasons, mostly other hardware changes, as I am still building the system.
 
Last edited:
I am using Arch Linux and I had a drive disappearing after waking up from sleep. Previously I fixed it by setting `nvme_core.default_ps_max_latency_us=0`, but after updating to the latest kernel it started to occur again. It seems like `nvme_core.default_ps_max_latency_us` doesn't work any more.
 
I've had nothing but trouble with certain datacenter-class NVMe U.2 drives (they drop out, usually within 24 hours):
  • Intel P4600 6.4TB
  • Micron 9200 MAX 6.4TB
Both of them work great under ESXi, but randomly drop out in Proxmox. Kernel revisions during the trouble were floating around 6.8.12-0 to 6.8.12-3 (I think). I tried everything I could find. Two posts in this semi-related thread here, around October 2024 (maybe one of the threads you already found):

I agree, the bug appeared at some point in kernel version 6.8.12-x and it is still present in version 6.8.12-8.

I wonder whether it is better to downgrade to a kernel before 6.8.12-x, such as 6.8.8-2, or to upgrade to kernel 6.11.11-1, which is currently being tested.


Yes, thank you, I had also contributed to this post.

It's very, very annoying, since I have these wonderful drives just sitting in a drawer waiting for evidence that I can use them again.

Now, what I am currently running with great success *ARE* Samsung 990 Pro drives (1x 2TB and 1x 4TB). Not a lick of trouble with them. Uptime in excess of 30 days... I rarely run longer than that for various reasons, mostly other hardware changes, as I am still building the system.

According to user feedback, the problem is more frequent with Samsung 990 Pro drives. It's a very strange bug if you are seeing the opposite.
 
I am using Arch Linux and I had a drive disappearing after waking up from sleep. Previously I fixed it by setting `nvme_core.default_ps_max_latency_us=0`, but after updating to the latest kernel it started to occur again. It seems like `nvme_core.default_ps_max_latency_us` doesn't work any more.

It does indeed seem like the parameter `nvme_core.default_ps_max_latency_us=0` is no longer being taken into account after the latest updates, even though after booting we do get 0:

cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

returns the value 0.

Maybe default_ps_max_latency_us is no longer properly managed by the kernel.
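One way to double-check whether APST is actually disabled on a given drive, and not just the module parameter set, is to query the feature directly with nvme-cli (assuming the nvme-cli package is installed; /dev/nvme0 is just an example device):

# feature 0x0c is Autonomous Power State Transition; it should show up as disabled
nvme get-feature /dev/nvme0 -f 0x0c -H

# the controller's APST capability and power states, for reference
nvme id-ctrl /dev/nvme0 | grep -i -e apsta -e '^ps '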

I want to test kernel 6.11.11-1 because I saw that there are improvements in the handling of NVMe SSDs, particularly in interrupt handling, starting from kernel 6.10 (https://kernelnewbies.org/Linux_6.10).
 
Last edited:
As discussed, I have updated the kernel to version 6.11.11-1-pve on each node.

So far, everything is working.

I will keep you informed after a few days if this kernel version no longer causes bugs with the NVMe SSDs.
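
For anyone wanting to do the same: 6.11 is an opt-in kernel on PVE 8.x, so something like this should be all that is needed (package name as I understand it; double-check with apt search proxmox-kernel):

apt update
apt install proxmox-kernel-6.11
reboot

# confirm after the reboot
uname -r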
 
  • Like
Reactions: marcio79
As discussed, I have updated the kernel to version 6.11.11-1-pve on each node.

So far, everything is working.

I will keep you informed after a few days if this kernel version no longer causes bugs with the NVMe SSDs.
6.11 is gold my friend
 
@marcio79

I still have the issue of NVMe SSDs disappearing after a few days of operation. This happens during intensive use, such as backups.

I also updated the BIOS to the latest version, including AGESA 1.2.0.C. I get the same errors.

I'm stuck; I don't know what else to try.

Any advice is welcome.
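
When a drive drops the next time, the kernel log around the event is usually the most useful thing to capture; a few generic commands for that (assuming nvme-cli and smartmontools are installed, device names are examples):

# kernel messages mentioning nvme: controller resets, timeouts, device removal, ...
journalctl -k -b | grep -i nvme
dmesg -T | grep -i -e nvme -e pcie

# which NVMe devices are currently visible
nvme list

# health and error counters of a drive that is still present
nvme smart-log /dev/nvme0
smartctl -a /dev/nvme0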
 
I still have the issue of NVMe SSDs disappearing after a few days of operation. This happens during intensive use, such as backups.
I don't know how you have those NVMe drives connected to the system, but I would try a different controller/slot/connection/bus/solution with these drives.
If the problem still persists, I would try changing the drives themselves.
 
They are inserted into M.2 slots on the motherboard. I'll try testing with a PCIe-to-M.2 adapter in a PCIe slot.
 
Last edited:
Hi Team,

Before connecting the SSDs with an M.2 adapter in a PCIe slot, I noticed that the disconnection of the SSDs occurs only under very specific conditions.

My configuration is based on 4 nodes and 1 QDevice, with the latest PVE Community Edition updates including Ceph Squid 19.2:

  • 1 SSD on each node for PVE
  • 2 SSDs on each node for CephRBD
  • 3 HDDs on each node for CephFS
  • All VMs use High Availability (HA)
  • 1 PBS server or 1 NFS server or CephFS for backups
  • The VM disks are located on CephRBD

I have never had SSD disconnections during operation when the VMs are running.

I have not noticed any SSD disconnections when the VMs are stopped before the backup.

I notice frequent and random SSD disconnections when the VMs are running during the backup (snapshot).

Could the random SSD disconnections be related to HA, Ceph, or the backup process itself when the VMs are running?

Regards
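
If the trigger really is backup load on top of running VMs, one thing worth trying before touching the hardware is throttling vzdump itself; a minimal example for /etc/vzdump.conf, with a purely illustrative value:

# /etc/vzdump.conf -- global defaults for vzdump backup jobs
# cap backup bandwidth (KiB/s) so the Ceph OSDs are not driven flat out during snapshots
bwlimit: 102400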
 
I also want to add that I've been having problems with 6.8.12-9 and even 6.11.11-2. When I was using 6.8.12-4 it was somewhat OK. For reference, my disks are Intel D4502 7.68TB U.2 NVMe. Other NVMe drives are fine, however. Maybe I'll revert to an older version such as 6.8.8 and give it a try.
 
I also want to add that I've been having problems with 6.8.12-9 and even 6.11.11-2. When I was using 6.8.12-4 it was somewhat OK. For reference, my disks are Intel D4502 7.68TB U.2 NVMe. Other NVMe drives are fine, however. Maybe I'll revert to an older version such as 6.8.8 and give it a try.
Hello,

Could you please give us more details about your setup: number of nodes, SSDs per node, HA?, Ceph?, and when does the failure occur?

Regards,
 
Hello,

Could you please give us more details about your setup: number of nodes, SSDs per node, HA?, Ceph?, and when does the failure occur?

Regards,
Happy to give some background, and thanks for offering your thoughts…

Nodes: 3
Drive details per node: 1&2) 1x Samsung PM1733a 15.36TB, 1x Optane 905P 380GB as db; 3) 2x Intel D4502 7.68TB (dual link running on single link), 2x Optane P1600X 118GB as db
Ceph setup: replication 3 with min_size 2
Proxmox kernel boot parameters: PCIe ASPM off, NVMe idle power-state latency set to 0
Proxmox kernel versions tried: 6.8.12-7, 6.8.12-9, 6.11.11-2
Failure happens: under heavy load (running Plex and tdarr LXCs using Ceph pool storage, processing unRAID-shared NFS files), usually after 12-24 hours. Only the Intel drives drop; the Samsungs are fine. Cooling is all OK, as all drives are around 40-50°C. The strange thing was that when I had Plex and tdarr running in a Windows VM (with storage on a Ceph pool), it was stable. When I was using the Intel drives on ESXi 6.7u3, they were rock solid.
 
Happy to give some background, and thanks for offering your thoughts…

Nodes: 3
Drive details per node: 1&2) 1x Samsung PM1733a 15.36TB, 1x Optane 905P 380GB as db; 3) 2x Intel D4502 7.68TB (dual link running on single link), 2x Optane P1600X 118GB as db
Ceph setup: replication 3 with min_size 2
Proxmox kernel boot parameters: PCIe ASPM off, NVMe idle power-state latency set to 0
Proxmox kernel versions tried: 6.8.12-7, 6.8.12-9, 6.11.11-2
Failure happens: under heavy load (running Plex and tdarr LXCs using Ceph pool storage, processing unRAID-shared NFS files), usually after 12-24 hours. Only the Intel drives drop; the Samsungs are fine. Cooling is all OK, as all drives are around 40-50°C. The strange thing was that when I had Plex and tdarr running in a Windows VM (with storage on a Ceph pool), it was stable. When I was using the Intel drives on ESXi 6.7u3, they were rock solid.

We have many points in common:

  • The hardware worked well in a previous software configuration
  • Ceph is used for SSDs, is it a CephRBD volume?
  • Several SSDs and several nodes are affected
  • The failure is random but the more the system is loaded, the faster the failure occurs
I noticed that the problem was more frequent when the VMs were running with HA enabled during backups (snapshots).

Each time, I noticed the error message "1 daemons have recently crashed", and one or more OSDs disappeared from one or more nodes. It is necessary to shut down the affected nodes and then restart them to bring the OSDs back online.
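
For what it's worth, before power-cycling a whole node it can be worth checking whether only the OSD daemon died while the disk itself is still visible; roughly:

# what Ceph recorded about the crash
ceph crash ls
ceph crash info <crash-id>

# is the underlying NVMe device still present on the affected node?
nvme list

# if the disk is still there, restarting just the OSD service may be enough
systemctl restart ceph-osd@<osd-id>

# clear the 'daemons have recently crashed' warning once it has been investigated
ceph crash archive-all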

Today I applied the latest updates from the community repo, including Ceph 19.2.1.

Since these updates, I have a new warning message: [WRN] BLUESTORE_SLOW_OP_ALERT: 4 OSD(s) experiencing slow operations in BlueStore

This problem seems to be a known issue with Ceph 19.2.1: https://github.com/rook/rook/discussions/15403
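
According to that discussion, the alert is governed by new thresholds introduced alongside this warning; if it turns out to be noise rather than a real problem, they can apparently be relaxed (the option names below are taken on that assumption, so verify them against the Ceph 19.2.1 documentation):

ceph config set osd bluestore_slow_ops_warn_threshold 10
ceph config set osd bluestore_slow_ops_warn_lifetime 600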

I'll keep you informed,
Regards
 
Hi,

I added 2x Micron 7400 (M.2) as a ZFS special device on kernel 6.8.x and had one of the drives randomly drop offline after 1-2 days.

I think this fixed it for me, in /etc/kernel/cmdline:
pcie_aspm=off pcie_port_pm=off

Now, with 6 days of uptime on kernel 6.14.0-1-pve and still nothing dropped, I would say 6.14 works fine.

(HP MicroServer Gen10, Opteron X3216)

Regards

Edit, to be precise:
Kernels 6.8.x and 6.14.x are stable with the parameters pcie_aspm=off pcie_port_pm=off.
Without them, both kernels drop the NVMe disks after 1-3 days.
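
A quick way to confirm those parameters actually took effect after a reboot (nothing Proxmox-specific, just standard tooling):

# ASPM should be reported as disabled on the NVMe/PCIe links
lspci -vv | grep -i aspm
# and the parameters should appear on the running kernel's command line
cat /proc/cmdline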
 
Last edited:
HP MicroServer Gen10
AFAIK that doesn't have any native M.2 controller/connection. You must be using some PCIe riser card as a controller, or some other shenanigans through something else. (A waste of that M.2 Gen 4 drive!) But my point being: I'm not sure you are justified in commenting on kernel stability with these drives when, in your case, it is probably down to the interfacing you are doing, and not the drives themselves vs. the kernel.

Opteron X3216
What are you running on that dual-core calculator?