Are SSD NVMe issues resurfacing with the latest PVE kernel upgrade?

YAGA

Hi Team,

At the end of 2023, several users reported unexplained loss of access to NVMe SSDs, particularly Samsung 990 Pro drives, which is what I use. One or more NVMe SSDs would suddenly disconnect and no longer be detected by Linux.

The server had to be powered off and then powered back on to detect the NVMe SSD; a simple reboot was insufficient.

The solution was to add `nvme_core.default_ps_max_latency_us=0` in GRUB as follows:

GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0"

Then, update GRUB with `update-grub` before rebooting.
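For completeness, on hosts with a ZFS root that boot via systemd-boot instead of GRUB, my understanding is that the equivalent change goes into /etc/kernel/cmdline and is applied with proxmox-boot-tool (the root= value below is only an example):

# /etc/kernel/cmdline holds a single line; append the parameter to it, e.g.:
#   root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet nvme_core.default_ps_max_latency_us=0
proxmox-boot-tool refresh
reboot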

Subsequently, regular Linux kernel updates in 2024 completely resolved the issue, with no defects reported for a year.

However, in early 2025, the problem suddenly reappeared, likely due to the latest updates of PVE Community Edition.

This is not a hardware failure, as the issue occurs randomly on different servers with various NVMe SSDs. The more the NVMe SSDs are used (e.g., for backups), the more frequently the failure occurs.

I verified that the GRUB parameter was still in effect:

cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
0
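As an extra sanity check (not part of the original workaround), the APST state can also be read back from the controller itself with nvme-cli; /dev/nvme0 is just an example device here:

nvme get-feature /dev/nvme0 -f 0x0c -H   # feature 0x0c = Autonomous Power State Transition; should report it disabled
nvme id-ctrl /dev/nvme0 | grep -i apsta  # whether the controller advertises APST support at all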

Here is the kernel version used:

Linux mars 6.8.12-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-8 (2025-01-24T12:32Z) x86_64 GNU/Linux

All of these NVMe SSDs are configured as Ceph BlueStore OSDs. When a fault occurs, Ceph reports that 'daemons have recently crashed'; I believe this is a consequence rather than the cause.
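For what it's worth, here is the kind of evidence I look at to separate cause from consequence when it happens (device names and the crash ID are placeholders):

journalctl -k -b -1 | grep -i nvme      # kernel messages from the boot on which the drive vanished
dmesg | grep -i nvme                    # current boot, in case the controller reset without disappearing
ceph crash ls                           # recent OSD daemon crashes
ceph crash info <crash-id>              # backtrace/details for a specific crash
smartctl -a /dev/nvme0                  # SMART/health data, if the device is still visible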

The first fault occurred at the end of February, a few days after the kernel update.

Am I the only one experiencing this issue again?

Although I'm not entirely sure it's solely a kernel issue, what would be the most prudent method to roll back the kernel?
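In case it helps the discussion: as far as I know, on recent PVE an already-installed older kernel can be pinned with proxmox-boot-tool, so a rollback would look roughly like this (the version string is only an example of one still present on the system):

proxmox-boot-tool kernel list              # show the kernels that are installed
proxmox-boot-tool kernel pin 6.8.12-4-pve  # always boot this version from now on
# proxmox-boot-tool kernel unpin           # later: return to the default (newest) kernel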

Which kernel version would be the most reliable?

Any suggestions are welcome.

Kind regards,