Are SSD NVMe issues resurfacing with the latest PVE kernel upgrade?

I'm not sure you're justified in commenting on kernel stability with these drives when, in your case, it is probably down to the interfacing you are doing rather than the drives themselves versus the kernel.

Well, E3, U.2 and M.2 all connect via NVMe over PCIe; it is just a different "plug". In fact, most of the time the SSDs use the same controller chip regardless of form factor.

And you are correct, this system has no M.2 slots; the SSDs are fitted via a passive PCIe-to-M.2 adapter.

The system only has PCIe v3, and one of the SSDs even has only a single PCIe lane attached.
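In case anyone wants to check what link such an adapter actually negotiates, something along these lines should show it (the PCI address 01:00.0 is just a placeholder for your drive's controller):

  # list NVMe controllers to find their PCI addresses
  lspci | grep -i "non-volatile memory"
  # the LnkSta line shows the negotiated speed and lane width, e.g. "Speed 8GT/s, Width x1"
  sudo lspci -vv -s 01:00.0 | grep -i lnksta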

But even so, they are miles faster (latency!) and cheaper than attaching enterprise SATA SSDs as a special device ;)

Anyway, I had the same symptoms as other people here; I just wanted to throw in another data point that it also happens on different hardware and in a non-Ceph use case.

What are you running on that dual-core calculator?

Mostly a PBS sync target and a little PVE for homelab tinkering.
The calculator is 99% idle and still uses too much power for what it does :)
 
The problem with PCIe -> M.2 converters is that the adapters themselves are relatively dodgy. And then if you use a cheap drive on top of that, in a system so old that some of the caps are probably bulging…

As far as the difference between PCIe, M.2 and U.2 goes, it's a bit more than signaling: there is the issue of power, latency, bandwidth, hot plugging, namespaces, security concerns with VFIO (in some setups), and whether or not your system supports it as a boot target.

If you're concerned about power usage, a newer system, even a low-cost used one, is going to be much more reliable and efficient than that model; depending on where you live, you may recoup the money on the power bill alone. I went from a Nehalem for my home lab to Cascade Lake and the idle power usage dropped from ~100 W to 50 W or even below that, just because the entire system is so much more efficient. I don't really care about that at 4 c/kWh, but you may; even at my cost, I will 'save' enough money for the next system in just a few years.
 
I went from a Nehalem for my home lab to Cascade Lake and the idle power usage dropped from ~100 W to 50 W or even below that
Although this is rather off-topic, if you really care about power consumption (which you should, these days...), you should deploy a mini PC in a home environment to run the essentials etc. 24/7, and you won't even hit 20 W idle. I'm at ~13 W with 3 or 4 VMs and 4 or 5 LXCs.

Just my off-topic (reply) 2 cent rant ....
 
Unfortunately I still see other crashes on the OSD even with Linux 6.14.0-2-pve kernel. :confused:
Just trying to make sure I understand you. Even with 6.14.0, do you still see the same SSD crashing behaviour that inspired this thread?

I've got a similar, if not identical, problem. Certain specific Samsung drives (4 TB 990 Pro with Heatsink) fail exactly as yours do. Some drives of exactly the same model have no issues at all. Initially, the problem drives may have run for months before their first issue, but it seems to get worse with time. The last failure happened 15 hours after the computer was shut down and powered back up.

Updating the drive's firmware hasn't helped. I've played with the link power management policy, both through Linux and through the BIOS, to no avail. I updated the Linux kernel to 6.8.12-10-pve with no impact. Updating to even later kernels was the last option I had left to explore to make the drive usable.
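For anyone who wants to double-check what the drive actually reports after a firmware update, something like this should work (assuming nvme-cli is installed; /dev/nvme0 is just a placeholder):

  # firmware revision currently reported by the controller
  sudo nvme id-ctrl /dev/nvme0 | grep -i "^fr "
  # firmware slot log (which slots hold which revision)
  sudo nvme fw-log /dev/nvme0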

A replacement drive from a different manufacturer arrives today, but I've got at least two of these Samsung drives that are old enough that I have my doubts about the RMA, and I'm not sure I want them replaced with Samsung drives.
 
There are lots of issues reported with the 990 Pro on both Windows and Linux; a firmware update may fix it, but please note these are now old desktop consumer drives, so they may be worn out or have other issues.
 
I've got five of these drives. Three have had no issues. Two have had this issue. One of those is 10 months old, the other is only a couple of months old.

I don't think age is a factor. I do think the problem is ongoing. It's going to be a while before I buy another Samsung SSD.
 
One of the issues was wear-out: people reported 10% wear per month in a desktop environment, plus regular drop-outs and heat issues. Some of it was fixed with firmware, but the flash configuration itself doesn't leave much room for improvement.

It should have a heatsink; make sure it has sufficient air circulation. I had to put my server fans in performance mode. Luckily mine were just for ZFS read cache in PBS, but I still eventually lost both of them.
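If you want to see how hot a drive is running and how much wear it has accumulated, either of these gives you the numbers (assuming nvme-cli or smartmontools is installed; /dev/nvme0 is a placeholder):

  # temperature, percentage used, media errors, unsafe shutdowns, etc.
  sudo nvme smart-log /dev/nvme0
  # the same data via smartmontools
  sudo smartctl -a /dev/nvme0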

I’m personally not enamored with Samsung either, their warranty is non-existent and if you get a replacement it is often a refurbished (used) drive. Their performance when I tried to use them was atrocious.
 
Yup. Over the last 10 months, I thought our issue was each of those problems.

Ours have always been in ZFS mirrors, which saved our bacon. Luckily for us, the two problem drives were never in the same mirror.

I'd like to keep these drives in service. As far as I know, they are fine in Intel motherboards. I may try one of the problem drives as one-half of a ZFS mirror in one of our Intel-based servers. But I've already burned far more time than the two drives are worth.
 
I'd like to keep these drives in service. As far as I know, they are fine in Intel motherboards. I may try one of the problem drives as one-half of a ZFS mirror in one of our Intel-based servers.
If you're happy to help the community with your "testing" of these drives, so be it.

BUT in my book, a drive that may or may not work belongs in the trash, not in a working system, not even as a mirror of a mirror of a cache volume!
 
@SoVeryRural

Interesting

I'm not convinced that this failure is only SSD-related, because I've noticed the failure occurs at the same time, on different nodes, and usually 2 or 3 SSDs (OSDs in Ceph) are affected
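One way to line up the timestamps across nodes, assuming a reasonably recent Ceph release, is the built-in crash list (the crash ID below is a placeholder):

  # list recorded crashes with their timestamps on each node
  ceph crash ls
  # show the details of a specific crash
  ceph crash info <crash-id>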

It might be Ceph, the communication layer between nodes, hardware issues, the kernel, cascading errors...

Could you please describe your setup, and when this issue occurred for the first time?

Regards,
 
I'm not running CEPH. Just a Proxmox cluster of three machines. One has an AMD CPU and that's the one that fails.

Even in the server where the drive fails, I've got a Samsung 990 Pro with Heatsink that has been just fine for almost a year.

It might be triggered by activity, but I have my doubts. Mine can fail anytime. It failed at 2:03 AM today. There was nothing going on then. Backups finished up 25 minutes earlier.
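For what it's worth, the kernel log around the failure time usually tells you whether it was a controller reset or a link drop; something like this, with the timestamps adjusted to the incident, is what I'd look at:

  # kernel messages in a window around the fault (timestamps are placeholders)
  journalctl -k --since "02:00" --until "02:10" | grep -iE "nvme|reset|i/o error"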
 
I'm not running CEPH. Just a Proxmox cluster of three machines. One has an AMD CPU and that's the one that fails.
My servers also have AMD CPUs, a Ryzen with an X470 chipset in my case.

Even in the server where the drive fails, I've got a Samsung 990 Pro with Heatsink that has been just fine for almost a year.
Same here: everything was working fine for months and months, but since February the issues have reappeared.

It might be triggered by activity, but I have my doubts. Mine can fail anytime. It failed at 2:03 AM today. There was nothing going on then. Backups finished up 25 minutes earlier.
Same here: it might be triggered by activity, usually during backups.
 
I just had a look through the emails I get when the ZFS pool has a device fault to see if there was any pattern in the time of day the fault occurs. If there is a pattern, I can't see it; it's all over the place. If anything, it might happen after hours more than during the work day, but I've only got seven samples, so I don't even want to draw that conclusion.
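If you want another source of timestamps besides the emails, ZFS keeps its own event log, which is worth cross-checking:

  # timestamped log of device faults and errors as ZFS saw them
  sudo zpool events -v
  # quick summary of any pool that currently has a problem
  sudo zpool status -x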

I received the replacement for the problem drive yesterday and will swap it out tonight or tomorrow morning. Hopefully, that's the end of the problem.
 

Hi @SoVeryRural

Could you please give us more details on your setup: CPU, motherboard chipset, AGESA version, PCIe speed, M.2 slot or PCIe adapter, and whether C-states are disabled in the BIOS?

Regards,
 
No problem.

The CPU is an AMD Ryzen 9 7900. The chipset is AMD B650, with AGESA ComboAM5 1.2.0.2b. The problem Samsung SSD is in a Gen4 x4 M.2 slot.

I have not disabled C-states in the BIOS, but I have tried setting everything matching /sys/class/scsi_host/host*/link_power_management_policy to max_performance. It didn't help as far as I could tell.
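Worth noting that the scsi_host knob only covers SATA link power management; NVMe drives have their own power saving (APST), which is controlled separately (e.g. via the nvme_core.default_ps_max_latency_us kernel parameter). A rough sketch of both, assuming nvme-cli and /dev/nvme0 as a placeholder:

  # SATA ALPM: force max performance on all SATA hosts (this is what the sysfs path above controls)
  for h in /sys/class/scsi_host/host*/link_power_management_policy; do echo max_performance | sudo tee "$h"; done
  # NVMe power saving (APST) is separate; inspect its current state on the drive
  sudo nvme get-feature -f 0x0c -H /dev/nvme0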

In other news, I replaced the problem drive this morning. ZFS is resilvering at the moment. Hopefully that's the last of the failures.
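For anyone following along, the swap itself is just the usual ZFS replace; the pool and device names below are placeholders:

  # tell ZFS to rebuild onto the new drive, then watch the resilver progress
  sudo zpool replace <pool> <old-device> <new-device>
  sudo zpool status -v <pool>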
 
Coming here after searching for issues with NVMe drives. I have been having repeat issues with a 990 Pro 1 TB in a mini PC. The boot drive is also a 990 Pro 1 TB (mistakes were made). I applied the boot options and installed 6.14. In the meantime, is there a list of recommended NVMe drives that people don't have issues with, or is it basically just the 990 Pro and some enterprise drives?
 
Hi Team,

After conducting intensive tests with several 4TB 990 Pro SSDs with heatsinks, the issue now seems to be resolved. Here is my functional configuration:
  • Kernel Linux 6.14.5-1-bpo12-pve
  • GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
Can others confirm so that we can flag this thread as "[solved]"?
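In case it helps anyone applying this: on a stock PVE install the options go into /etc/default/grub and the boot configuration has to be regenerated afterwards, roughly like this (systemd-boot / ZFS-on-root installs use proxmox-boot-tool instead):

  # add the options to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
  sudo update-grub          # or: proxmox-boot-tool refresh
  sudo reboot
  # verify the options are active after the reboot
  cat /proc/cmdline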
 