Frequent NVMe Disconnects Samsung PM1735 Drives

eliyahuadam

Member
Mar 26, 2020
Dear Proxmox Community,


I'm facing a critical issue with my Proxmox 8.2.2 cluster that's severely impacting our Ceph storage performance and stability. I hope you can help me troubleshoot and resolve this problem.

Environment:

Proxmox version: 8.2.2
Cluster size: 5 physical servers
Storage: Each server has one Samsung PM1735 NVMe SSD and three Intel SSDs attached via SAS (no issues with the SAS SSDs).
NVMe firmware version: EPK71H3Q
Storage Configuration: Each NVMe is part of a dedicated Ceph pool
OS: Debian 12 (Bookworm)

Issue: Our Samsung PM1735 NVMe drives are disconnecting frequently across all five servers. This is causing I/O errors, performance degradation, and potential data integrity issues in our Ceph storage.


Error Messages:
NVMe Disconnect:
nvme nvme0: Shutdown timeout set to 10 seconds
nvme nvme0: 63/0/0 default/read/poll queues
nvme 0000:18:00.0: Using 48-bit DMA addresses
nvme nvme0: resetting controller due to AER
block nvme0n1: no usable path - requeuing I/O
nvme nvme0: Device not ready; aborting reset, CSTS=0x1
I/O Callbacks Suppressed:
nvme_ns_head_submit_bio: 19 callbacks suppressed
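For anyone comparing notes: the messages above come straight from the kernel log. A minimal way to collect them on a standard Proxmox install (assuming the systemd journal) is:

journalctl -k | grep -iE 'nvme|AER'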
Steps Taken:
Kernel Parameter: Set nvme_core.default_ps_max_latency_us=0 to disable NVMe power state transitions. Issue persists.
Samsung PM1735 specs:

ps 0 : mp:25.00W operational enlat:180 exlat:180 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:7.00W active_power:19.00W active_power_workload:80K 128KiB SW
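For reference, the power-state table above and the effective APST setting can be read back like this (a sketch, assuming the nvme-cli package is installed and the drive is /dev/nvme0):

# confirm the kernel parameter actually took effect
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
# dump the controller data, including the ps 0/ps 1 power-state table and the firmware revision (fr)
nvme id-ctrl /dev/nvme0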


Ceph configuration for the specific OSDs:

bluestore_min_alloc_size_ssd = 8K
osd_memory_target = 8G
osd_op_num_threads_per_shard_ssd = 8
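For reference, these values can be checked against what the running OSDs actually use (a sketch; osd.0 is a placeholder id, and note that bluestore_min_alloc_size_ssd only takes effect when an OSD is created, not on a live daemon):

ceph config show osd.0 osd_memory_target
ceph config show osd.0 osd_op_num_threads_per_shard_ssd
ceph config show osd.0 bluestore_min_alloc_size_ssd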

Has anyone else experienced similar issues with high-performance NVMes like the PM1735 in a Proxmox/Ceph environment?
Are there known PCIe or power-related issues with Proxmox 8 on Debian 12 that could cause these symptoms?



We're willing to provide any additional logs or test results as needed.

Thank you in advance for your help!
 
A search for PM1735 on this forum shows other people not having this issue. Maybe you two can compare your motherboard and BIOS versions?
EDIT: And also the Samsung firmware version?
 
Hi Leesteken,

Thanks for the reply.

The motherboard and BIOS versions are up to date.

Apparently, the current NVMe firmware (EPK71H3Q) is very old.
Today I upgraded the five NVMe disks to the latest version (Samsung/HPE): EPK76H3Q.

I will post the results.

Thanks.
 
Just to double check something, which version of the kernel is currently running on the system?

Asking because the Proxmox kernel recently went from 6.5.x to 6.8.x, and a lot of people are having strange problems caused by it.

If you're now running kernel 6.8.x, and the problems only recently started, then it's probably worth dropping back to the older 6.5.x kernel series for now and seeing if that's more stable.

Instructions for switching back to the 6.5.x kernel are here, under the "Kernel 6.8" heading:

https://pve.proxmox.com/wiki/Roadmap#Known_Issues_&_Breaking_Changes
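As a rough sketch of what that looks like in practice (the exact 6.5 version string below is only an example; use whatever proxmox-boot-tool kernel list reports on your host):

proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.5.13-5-pve   # example version string
reboot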
 
We're running a total of 38 PM1735 drives of different sizes and firmware versions in customer servers with kernel 6.8 and don't observe these problems. Mixed boards, chipsets and CPUs (AMD EPYC/Threadripper, Intel Xeon).
Did you modify any kernel parameters or ACPI settings?
 
I just made a post on a similar issue. Hope this will solve your problem.

https://forum.proxmox.com/threads/r...aining-the-host-pve-8-1-10.144773/post-673410
@matrix1999 I've added the same parameter by editing "/etc/default/grub" and then running proxmox-boot-tool refresh:
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
0
cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.4-2-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_iommu=on nvme_core.default_ps_max_latency_us=0

The OS resides on ZFS disks; should I add the parameter "nvme_core.default_ps_max_latency_us=0" to /etc/kernel/cmdline as well?
@justinclift I didn't find any issue related to the NVMe, but I will consider that as a last resort. Thanks.

@cwt Can you please share the ACPI settings?
The actions performed so far are:
Adding nvme_core.default_ps_max_latency_us=0 to the GRUB command line.
Upgrading the firmware from EPK71H3Q to EPK76H3Q (see the verification sketch below).
https://support.hpe.com/connect/s/s...onId=MTX-77f2ff3d7ebf4999&tab=revisionHistory
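A quick way to confirm the new firmware actually took effect after the reboot (a sketch, assuming nvme-cli and /dev/nvme0):

nvme id-ctrl /dev/nvme0 | grep '^fr '   # firmware revision, should now read EPK76H3Q
nvme fw-log /dev/nvme0                  # firmware slot log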

Many Thanks.
 
We usually disable high C-states and enable x2APIC, if available. Especially on AMD chipsets, enabling high C-states has caused strange problems since PVE 7.x, but that was usually on "lower" systems like small Threadripper/Ryzen CPUs. x2APIC also has a side effect on IOMMU.
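If it helps, what the BIOS actually applied can be sanity-checked from the booted system (a sketch; these are standard kernel log and sysfs locations):

journalctl -k | grep -i x2apic                          # shows whether x2APIC mode was enabled at boot
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name    # lists the C-states the kernel is allowed to use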

Did you add the kernel parameter before or after the problem occurred?
 
The servers have Intel chipsets.
CPU: Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz

Do you mean these parameters:
acpi_enforce_resources=lax
pcie_aspm=off
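For what it's worth, the current ASPM state can be checked before forcing it off (a sketch, run as root):

cat /sys/module/pcie_aspm/parameters/policy   # active ASPM policy, e.g. [default] performance powersave ...
lspci -vv | grep -i aspm                      # per-device ASPM capability and whether it is enabled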

I've added the "nvme_core.default_ps_max_latency_us=0" parameter after the NVMe failures started.

Thanks.
 
I meant settings directly within the BIOS.

Side question: how long have you been running the NVMe drives? Did you also check temps? Although the PMs have massive heatsinks, they can become pretty hot without proper airflow. We ensured that (in some scenarios) additional air is blown from front to rear in rack cases, along and between the cards.
 
Within the BIOS, the server power profile is configured to "High Performance".
There is no power-saving setting for the disks / PCIe.
The server sensor values are OK (temperature/airflow). In addition, we are monitoring the NVMe temperature in Grafana via Telegraf metrics sent from the Proxmox servers:
[Screenshot: Grafana NVMe temperature panel]

The overall temperature is around 25-35 °C for each NVMe disk.
When an NVMe has disconnected, we didn't see any issue with the temperature, power, or other sensors.
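For a quick spot check outside Grafana, the drive's own SMART temperature can also be read directly (a sketch, assuming nvme-cli):

nvme smart-log /dev/nvme0 | grep -iE 'temperature|warning'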

Thanks.
 
The OS resides on ZFS disks, should I add the parameter "nvme_core.default_ps_max_latency_us=0" to /etc/kernel/cmdline as well?

If your Proxmox runs on ZFS, then yes, you would need to update /etc/kernel/cmdline as well. Otherwise, if you have a "standard" Proxmox install, then just updating the grub file is all you need.
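Concretely, the two boot paths are maintained like this (a sketch; the parameter goes on one line in either file):

# GRUB-booted installs: append the parameter to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then
update-grub
# ZFS-on-root installs booted via proxmox-boot-tool/systemd-boot: append it to the single line in /etc/kernel/cmdline, then
proxmox-boot-tool refresh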
 
I have been having this issue since 6.5.x (and even before, but I don't remember the exact version) and even now with 6.8.4-3.
 
Hi All,
After a month, I can confirm that both issues are resolved:
1- Server freeze.
2- NVMe disconnects.

Actions taken:
1- Upgraded the kernel from 6.8.4-2 to 6.8.8-2.
2- Upgraded the NVMe firmware from EPK71H3Q to EPK76H3Q.
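For anyone landing here later, both changes can be confirmed quickly after the reboot (a sketch, assuming nvme-cli):

uname -r                                # should report the new kernel, e.g. 6.8.8-2-pve
nvme id-ctrl /dev/nvme0 | grep '^fr '   # should report EPK76H3Q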

Thanks.
 
