Frequent NVMe Disconnects on Samsung PM1735 Drives

eliyahuadam

Member
Mar 26, 2020
Dear Proxmox Community,


I'm facing a critical issue with my Proxmox 8.2.2 cluster that's severely impacting our Ceph storage performance and stability. I hope you can help me troubleshoot and resolve this problem.

Environment:

Proxmox version: 8.2.2
Cluster size: 5 physical servers
Storage: Each server has one Samsung PM1735 NVMe SSD and three Intel SSDs connected via SAS (no issues with the SAS SSDs).
NVMe firmware version: EPK71H3Q
Storage Configuration: Each NVMe is part of a dedicated Ceph pool
OS: Debian 12 (Bookworm)

Issue: Our Samsung PM1735 NVMe drives are disconnecting frequently across all five servers. This is causing I/O errors, performance degradation, and potential data-integrity issues in our Ceph storage.
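In case it is useful, the impact on the Ceph side can be seen with the standard CLI tools; a minimal sketch (OSD IDs and pool names will of course differ per cluster):

# Overall cluster health and any degraded/undersized PGs
ceph -s
# Per-OSD up/down state; an OSD backed by a dropped NVMe shows as "down" here
ceph osd tree
# Detailed health messages, including which OSDs are down or out
ceph health detail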


Error Messages:
NVMe Disconnect:
nvme nvme0: Shutdown timeout set to 10 seconds
nvme nvme0: 63/0/0 default/read/poll queues
nvme 0000:18:00.0: Using 48-bit DMA addresses
nvme nvme0: resetting controller due to AER
block nvme0n1: no usable path - requeuing I/O
nvme nvme0: Device not ready; aborting reset, CSTS=0x1
I/O Callbacks Suppressed:
nvme_ns_head_submit_bio: 19 callbacks suppressed
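For anyone who wants to collect the same context, something like the following should pull the relevant kernel and PCIe error information (a sketch; the PCI address 0000:18:00.0 is taken from the log above):

# Kernel messages for the NVMe subsystem and PCIe AER around the disconnects
journalctl -k | grep -iE 'nvme|aer'
# PCIe link status and AER error counters for the drive itself
lspci -vvv -s 18:00.0 | grep -iE 'lnksta|aer|uesta|cesta'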
Steps Taken:
Kernel parameter: Set nvme_core.default_ps_max_latency_us=0 to disable NVMe power-state transitions. The issue persists.
Samsung PM1735 specs:

ps 0 : mp:25.00W operational enlat:180 exlat:180 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:7.00W active_power:19.00W active_power_workload:80K 128KiB SW
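For reference, this power-state table can be dumped with nvme-cli (a sketch; /dev/nvme0 assumed from the log above):

# Controller identify data; the "ps 0 ..." power-state table is near the end of the output
nvme id-ctrl /dev/nvme0 | grep -A1 '^ps '
# Which power state the controller is currently in (feature 2 = Power Management)
nvme get-feature /dev/nvme0 -f 2 -H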


Ceph configuration for the specific OSDs:

bluestore_min_alloc_size_ssd = 8K
osd_memory_target = 8G
osd_op_num_threads_per_shard_ssd = 8
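If it matters for comparison, these values can be read back and changed with the standard ceph config commands; a sketch, assuming they were set via the config database rather than ceph.conf (note that bluestore_min_alloc_size_ssd only takes effect when an OSD is created):

# Read the values currently applied to the OSDs
ceph config get osd bluestore_min_alloc_size_ssd
ceph config get osd osd_memory_target
ceph config get osd osd_op_num_threads_per_shard_ssd
# Example: set the memory target cluster-wide for all OSDs
ceph config set osd osd_memory_target 8G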

Has anyone else experienced similar issues with high-performance NVMes like the PM1735 in a Proxmox/Ceph environment?
Are there known PCIe or power-related issues with Proxmox 8 on Debian 12 that could cause these symptoms?



We're willing to provide any additional logs or test results as needed.

Thank you in advance for your help!
 
A search for PM1735 on this forum shows other people who are not having this issue. Maybe you can compare your motherboard and BIOS versions with theirs?
EDIT: And also the Samsung firmware version?
 
Hi Leesteken,

Thanks for the reply.

The motherboard and BIOS versions are up to date.

Apparently, the current NVMe firmware, EPK71H3Q, is very old.
Today I upgraded all five NVMe disks to the latest version (Samsung/HPE): EPK76H3Q.
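For reference, the active firmware revision and the firmware slots can be checked with nvme-cli after the upgrade (a sketch; /dev/nvme0 assumed):

# Active firmware revision per controller ("FW Rev" column)
nvme list
# Firmware slot log: which slots are populated and which one is active
nvme fw-log /dev/nvme0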

I will report back with the results.

Thanks.
 
Just to double check something, which version of the kernel is currently running on the system?

Asking because the Proxmox kernel recently went from 6.5.x to 6.8.x, and a lot of people are having strange problems caused by it.

If you're now running kernel 6.8.x, and the problems only recently started, then it's probably worth dropping back to the older 6.5.x kernel series for now and seeing if that's more stable.

Instructions for switching back to the 6.5.x kernel are here, under the "Kernel 6.8" heading:

https://pve.proxmox.com/wiki/Roadmap#Known_Issues_&_Breaking_Changes
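Roughly, the opt-in steps from that page look like this (a sketch; the exact package and version names are listed on the wiki page):

# Install the older 6.5 kernel series
apt install proxmox-kernel-6.5
# List installed kernels and pin the 6.5 one so it stays the default across reboots
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.5.13-5-pve   # substitute the exact 6.5.x version shown by "kernel list"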
 
We're running a total of 38 PM1735 drives of different sizes and firmware versions in customer servers with kernel 6.8 and don't observe these problems. Mixed boards, chipsets and CPUs (AMD EPYC/Threadripper, Intel Xeon).
Did you modify any kernel parameters or ACPI settings?
 
I just made a post about a similar issue. Hopefully this will solve your problem.

https://forum.proxmox.com/threads/r...aining-the-host-pve-8-1-10.144773/post-673410
@matrix1999 I've added the same parameter by editing "/etc/default/grub" and then running proxmox-boot-tool refresh:
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
0
cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.4-2-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_iommu=on nvme_core.default_ps_max_latency_us=0

The OS resides on ZFS disks. Should I add the parameter "nvme_core.default_ps_max_latency_us=0" to /etc/kernel/cmdline as well?
@justinclift I didn't find any known issue there related to the NVMe, but I will consider rolling back the kernel as a last resort. Thanks.

@cwt Can you please share the ACPI settings?
The actions performed so far are:
Adding nvme_core.default_ps_max_latency_us=0 to the GRUB configuration.
Upgrading the firmware from EPK71H3Q to EPK76H3Q:
https://support.hpe.com/connect/s/s...onId=MTX-77f2ff3d7ebf4999&tab=revisionHistory

Many Thanks.
 
We usually disable deep C-states and enable x2APIC, if available. Especially on AMD chipsets, enabling deep C-states has resulted in strange problems since PVE 7.x, but this was usually a problem on "smaller" systems like Threadripper/Ryzen CPUs. x2APIC also has a side effect on IOMMU.
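(Both can also be verified from the running OS without rebooting into the BIOS; a small sketch, paths assume the intel_idle driver on Intel systems:)

# Did the kernel enable x2APIC?
dmesg | grep -i x2apic
# Which C-states the idle driver currently exposes
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name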

Did you add the kernel parameter before or after the problem occurred?
 
The servers have Intel chipsets.
CPU: Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz

Do you mean these parameters:
acpi_enforce_resources=lax
pcie_aspm=off

I've added "nvme_core.default_ps_max_latency_us=0" after the NVMe failures started.
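For what it's worth, the current ASPM state can be checked before deciding on pcie_aspm=off (a sketch; the PCI address is taken from the earlier log):

# Kernel-wide ASPM policy; the bracketed entry is the active one
cat /sys/module/pcie_aspm/parameters/policy
# Per-link ASPM capability and status for the PM1735
lspci -vvv -s 18:00.0 | grep -i aspm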

Thanks.
 
I meant settings directly within the BIOS.

Side question: how long have you been running/using the NVMe drives now? Did you also check the temperatures? Although the PMs have massive heatsinks, they can become pretty hot without proper airflow. In some scenarios we made sure that additional air is blown from front to rear in the rack cases, along and between the cards.
 
Within the BIOS, the server power profile is configured to "High Performance".
There is no power-saving setting enabled for the disks/PCIe.
The server sensor values are OK (temperature/airflow). In addition, we monitor the NVMe temperatures in Grafana via Telegraf metrics sent from the Proxmox servers:
[Screenshot: Grafana dashboard showing NVMe temperatures]

The temperature is around 25-35 °C for each NVMe disk.
When an NVMe disconnected, we didn't see any issue with the temperature, power, or other sensors.
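(For a quick spot check outside of Grafana, the drive's own SMART log can be read directly; a sketch with nvme-cli, /dev/nvme0 assumed:)

# Composite and per-sensor temperatures as reported by the drive itself
nvme smart-log /dev/nvme0 | grep -i temp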

Thanks.
 
The OS resides on ZFS disks, should I add the parameter "nvme_core.default_ps_max_latency_us=0" to /etc/kernel/cmdline as well?

If your Proxmox installation runs on ZFS, then yes, you would need to update /etc/kernel/cmdline as well. Otherwise, if you have a "standard" Proxmox install, just updating the GRUB file is all you need.
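A minimal sketch of what that looks like on a ZFS/systemd-boot install (the parameters below are taken from the /proc/cmdline shown earlier in the thread):

# /etc/kernel/cmdline is a single line of parameters; append the new one, e.g.:
#   root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_iommu=on nvme_core.default_ps_max_latency_us=0
# then write the updated command line to all registered boot partitions:
proxmox-boot-tool refresh
# and confirm after the next reboot:
cat /proc/cmdline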
 
I have been having this issue since kernel 6.5.x (and even before that, but I don't remember the exact version), and it still happens now with 6.8.4-3.
 
Hi All,
After a month, I can confirm that both issues are resolved:
1- Server freezes.
2- NVMe disconnects.

Actions taken:
1- Upgraded the kernel from 6.8.4-2 to 6.8.8-2.
2- Upgraded the NVMe firmware from EPK71H3Q to EPK76H3Q.

Thanks.
 
