Dear Proxmox Community,
I'm facing a critical issue with my Proxmox 8.2.2 cluster that's severely impacting our Ceph storage performance and stability. I hope you can help me troubleshoot and resolve this problem.
Environment:
Proxmox version: 8.2.2
Cluster size: 5 physical servers
Storage: Each server has one Samsung PM1735 NVMe SSD and three Intel SSDs attached via SAS (no issues with the SAS SSDs).
NVMe firmware version: EPK71H3Q
Storage Configuration: Each NVMe is part of a dedicated Ceph pool
OS: Debian 12 (Bookworm)
Issue: Our Samsung PM1735 NVMe drives are disconnecting frequently across all five servers. This causes I/O errors, performance degradation, and potential data-integrity issues in our Ceph storage.
Error Messages:
NVMe Disconnect:
nvme nvme0: Shutdown timeout set to 10 seconds
nvme nvme0: 63/0/0 default/read/poll queues
nvme 0000:18:00.0: Using 48-bit DMA addresses
nvme nvme0: resetting controller due to AER
block nvme0n1: no usable path - requeuing I/O
nvme nvme0: Device not ready; aborting reset, CSTS=0x1
I/O Callbacks Suppressed:
nvme_ns_head_submit_bio: 19 callbacks suppressed
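For reference, this is roughly how we gather more data on an affected node when a drive drops (generic nvme-cli/pciutils commands; the PCI address 0000:18:00.0 is taken from the log above and differs per host):

# Kernel messages around the reset (AER and NVMe)
journalctl -k | grep -iE "aer|nvme"

# PCIe link status and AER error registers for the NVMe slot
lspci -vvv -s 0000:18:00.0 | grep -iE "LnkSta|UESta|CESta|AER"

# Drive health and controller error log while the device is still enumerated
nvme smart-log /dev/nvme0
nvme error-log /dev/nvme0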
Steps Taken:
Kernel parameter: set nvme_core.default_ps_max_latency_us=0 to disable NVMe power-state transitions (APST). The issue persists.
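For completeness, this is roughly how the parameter was applied on our Debian 12 / Proxmox nodes (assuming GRUB boot; hosts using systemd-boot take the parameter in /etc/kernel/cmdline followed by proxmox-boot-tool refresh instead):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0"

# Regenerate the boot config, reboot, then confirm the running value
update-grub
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us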
Samsung PM1735 specs:
ps 0 : mp:25.00W operational enlat:180 exlat:180 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:7.00W active_power:19.00W active_power_workload:80K 128KiB SW
Ceph configuration for the specific OSDs:
bluestore_min_alloc_size_ssd = 8K
osd_memory_target = 8G
osd_op_num_threads_per_shard_ssd = 8
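For context, a minimal sketch of how we keep and verify these overrides (the placement in ceph.conf and the OSD ID below are examples from our setup, not a recommendation):

# /etc/pve/ceph.conf, [osd] section (example placement)
[osd]
    bluestore_min_alloc_size_ssd = 8K
    osd_memory_target = 8G
    osd_op_num_threads_per_shard_ssd = 8

# Confirm what a running OSD actually uses (osd.0 is a placeholder ID);
# note that bluestore_min_alloc_size_ssd only takes effect when an OSD is created
ceph config show osd.0 osd_memory_target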
Has anyone else experienced similar issues with high-performance NVMe drives like the PM1735 in a Proxmox/Ceph environment?
Are there known PCIe or power-related issues with Proxmox 8 on Debian 12 that could cause these symptoms?
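On the PCIe/power side, this is what we can check and report back (generic commands; 0000:18:00.0 is the address from the log above):

# Current ASPM policy and per-device link state
cat /sys/module/pcie_aspm/parameters/policy
lspci -vvv -s 0000:18:00.0 | grep -iE "LnkCap|LnkCtl|LnkSta|ASPM"

We have not yet tried booting with pcie_aspm=off; feedback on whether that is a sensible next test would be appreciated.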
We're willing to provide any additional logs or test results as needed.
Thank you in advance for your help!