Hello Proxmox Community,
I hope everyone is doing well. I'm encountering significant commit and apply latency in my Ceph cluster setup and would greatly appreciate your insights and advice on diagnosing and resolving this issue.
Setup Overview:
- Ceph Version: 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) - Reef (stable)
- Proxmox VE Version: 8.2.0
- Running Kernel: 6.8.4-2-pve
- Cluster Configuration:
- OSDs Distribution:
- Host "light":
- OSD.2 (SSD) - Samsung_SSD_870_QVO_4TB - bluestore
- Host "nasprox":
- OSD.1 (SSD) - Samsung_SSD_870_QVO_4TB - bluestore
- Host "slim":
- OSD.0 (SSD) - Samsung_SSD_870_QVO_4TB - bluestore
- Weights: All OSDs have a weight of 3.63869.
- Status: All OSDs are currently up.
I've been experiencing high commit and apply latency within the Ceph cluster, as evidenced by the following OSD performance metrics obtained from ceph osd perf:
- OSD.2 (Host "light"):
- Commit Latency: 478 ms
- Apply Latency: 478 ms
- OSD.1 (Host "nasprox"):
- Commit Latency: 434 ms
- Apply Latency: 434 ms
- OSD.0 (Host "slim"):
- Commit Latency: 488 ms
- Apply Latency: 488 ms
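For reference, these values stay high over repeated samples. A small filter (sketch only; the 50 ms cutoff is an arbitrary illustrative threshold, not a Ceph default) reproduces the picture from the perf output. On a live cluster the here-doc would be replaced with the real command, e.g. `ceph osd perf | tail -n +1`:

```shell
#!/bin/sh
# Flag OSDs whose commit latency exceeds a threshold.
# The here-doc below mirrors the numbers in this post.
THRESHOLD_MS=50   # illustrative cutoff; healthy SSD OSDs are usually far below this

awk -v t="$THRESHOLD_MS" 'NR > 1 && $2 > t {
    printf "osd.%s commit=%sms apply=%sms\n", $1, $2, $3
}' <<'EOF'
osd  commit_latency(ms)  apply_latency(ms)
2    478                 478
1    434                 434
0    488                 488
EOF
```

All three OSDs trip even a generous threshold, so this does not look like a single bad disk.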
Networking:
I use a dedicated Mikrotik CRS309-1G-8S+ switch (10GbE SFP+) for my Ceph cluster network, and a separate CRS309-1G-8S+ for my public network as well as my PVE cluster network. On some of the nodes I use a dual-port NIC for those two networks. MTU is set to 9000 in PVE on the Ceph cluster network ports.
Ping tests report an average of ~0.150 ms between the nodes on the Ceph cluster network.
iperf reports full utilization of the available bandwidth.
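To rule out a silent MTU mismatch somewhere on the path (which fragments or drops large frames even when ordinary pings succeed), I also verified jumbo frames end-to-end with a non-fragmenting ping at the maximum payload. A minimal sketch; the peer IP is a placeholder:

```shell
#!/bin/sh
# An MTU of 9000 leaves room for an 8972-byte ICMP payload
# (9000 - 20-byte IP header - 8-byte ICMP header).
MTU=9000
PAYLOAD=$((MTU - 20 - 8))
# -M do sets the don't-fragment flag (Linux iputils ping), so the ping
# fails loudly if any hop's MTU is below 9000 instead of fragmenting.
echo "ping -M do -s ${PAYLOAD} -c 5 <ceph-cluster-peer-ip>"
```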
- What could be causing the high commit and apply latency (which are identical!) in my Ceph cluster?
- Are there specific OSD tuning parameters or configurations I should review or modify to optimize performance?
- How can I reduce commit and apply latency to improve overall cluster performance?
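In case it helps with diagnosis, one quick check I can run is a synchronous 4k write test on the OSD disks themselves, since BlueStore commits are fsync-heavy. This is only a sketch; the target path is a placeholder and should point at a scratch file on the same device as the OSD:

```shell
#!/bin/sh
# Measure synchronous small-block write throughput on the OSD disk.
# oflag=dsync forces every 4k block to be flushed to stable storage,
# which approximates the write pattern that drives commit latency.
TESTFILE="${1:-/tmp/ceph-sync-test}"   # placeholder; use a path on the OSD disk
dd if=/dev/zero of="$TESTFILE" bs=4k count=200 oflag=dsync 2>&1 | tail -n 1
rm -f "$TESTFILE"
```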