Ceph: High Latency on NVMe SSD

wursti

New Member
Dec 19, 2025
Hi there,
I am having trouble with OSD latency, ranging between 4 and 20 milliseconds:

Code:
root@pve2:~# ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  3                  11                 11
  0                  12                 12
  2                  10                 10
  4                  12                 12
  5                  11                 11
  6                  12                 12
  7                  13                 13
  8                  13                 13
  9                  11                 11
 10                  15                 15
 11                  12                 12
 12                  12                 12
 13                  11                 11
 14                  11                 11
  1                  17                 17
 15                   9                  9


The servers are somewhat old; I just moved an existing setup from VMware to Proxmox.

Hardware Setup:
- 4 Servers (2x Intel Xeon 4214, 256 GB DDR4, Mellanox ConnectX-3 40 Gbit/s, 4x 2 TB Samsung 990 Pro NVMe, 2x 256 GB SATA SSD on HW RAID 1 for the OS)
- 2x Juniper QFX5100 in Virtual Chassis mode (2x 40 Gbit/s via DAC per host, 4x 40 Gbit/s DAC VC interconnect)

Network Setup:
LACP bond --> VLANs --> bridges --> VMs/Ceph.
I have separate VLANs for the Ceph frontend and backend.
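For reference, a minimal /etc/network/interfaces sketch of that layout (interface names, VLAN IDs and addresses are placeholders, not my actual config):
Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp65s0 enp65s0d1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

# Ceph public (frontend) VLAN
auto bond0.100
iface bond0.100 inet static
        address 10.10.100.11/24

# Ceph cluster (backend) VLAN
auto bond0.101
iface bond0.101 inet static
        address 10.10.101.11/24

# VLAN-aware bridge for VM traffic
auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes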

Software Setup / Tuning:
- PVE 9.1.2 with Ceph 19.2.3, size 3, 1024 PGs
- IOMMU and all energy-saving features (C-states) deactivated (verification sketch below).
- Switches in "cut-through" forwarding mode, latency under 0.1 ms.
- iperf shows 50 Gbit/s+ between hosts.
- Local NVMe performance as expected.
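For the record, the C-state settings can be double-checked on each node roughly like this (example commands, output omitted):
Code:
# list the idle states the kernel currently exposes and uses
cpupower idle-info | head -n 20
# 1 = state disabled, 0 = state still enabled
grep -H . /sys/devices/system/cpu/cpu0/cpuidle/state*/disable
# boot parameters (e.g. intel_idle.max_cstate=0 processor.max_cstate=1)
cat /proc/cmdline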

It's just a bit slow IOPS-wise:
A single queue at 4K results in ~5K IOPS.
A queue depth of 64 yields about 60K IOPS.
Sequential R/W bandwidth is somewhere between 5 and 10 GB/s, as expected.
Tests are performed from within a Windows 11 VM, VirtIO, cache mode Write Back.

So everything is fine, it's just not what I expect from 16 NVMe drives.
So I try to bring the OSD latency down; at least I think that this is my problem.
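For reference, the 4K tests above correspond roughly to something like this with fio (the target path is a placeholder; writing to a device destroys its contents, so only use a scratch disk):
Code:
# 4K random write, queue depth 1 (latency-bound)
fio --name=qd1 --filename=/dev/sdX --rw=randwrite --bs=4k \
    --iodepth=1 --direct=1 --runtime=60 --time_based
# 4K random write, queue depth 64 (throughput-bound)
fio --name=qd64 --filename=/dev/sdX --rw=randwrite --bs=4k \
    --iodepth=64 --direct=1 --runtime=60 --time_based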
 

Attachments

  • OSD_LATENCY_NVME.png (67.9 KB)
The Samsung 990 Pro does not have PLP from what I can see. Generally PLP/enterprise storage is recommended for PVE/Ceph, so it could just be the drives.

If the Windows VM is using the "host" CPU type, try a generic x86-64-v2 or v3 CPU type.
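For example, something along these lines (VM ID 100 is a placeholder):
Code:
qm set 100 --cpu x86-64-v2-AES
# or
qm set 100 --cpu x86-64-v3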
 
So everything is fine, it's just not what I expect from 16 NVMe drives.
16 consumer NVMe drives. Every write Ceph does is a sync write, and any drive without PLP will show high latency and, once its cache fills, poor sequential performance. Keep in mind that each write goes to 3 disks, and besides the data itself Ceph has to write to the RocksDB of each OSD to keep track of the objects it holds. Also, you should benchmark from the host itself to get the hardware performance, then from the VMs to get the VM performance, and then correlate the results.
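A rough sketch of that host-side baseline (pool and image names are placeholders, not from this thread):
Code:
# raw cluster 4K write performance, 1 and 64 concurrent ops
rados bench -p testpool 60 write -b 4096 -t 1
rados bench -p testpool 60 write -b 4096 -t 64
# 4K random write against an RBD image, bypassing the VM layer
fio --name=rbd-qd1 --ioengine=rbd --pool=testpool --rbdname=fio-test \
    --rw=randwrite --bs=4k --iodepth=1 --direct=1 \
    --runtime=60 --time_based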
 
  • Like
Reactions: gurubert
Thanks very much for your replies!

To prove the impact of missing PLP, I found a setting which is OFC NOT recommended for a production setup:
/etc/ceph/ceph.conf
Code:
bluestore_prefer_dm_nvme_plp = 0

It should make the thing perform, but it doesn't seem to have any effect.
I couldn't find anything about it in the official manual.
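In case it helps anyone, the cluster itself can be asked whether it knows an option at all, and what an OSD is actually running with:
Code:
# ask Ceph whether the option exists and what it does
ceph config help bluestore_prefer_dm_nvme_plp
# show which (non-default) options a running OSD actually uses
ceph config show osd.0 | grep -i bluestore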

One thing cuts the latency down to less than half. But it's just a wipe of the drive (here: osd.0):
!!!ATTENTION THIS ERASES THE DISK!!!
Code:
nvme format /dev/nvme0n1 --ses=1
Result:
Code:
root@pve2:~# ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  0                   4                  4
  3                  10                 10
  2                  10                 10
  4                  11                 11
  5                  71                 71
  6                  11                 11
  7                  12                 12
  8                  11                 11
  9                   9                  9
 10                  12                 12
 11                  11                 11
 12                  10                 10
 13                  11                 11
 14                  11                 11
  1                   9                  9
 15                  10                 10


If I got it right, the high latency is expected because Ceph doesn't use the volatile write cache.
When I set "feature 6" (Volatile Write Cache) to zero, the latency rises to over 50 milliseconds.
Code:
nvme get-feature /dev/nvme0 -f 0x06 -H
get-feature:0x06 (Volatile Write Cache), Current value:0x00000001
        Volatile Write Cache Enable (WCE): Enabled
nvme set-feature /dev/nvme0 -f 0x06 --value=0 #WC Disable
nvme get-feature /dev/nvme0 -f 0x06 -H
get-feature:0x06 (Volatile Write Cache), Current value:0x00000000
        Volatile Write Cache Enable (WCE): Disabled

So it already performs that weakly while using the drive's write cache.
Without it, it delivers roughly 1/5 the IOPS of an old desktop HDD.
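To quantify that outside of Ceph, a sync-write fio run directly against a spare drive shows what it can do when every write has to be flushed (device path is a placeholder; this overwrites the device):
Code:
# WARNING: writes directly to the device and destroys its contents
fio --name=sync-write --filename=/dev/nvme0n1 --rw=randwrite --bs=4k \
    --iodepth=1 --numjobs=1 --direct=1 --sync=1 \
    --runtime=30 --time_based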

Besides that, I can run dozens of Windows VMs on that Ceph store without stressing it too much.
With VirtIO and the writeback caching policy, it gives me respectable bandwidth and IOPS:

Attachments

  • Ceph_Nvme_IOPS.png
  • Ceph_Nvme_Throughput.png
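For completeness, the VM disk setup behind those numbers looks roughly like this (VM ID, storage and disk names are placeholders):
Code:
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 ceph-pool:vm-100-disk-0,cache=writeback,discard=on,iothread=1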