We've been running Proxmox with Ceph for years. Our typical cluster looks like this:
Proxmox 8.3.5
Ceph 18.2.4
10 servers
3 enterprise SSD OSDs per server
20 Gb/s between the servers
10 Gb/s between the VMs and the Ceph public network, for CephFS mounts
1 pool for VM deployment
2 subvolumes/pools: 1 for Elasticsearch data and 1 for metadata
20 Kubernetes VMs
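For reference, this is roughly how I check the pool/filesystem layout; nothing below is specific to our naming, it's just the stock commands:

```bash
# List all pools with their replication settings (size/min_size), pg_num, etc.
ceph osd pool ls detail

# Show which data/metadata pools back the CephFS that Elasticsearch mounts
ceph fs status
```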
We recently added the cephfs-csi driver to our Kubernetes cluster, created a couple of CephFS volumes, and pointed Elasticsearch at the newly created StorageClass. It works, but I'm noticing periodic commit/apply latency spikes. Without Elasticsearch, latency sits at a steady 0 ms. With Elasticsearch (some pods run with 2-6 replicas), baseline latency rises to 0-2 ms, with periodic spikes into the tens of milliseconds (not always the same value) across all OSDs at once. I've seen spikes as high as 99 ms.
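For context, the StorageClass is basically the stock ceph-csi CephFS example; the cluster ID, secret names, and namespace below are placeholders rather than our real values:

```bash
# Roughly the StorageClass the Elasticsearch PVCs use (values are placeholders)
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-sc
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: <ceph-cluster-fsid>
  fsName: cephfs
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/controller-expand-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
reclaimPolicy: Delete
allowVolumeExpansion: true
EOF
```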
I've ruled out networking (iperf between the nodes and from the VMs to the Proxmox nodes), doubled `osd_memory_target` to 8 GiB, doubled `bluestore_cache_size_ssd` to 6 GiB, and reduced size/min_size from 3/2 to 2/1 on the Elasticsearch pools. I'm still seeing the spikes.
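In case the exact steps matter, this is roughly what I ran; the pool name is a placeholder and the config changes were applied to all OSDs:

```bash
# Network check between Proxmox nodes and from the VMs to the nodes
iperf3 -c <other-node-or-proxmox-host>

# Doubled the OSD memory target (to 8 GiB) and the BlueStore SSD cache (to 6 GiB)
ceph config set osd osd_memory_target 8589934592
ceph config set osd bluestore_cache_size_ssd 6442450944

# Dropped replication on the Elasticsearch pools from 3/2 to 2/1 (repeated per pool)
ceph osd pool set <elastic-pool> size 2
ceph osd pool set <elastic-pool> min_size 1
```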
Now I'm heading down the path of the CephFS volume and Elasticsearch itself, but I'm not sure what to look for. Any help would be much appreciated.
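So far the only concrete things I can think to check are whether Elasticsearch's translog/refresh behaviour is producing lots of small fsyncs, and what the MDS and OSDs report while a spike happens; the index name and MDS id below are placeholders:

```bash
# Elasticsearch side: translog durability (fsync per request vs. async) and refresh interval
curl -s 'localhost:9200/<index>/_settings?include_defaults=true&pretty' | grep -E -A 3 'translog|refresh_interval'

# Ceph side: per-OSD commit/apply latency, and MDS counters during a spike
ceph osd perf
ceph daemon mds.<active-mds> perf dump   # run on the node hosting the active MDS
```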