We've been running Proxmox with Ceph for years. Our typical cluster looks like this:
Proxmox 8.3.5
Ceph 18.2.4
10 servers
3 enterprise SSD OSDs per server
20 Gb/s between the servers
10 Gb/s between the VMs and the Ceph public network, for CephFS mounts
1 pool for VM deployment
2 subvolumes/pools: 1 for Elasticsearch data and 1 for metadata
20 Kubernetes VMs
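For reference, this is roughly how I check the pool/filesystem layout; nothing below is specific to our naming, it's just the stock commands:

```bash
# List all pools with their replication settings (size/min_size), pg_num, etc.
ceph osd pool ls detail

# Show which data/metadata pools back the CephFS that Elasticsearch mounts
ceph fs status
```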
We recently added the cephfs-csi driver to our Kubernetes cluster, created a couple of CephFS volumes, and pointed Elasticsearch at the newly created StorageClass. It works, but I'm noticing periodic commit/apply latency spikes. Without Elasticsearch, latency sits at a steady 0 ms. With Elasticsearch (some pods run with 2-6 replicas), baseline latency rises to 0-2 ms, with periodic spikes into the tens of milliseconds (not always the same value) across all OSDs at once. I've seen spikes as high as 99 ms.
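For context, the StorageClass is basically the stock ceph-csi CephFS example; the cluster ID, secret names, and namespace below are placeholders rather than our real values:

```bash
# Roughly the StorageClass the Elasticsearch PVCs use (values are placeholders)
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-sc
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: <ceph-cluster-fsid>
  fsName: cephfs
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/controller-expand-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
reclaimPolicy: Delete
allowVolumeExpansion: true
EOF
```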
I've ruled out networking (iperf between the nodes and from the VMs to the Proxmox nodes), doubled `osd_memory_target` to 8 GiB, doubled `bluestore_cache_size_ssd` to 6 GiB, and reduced size/min_size from 3/2 to 2/1 on the Elasticsearch pools. I'm still seeing the spikes.
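In case the exact steps matter, this is roughly what I ran; the pool name is a placeholder and the config changes were applied to all OSDs:

```bash
# Network check between Proxmox nodes and from the VMs to the nodes
iperf3 -c <other-node-or-proxmox-host>

# Doubled the OSD memory target (to 8 GiB) and the BlueStore SSD cache (to 6 GiB)
ceph config set osd osd_memory_target 8589934592
ceph config set osd bluestore_cache_size_ssd 6442450944

# Dropped replication on the Elasticsearch pools from 3/2 to 2/1 (repeated per pool)
ceph osd pool set <elastic-pool> size 2
ceph osd pool set <elastic-pool> min_size 1
```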
Now I'm heading down the path of the CephFS volume and Elasticsearch itself, but I'm not sure what to look for. Any help would be much appreciated.
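So far the only concrete things I can think to check are whether Elasticsearch's translog/refresh behaviour is producing lots of small fsyncs, and what the MDS and OSDs report while a spike happens; the index name and MDS id below are placeholders:

```bash
# Elasticsearch side: translog durability (fsync per request vs. async) and refresh interval
curl -s 'localhost:9200/<index>/_settings?include_defaults=true&pretty' | grep -E -A 3 'translog|refresh_interval'

# Ceph side: per-OSD commit/apply latency, and MDS counters during a spike
ceph osd perf
ceph daemon mds.<active-mds> perf dump   # run on the node hosting the active MDS
```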