High latency on Ceph. Poor Performance for VMs

leozinho

New Member
Nov 12, 2024
Greetings to all,
I am seeking assistance with a challenging issue related to Ceph that has significantly impacted the company I work for.

Our company has been operating a cluster with three nodes hosted in a data center for over 10 years. This production environment runs on Proxmox (version 6.3.2) and Ceph (version 14.2.15). From a performance perspective, our applications function adequately.

To address new business requirements, such as the need for additional resources for virtual machines (VMs) and to support the company’s growth, we deployed a new cluster in the same data center. The new cluster also consists of three nodes but is considerably more robust, featuring increased memory, processing power, and a larger Ceph storage capacity.

The goal of this new environment is to migrate VMs from the old cluster to the new one, ensuring it can handle the growing demands of our applications. This new setup operates on more recent versions of Proxmox (8.2.2) and Ceph (18.2.2), which differ significantly from the versions in the old environment.

The Problem
During the gradual migration of VMs to the new cluster, we encountered severe performance issues in our applications—issues that did not occur in the old environment. These performance problems rendered it impractical to keep the VMs in the new cluster.

An analysis of Ceph latency in the new environment revealed extremely high and inconsistent latency, as shown in the screenshot below:
<<Ceph latency screenshot - new environment>>
imagem novo ambiente.JPG


To mitigate operational difficulties, we reverted all VMs back to the old environment. This resolved the performance issues, ensuring our applications functioned as expected without disrupting end-users. After this rollback, Ceph latency in the old cluster returned to its stable and low levels:
<<Ceph latency screenshot - old environment>>
image (6).png


With the new cluster now available for testing, we need to determine the root cause of the high Ceph latency, which we suspect is the primary contributor to the poor application performance.

Tests Performed in the New Environment
-Deleted the Ceph OSD on Node 1. Ceph took over 28 hours to synchronize. We then recreated the OSD on Node 1.
-Deleted the Ceph OSD on Node 2. Ceph also took over 28 hours to synchronize. We then recreated the OSD on Node 2.
-Moved three VMs to the local backup disk of PM1.
-Destroyed the Ceph cluster.
-Created local storage on each server using the virtual disk (RAID 0) previously used by Ceph.
-Migrated VMs to the new environment and conducted a stress test to check for disk-related issues.
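
For anyone wanting to repeat the disk check from inside a VM, here is a minimal sketch of a synchronous-write latency probe. It is not the stress test we ran, and fio is the usual tool for serious benchmarking. The function names, the 4 KiB block size, and the scratch file name are illustrative assumptions. It writes small blocks, calls fsync after each one, and reports median, p99, and maximum latency, so the same numbers can be compared between a Ceph-backed disk and a local one.

```python
import os
import statistics
import tempfile
import time


def sync_write_latencies(path, block_size=4096, count=200):
    """Write `count` blocks to `path`, fsync after each, return latencies in ms."""
    block = os.urandom(block_size)
    latencies_ms = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # force the block to stable storage before the timer stops
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies_ms


def summarize(latencies_ms):
    """Return median, approximate p99, and maximum of the measured latencies."""
    ordered = sorted(latencies_ms)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Hypothetical target: a scratch file in a temporary directory. Point the
    # path at a directory on the storage under test to measure that storage.
    with tempfile.TemporaryDirectory() as tmp:
        print(summarize(sync_write_latencies(os.path.join(tmp, "probe.bin"))))
```

A large gap between the median and the maximum on the Ceph-backed disk, but not on the local one, would point at the storage layer rather than the VM.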


Questions and Requests for Input

-Are there any additional tests you would recommend to better understand the performance issues in the new environment?

-Have you experienced similar problems with Ceph when transitioning to a more powerful cluster?

-Could this be caused by a Ceph configuration issue?

-The Ceph storage in the new cluster is larger, but the network interface is limited to 1Gbps. Could this be a bottleneck? Would upgrading to a 10Gbps network interface be necessary for larger Ceph storage?

-Could these issues stem from incompatibilities or changes in the newer versions of Proxmox or Ceph?

-Is there a possibility of hardware problems? Note that hardware tests in the new environment have not revealed any issues.

Thank you in advance for your insights and suggestions.

Edit: The first screenshot was taken during our disk testing, which is why one of the OSDs was in the OUT state. I've updated the post with a more recent image.
 
What disk models are in the old and new clusters?
Hi Gabriel,
Thank you for taking the time to look into this issue.
Here are the details for both the old and new clusters:

Old Cluster
Controller Model and Firmware:
pm1: Smart Array P420i Controller, Firmware Version 8.32
pm2: Smart Array P420i Controller, Firmware Version 8.32
pm3: Smart Array P420i Controller, Firmware Version 8.32

The old cluster uses KINGSTON SSDs:
pm1: SCEKJ2.3 (1920 GB) x2, SCEKJ2.7 (960 GB) x2
pm2: SCEKJ2.7 (1920 GB) x2
pm3: SCEKJ2.7 (1920 GB) x2

New Cluster
Controller Model and Firmware:
pmx1: Smart Array P440ar Controller, Firmware Version 7.20
pmx2: Smart Array P440ar Controller, Firmware Version 6.88
pmx3: Smart Array P440ar Controller, Firmware Version 6.88
The new cluster also uses KINGSTON SSDs but with larger capacities:
pmx1: SCEKH3.6 (3840 GB) x4
pmx2: SCEKH3.6 (3840 GB) x2
pmx3: SCEKJ2.8 (3840 GB), SCEKJ2.7 (3840 GB)

Given these differences in SSD models, controller types, and firmware versions between the old and new environments, could these factors be contributing to the performance and latency issues we’re experiencing with Ceph?
 
Just glancing at this: upgrade both the Ceph public network and the Ceph cluster network to at least 10G.
 
-The Ceph storage in the new cluster is larger, but the network interface is limited to 1Gbps. Could this be a bottleneck? Would upgrading to a 10Gbps network interface be necessary for larger Ceph storage?

Yes, 10Gbps is the usual recommended minimum for a production environment, and 25Gbps is fairly common. Some people even go for 100Gbps when they set up a new cluster.
 
In your screenshot of the new cluster, the Ceph OSD on node3 is down. Depending on your chosen policy, Ceph itself may be trying to rebuild redundancy (3 copies?), which would result in the high latency you observed. First bring up all OSDs and check that everything is OK, then run your VMs and check again.
1Gbit is too slow, because by default Ceph has to write 3 copies of everything a client writes.
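
To put rough numbers on that, here is a back-of-the-envelope sketch. It rests on a simplifying assumption that is mine, not a measurement from this cluster: the node holding the primary OSD forwards (replicas - 1) copies of every write over the same link, and protocol overhead is ignored. The function names are made up for illustration, and 3840 GB is simply the capacity of one of the SSDs listed above.

```python
def link_mb_per_s(link_gbps):
    """Convert a link speed in Gbit/s to MB/s (decimal), ignoring protocol overhead."""
    return link_gbps * 1000.0 / 8.0


def replicated_write_ceiling(link_gbps, replicas=3):
    """Rough upper bound, in MB/s, on sustained writes through one primary node.

    Assumption: the primary's node sends (replicas - 1) copies of every write
    out over the same link, so its outbound traffic is the limiting factor.
    """
    if replicas < 2:
        return link_mb_per_s(link_gbps)
    return link_mb_per_s(link_gbps) / (replicas - 1)


def hours_to_copy(data_gb, link_gbps):
    """Best-case hours to move `data_gb` gigabytes over the link at line rate."""
    return data_gb * 1000.0 / link_mb_per_s(link_gbps) / 3600.0


if __name__ == "__main__":
    for gbps in (1, 10, 25):
        ceiling = replicated_write_ceiling(gbps)
        hours = hours_to_copy(3840, gbps)
        print(f"{gbps} Gbit/s: ~{ceiling:.1f} MB/s write ceiling, "
              f"~{hours:.1f} h to copy 3840 GB")
```

Under these assumptions a 1 Gbit/s link carries at most 125 MB/s in total, so client writes and any recovery traffic compete for very little bandwidth. Real numbers will be lower once overhead and concurrent VM traffic are included.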
 
I'll repost what I posted here - https://www.reddit.com/r/Proxmox/co...=web3xcss&utm_term=1&utm_content=share_button

Your OSDs are backed by RAID0 groups on the HBA. This is not supported, and it is why your OSD weights are INSANELY high. On top of this, your new cluster has an OSD that is out, which forces a rebalance/peering. You need to redo the OSD configs and give Proxmox direct access to the Kingston SSDs that are behind those RAID groups.

Until this is done, nothing else matters.
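
As a rough illustration of the weight point: by default Ceph sets an OSD's CRUSH weight to the device capacity in TiB, so an OSD built on a RAID0 volume gets the combined weight of all the disks behind it. The sketch below only does that arithmetic using the 3840 GB disks listed earlier in the thread. The helper names are made up, and the weights a real cluster reports will differ slightly because of formatting overhead.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def expected_crush_weight(size_gb):
    """Approximate default CRUSH weight: capacity in TiB for a size in decimal GB."""
    return size_gb * 10 ** 9 / TIB


def raid0_osd_weight(disk_sizes_gb):
    """Weight of a single OSD built on a RAID0 volume spanning the given disks."""
    return expected_crush_weight(sum(disk_sizes_gb))


if __name__ == "__main__":
    single = expected_crush_weight(3840)
    striped = raid0_osd_weight([3840] * 4)
    print(f"one 3840 GB SSD as its own OSD: weight ~{single:.2f}")
    print(f"four 3840 GB SSDs in one RAID0 OSD: weight ~{striped:.2f}")
```

With one OSD per physical disk, each OSD would carry a weight of about 3.5, and losing a single SSD would take out one small OSD instead of the whole striped volume.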
 
I updated the first image from the new environment, but you're right. Thanks for the reply! I will review the solutions and share them with my team. I will reply to you with more info once I get it from my team.
 
Hi, do you have any news about this problem? I'm having the same problem, but I have a 10G NIC, and I don't use RAID. I have all the disks checked by Proxmox. For me everything worked fine for about 10 days and now I have degradation on all the VMs. For me too, if I move the VMs to local disks, everything works fine.