cluster performance degradation

I created a 3-node cluster with Ceph and HA; all three nodes have an Enterprise license. I created about 10 VMs spread across the three nodes. For the first week the VMs ran very well, with excellent performance, but for the past few days there has been a huge degradation in performance. In the resource info everything looks normal for both CPU and RAM, and I can't understand why this degradation is happening. I believe the problem is Ceph, because if I migrate a VM to a local disk it runs fine, but if I put it on Ceph storage the performance drops a lot. Can someone tell me what to look at, and where, to understand the reason for this degradation?
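As a starting point for the question above, a read-only snapshot of the Ceph state usually narrows things down. The sketch below assumes the ceph CLI and an admin keyring are available on the node where it runs; it only wraps standard status commands, and the function name ceph_snapshot is made up for illustration.

```shell
#!/bin/sh
# Read-only Ceph health snapshot; nothing here changes cluster state.
# Assumption: run as root on a cluster node that has the ceph CLI installed.

ceph_snapshot() {
    if ! command -v ceph >/dev/null 2>&1; then
        echo "ceph CLI not found on this host"
        return 0
    fi
    ceph -s              # overall health, recovery/backfill activity, client I/O
    ceph health detail   # expands warnings such as slow ops or nearfull OSDs
    ceph osd perf        # per-OSD commit/apply latency in milliseconds
    ceph osd df tree     # per-OSD usage and PG count, grouped by host
}

ceph_snapshot
```

Things worth looking for in the output: a health state other than HEALTH_OK, ongoing recovery or scrubbing, and a single OSD whose latency in ceph osd perf is far above the others.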
 
On the network side there are 3 10G NICs, with the net and cluster networks configured separately, both on 10G NICs. Each node has 4 16TB disks, 256G of RAM and 20 CPU cores. The strange thing is that for about 10 days everything worked well and the VMs had wonderful performance; now, if I restart a VM, it takes 3 minutes to restart, whereas on local storage it runs very well. This is ruining me: the customers are all calling me because the VMs are slow.
 
Thanks for your reply. Can you please explain in detail what I should do? I have to add an SSD, but of what size? How should I configure it? Should I add it as an OSD on each node, or is some other configuration needed? What do you mean by WAL/DB? And I still wonder why the cluster was fine before and has now slowed down.
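On the WAL/DB question: with BlueStore, each OSD keeps its metadata in a RocksDB database (the DB) with a write-ahead log (the WAL), and both can be placed on a faster device than the data disk. Whether that applies here depends on whether the 16TB disks are spinning disks, which the thread does not state. A minimal sketch to check this on a node, assuming a Linux host that exposes /sys/block (rota_label is a hypothetical helper name):

```shell
#!/bin/sh
# Map the kernel's rotational flag to a label: 1 = spinning disk, 0 = SSD/NVMe.
rota_label() {
    case "$1" in
        1) echo "HDD" ;;
        0) echo "SSD" ;;
        *) echo "unknown" ;;
    esac
}

# Walk SATA/SAS and NVMe block devices; globs that match nothing are skipped.
for dev in /sys/block/sd* /sys/block/nvme*; do
    [ -r "$dev/queue/rotational" ] || continue
    name=$(basename "$dev")
    printf '%s\t%s\n' "$name" "$(rota_label "$(cat "$dev/queue/rotational")")"
done
```

Disks behind a RAID controller or inside a VM may report a misleading flag, so the result is a hint and not a guarantee.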
 
I ran the command systemctl status corosync to check for problems; these are the results on the 3 nodes:

[Screenshots of the systemctl status corosync output on each of the three nodes: 1734864010228.png, 1734864050163.png, 1734864132202.png]

Is it normal that the MTU is reported as 1397? I use active-backup bonding on 2 NICs.
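Corosync's knet transport reports the payload size left after its own header and encryption overhead, so a value somewhat below the interface MTU is expected, and 1397 is commonly seen on links with a standard 1500-byte MTU. To confirm that the path itself carries full-size packets, a don't-fragment ping can be used. This is a sketch assuming Linux iputils ping and IPv4; PEER is a placeholder for another node's address on the corosync network, and icmp_payload_for_mtu is a hypothetical helper.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# interface MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# PEER is a placeholder; set it to another node's address before running.
PEER="${PEER:-}"
if [ -n "$PEER" ]; then
    # -M do sets the "don't fragment" bit, so a smaller path MTU shows up as an error.
    ping -M do -c 3 -s "$(icmp_payload_for_mtu 1500)" "$PEER"
fi
```

For example, saved as check_mtu.sh (a made-up file name), it could be run as PEER=<address of another node> sh check_mtu.sh; for jumbo frames, replace 1500 with 9000.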
 
Corosync has its own option for link failover, so there is no need for active-backup:

But if management and something else are on this network, then you should separate them.
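With corosync 3, additional links are defined per node in the nodelist of /etc/pve/corosync.conf (ring0_addr, ring1_addr, and so on). The state of each link as seen from one node can be checked as sketched below; the wrapper name show_corosync_links is made up for illustration, and the commands are assumed to be run as root on a cluster node.

```shell
#!/bin/sh
# Show the state of every configured corosync link from this node's point of view.
show_corosync_links() {
    if ! command -v corosync-cfgtool >/dev/null 2>&1; then
        echo "corosync-cfgtool not found on this host"
        return 0
    fi
    corosync-cfgtool -s   # per-link status towards each other node
}

show_corosync_links
```

A link that flaps between connected and disconnected here points at the network under corosync, which is a separate question from the Ceph storage traffic.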
 
My nodes have 2 10G ports and 2 1G ports. I used one 10G port for the cluster and Ceph and the other for net and internet, and I put the 2 1G ports in active-backup on another switch. What I keep wondering about is that right after installation everything was fine, with wonderful performance, and after about a week everything degraded. What do you advise me to do?
 
I read and studied for 3 months before setting up the cluster, and I also paid for the licenses. The first week I was very surprised at how well it worked, but now everything has slowed down, and I haven't changed anything in the configuration; I just added the VMs. Is there a way to figure out where the problem is?
 
