Problems with Ceph cluster recovery after upgrading to version 18.2.2

pablomart81

Since we upgraded to version 18.2.2, the cluster has been noticeably slow; when removing a disk, it takes a whole day to return to a "health ok" state.
So that you understand the architecture a little: I have 3 identical servers, each with 2 sockets of 24 cores (2 threads per core), 512 GB of RAM, 6 x 2 TB SAS disks for Ceph, 1 SSD for the Ceph database, and 2 SSDs in a mirrored RAID for the system.
Ceph has two dedicated network interfaces; the link is 20 Gb in LACP mode, and it has always been set up this way.
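For reference, a minimal way to confirm the bond actually negotiated the expected speed (assuming standard Linux bonding and ethtool; bond0 and the NIC name below are placeholders for your own interface names):

# Bond mode, LACP partner state and per-slave link speed (bond0 is a placeholder name)
cat /proc/net/bonding/bond0
# Negotiated speed of an individual member NIC (replace enp65s0f0 with your interface)
ethtool enp65s0f0 | grep -E 'Speed|Duplex|Link detected'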

When checking the cluster everything seems to be correct, but whenever the cluster degrades (because a disk has to be removed and replaced, or one of the servers is shut down for maintenance), the rebuild is extremely slow, to the point that the VMs' disk access times out.
Has anyone else had the same problem? Any suggestions?
If you need any command output, let me know and I'll send it to you.
 
Does the network for Ceph actually give you the expected speeds? iperf can be used to benchmark the network.
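For example, a minimal sketch with iperf3 (10.0.0.1 is only a placeholder; use the Ceph network IP of the target node):

# On the first node, start the iperf3 server
iperf3 -s
# On another node, test towards the first node's Ceph network IP with a few parallel streams
iperf3 -c 10.0.0.1 -P 4 -t 30
# And the reverse direction as well
iperf3 -c 10.0.0.1 -P 4 -t 30 -R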

If recovery/rebalance brings the cluster to a point where you see performance issues in the guests, it could also be that the cluster does not have enough reserves to handle the additional load. As a quick measure, you could assign a lower priority to recovery/rebalance. See https://pve.proxmox.com/wiki/Ceph_mClock_Tuning
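For example, with the mClock scheduler you could temporarily prioritize client I/O over recovery (a sketch using the standard mClock profiles; see the wiki above for details and per-OSD overrides):

# Show the currently active profile on one OSD (osd.0 is just an example ID)
ceph config show osd.0 osd_mclock_profile
# Give client I/O precedence over recovery/rebalance on all OSDs
ceph config set osd osd_mclock_profile high_client_ops
# Switch back to the default once recovery pressure is no longer an issue
ceph config set osd osd_mclock_profile balanced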
 
Thank you for the prompt response.
The problem is on two servers, 2 and 3; each of them has 2 degraded network cards.

Thank you very much
 
Does anyone know what causes these log entries and how they affect the functioning of Ceph?

Aug 26 07:46:03 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:47:35 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:50:38 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:54:34 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:55:52 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:56:56 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:57:11 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:58:27 pve01-poz kernel: libceph: osd16 (1)10.0.0.2:6817 socket closed (con state OPEN)
Aug 26 07:59:15 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:59:19 pve01-poz kernel: libceph: osd5 (1)10.0.0.2:6816 socket closed (con state OPEN)
Aug 26 08:00:03 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 08:03:11 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 08:04:05 pve01-poz kernel: libceph: osd15 (1)10.0.0.1:6809 socket closed (con state OPEN)
Aug 26 08:12:01 pve01-poz kernel: libceph: osd3 (1)10.0.0.1:6807 socket closed (con state OPEN)
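Given the degraded network cards mentioned above, one thing worth checking is whether these "socket closed" messages line up with errors on the Ceph interfaces (a sketch; the interface name is a placeholder):

# RX/TX error and drop counters for the Ceph NIC (replace enp65s0f0 with your interface)
ip -s link show enp65s0f0
# Driver-level counters such as CRC errors, if the driver exposes them
ethtool -S enp65s0f0 | grep -iE 'err|drop|crc'
# Cluster health and current OSD up/down status while the messages appear
ceph -s
ceph osd tree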
 
