Problems with Ceph cluster recovery after upgrading to version 18.2.2

pablomart81

Since we upgraded to version 18.2.2, the cluster has been noticeably slow; when removing a disk, it takes a whole day to return to a "health ok" state.
So that you understand the architecture a little: I have 3 identical servers, each with 2 sockets of 24 cores (2 threads per core), 512 GB of RAM, 6 x 2 TB SAS disks for Ceph, 1 SSD for the Ceph database, and 2 SSDs in a mirrored RAID for the system.
Ceph has two dedicated network interfaces; the link is 20 Gb in LACP mode, and it has always been set up this way.
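For reference, a minimal way to confirm the bond actually negotiated the expected speed (assuming standard Linux bonding and ethtool; bond0 and the NIC name below are placeholders for your own interface names):

# Bond mode, LACP partner state and per-slave link speed (bond0 is a placeholder name)
cat /proc/net/bonding/bond0
# Negotiated speed of an individual member NIC (replace enp65s0f0 with your interface)
ethtool enp65s0f0 | grep -E 'Speed|Duplex|Link detected'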

When checking the cluster everything seems to be correct, but whenever the cluster degrades (because a disk has to be removed and replaced, or one of the servers is shut down for maintenance), the rebuild is extremely slow, to the point that the VMs' disk access times out.
Has anyone else had the same problem? Any suggestions?
If you need any command output, let me know and I'll send it to you.
 
Does the network for Ceph actually give you the expected speeds? iperf can be used to benchmark the network.
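For example, a minimal sketch with iperf3 (10.0.0.1 is only a placeholder; use the Ceph network IP of the target node):

# On the first node, start the iperf3 server
iperf3 -s
# On another node, test towards the first node's Ceph network IP with a few parallel streams
iperf3 -c 10.0.0.1 -P 4 -t 30
# And the reverse direction as well
iperf3 -c 10.0.0.1 -P 4 -t 30 -R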

If recovery/rebalance brings the cluster to a point where you see performance issues in the guests, it could also be that the cluster does not have enough reserves to handle the additional load. As a quick measure, you could assign a lower priority to recovery/rebalance. See https://pve.proxmox.com/wiki/Ceph_mClock_Tuning
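For example, with the mClock scheduler you could temporarily prioritize client I/O over recovery (a sketch using the standard mClock profiles; see the wiki above for details and per-OSD overrides):

# Show the currently active profile on one OSD (osd.0 is just an example ID)
ceph config show osd.0 osd_mclock_profile
# Give client I/O precedence over recovery/rebalance on all OSDs
ceph config set osd osd_mclock_profile high_client_ops
# Switch back to the default once recovery pressure is no longer an issue
ceph config set osd osd_mclock_profile balanced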
 
Thank you for the prompt response.
The problem is on two servers, 2 and 3; each of them has 2 degraded network cards.

Thank you very much
 
Does anyone know what causes these log entries and how they affect the functioning of Ceph?

Aug 26 07:46:03 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:47:35 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:50:38 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:54:34 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:55:52 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:56:56 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:57:11 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:58:27 pve01-poz kernel: libceph: osd16 (1)10.0.0.2:6817 socket closed (con state OPEN)
Aug 26 07:59:15 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 07:59:19 pve01-poz kernel: libceph: osd5 (1)10.0.0.2:6816 socket closed (con state OPEN)
Aug 26 08:00:03 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 08:03:11 pve01-poz kernel: libceph: osd10 (1)10.0.0.1:6827 socket closed (con state OPEN)
Aug 26 08:04:05 pve01-poz kernel: libceph: osd15 (1)10.0.0.1:6809 socket closed (con state OPEN)
Aug 26 08:12:01 pve01-poz kernel: libceph: osd3 (1)10.0.0.1:6807 socket closed (con state OPEN)
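Given the degraded network cards mentioned above, one thing worth checking is whether these "socket closed" messages line up with errors on the Ceph interfaces (a sketch; the interface name is a placeholder):

# RX/TX error and drop counters for the Ceph NIC (replace enp65s0f0 with your interface)
ip -s link show enp65s0f0
# Driver-level counters such as CRC errors, if the driver exposes them
ethtool -S enp65s0f0 | grep -iE 'err|drop|crc'
# Cluster health and current OSD up/down status while the messages appear
ceph -s
ceph osd tree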
 
