Since we upgraded to version 18.2.2, the cluster has been noticeably slow: when a disk is removed, it takes a whole day to return to a HEALTH_OK state.
To give you an idea of the architecture: I have 3 identical servers, each with 2 sockets of 24 cores / 2 threads per core, 512 GB of RAM, 6 × 2 TB SAS disks for Ceph, 1 SSD for the Ceph database, and 2 SSDs in mirrored RAID for the system.
Ceph has two dedicated network interfaces in a 20 Gb LACP bond; it has always been configured this way.
When checking the cluster everything appears correct, but whenever it becomes degraded (because a disk has to be removed and replaced, or one of the servers is shut down for maintenance), the rebuild is extremely slow, to the point that the VMs' disk access times out.
Has anyone else had the same problem? Any suggestions?
If you need any command output, let me know and I'll post it.