Serious regression in ceph recovery

mgiammarco

Hello,
I have a Proxmox 6.4 cluster with Ceph 14.2.20 on three servers, each with 192 GB RAM, 48 cores, 4 SSDs, and 10 Gb Ethernet, under light load (a few VMs).
One of the servers filled its root partition because of a failing NFS mount (see another thread), so the Ceph mon immediately stopped working.
After I solved the problem, Ceph (replica 3) started recovering and rebalancing (5% of objects to recover).
People immediately started complaining about very slow VMs. I checked, and disk access was indeed very slow.
I ran this command:
ceph tell 'osd.*' injectargs --osd-max-backfills=1 --osd-recovery-max-active=1

No improvement. So I disabled recovery and rebalancing via the global flags, and now everything is fast again.
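
For reference, the CLI equivalent of toggling those global flags would be roughly the following. This is only a sketch: it assumes the flags in question are `norecover` and `norebalance`, and the helper names (`run`, `pause_recovery`, `resume_recovery`) and the `DRY_RUN` variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes the global flags are norecover and norebalance.
# The helper names and DRY_RUN are illustrative, not part of Ceph or Proxmox.

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Pause recovery and rebalancing cluster-wide.
pause_recovery() {
    run ceph osd set norecover
    run ceph osd set norebalance
}

# Clear both flags so recovery and rebalancing can continue.
resume_recovery() {
    run ceph osd unset norebalance
    run ceph osd unset norecover
}
```

After sourcing the file, calling `pause_recovery` with `DRY_RUN=1` set prints the commands without touching the cluster. While the flags are set, the degraded objects are not repaired, so this is a stopgap and not a fix.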

But to me this is a serious regression considering how powerful the hardware is: a little recovery brought down a three-node cluster. This is not HA.
What happened?
Thanks,
Mario
 
