Hello,
I have a Proxmox 6.4 cluster with Ceph 14.2.20 on three servers, each with 192 GB RAM, 48 cores, 4 SSDs and 10 Gb Ethernet, under light load (just a few VMs).
One of the servers had its root partition fill up because of a failing NFS mount (see my other thread), and the Ceph mon on that node immediately stopped working.
After I solved the problem, Ceph (replica 3) started recovering and rebalancing (about 5% of objects to recover).
People immediately started complaining about very slow VMs; I checked, and their disk access was indeed extremely slow.
I ran this command:
ceph tell 'osd.*' injectargs '--osd-max-backfills=1 --osd-recovery-max-active=1'
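For reference, the same throttles can also be made persistent through the monitor config store (assuming the Nautilus centralized config is in use, so they survive OSD restarts); a minimal sketch:

# Persist the recovery throttles in the cluster config store
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1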
No improvement. So I disabled recovery and rebalancing with the global OSD flags, and now everything is fast again.
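Concretely, I mean the standard cluster-wide flags (they can be cleared again later with ceph osd unset):

# Pause recovery and rebalancing cluster-wide
ceph osd set norecover
ceph osd set norebalance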
But to me this looks like a serious problem given how powerful the hardware is: a little recovery brought a three-node cluster to its knees, and that is not HA.
What happened?
Thanks,
Mario