Ceph: sudden slow ops, freezes, and slow-downs

Hey guys, I am also getting this issue; however, I am not using CephFS, and this is a new cluster setup with NVMe drives. I posted a question on the forums as well: https://forum.proxmox.com/threads/ceph-slow-ops.121033/

Just to summarize: new setup, added the OSDs, created the pools, and I get slow ops all over the place and no healthy PGs at all.
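For anyone hitting the same thing, the usual first step is to see which OSDs are reporting the slow ops and what state the PGs are actually in. A quick sketch using the standard Ceph CLI (run from any node with an admin keyring):

```shell
# Overall cluster state: health flags, PG summary, client IO rates
ceph -s

# Details on exactly which OSDs report slow ops and why PGs are unhealthy
ceph health detail

# Per-OSD commit/apply latency -- a single outlier here often points
# at one bad disk or one bad network link rather than a cluster-wide issue
ceph osd perf

# PG states at a glance (active+clean vs. peering/degraded/undersized)
ceph pg stat
```

If the slow ops cluster around a small set of OSDs, that narrows the problem to those disks or the node/link they sit on.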

Tried downgrading the kernel as suggested, but to no avail.

Ceph is unusable on my 3-node instance, so I switched back to ZFS for now.

Not sure if any of you guys have found a solution for this issue?

Thanks!
My issue was related to CephFS: probably the Ceph upgrade I did back then, combined with a huge number of small files on CephFS. The solution for me was to throw away CephFS, and since then everything works like a charm. So, if you are not using CephFS, there must be some other issue in your setup.
 
Hi,

I'm using a 4-node cluster with Ceph (PVE 7.3.4 and Ceph 17.2.5) and 12 HDD OSDs (3 OSDs per node).

The Ceph network is a dedicated 10 GbE network for this 4-node cluster.
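Since everything rides on that dedicated 10 GbE network, it may be worth ruling it out first; a quick sanity check is an iperf3 run between each pair of nodes over the Ceph network (the address below is a placeholder):

```shell
# On one node, start a listener bound to the Ceph network:
iperf3 -s

# From each other node, test toward it (10.10.10.1 is a placeholder
# for the first node's Ceph-network address):
iperf3 -c 10.10.10.1 -t 10
# A healthy 10 GbE link should report roughly 9.4 Gbit/s; also watch
# the "Retr" column -- retransmits suggest a cabling/transceiver problem.
```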

About a year ago, with previous versions, Ceph RBD and CephFS were working properly: fast and usable.

Since then, Ceph has become slow and unusable: slow ops, freezes, and slow-downs...

For example:
  • when I add a new HDD OSD, the recovery/rebalance speed is between 200-250 MiB/s, which is, in my opinion, a good result.
  • when I want to back up a VM (<10 GB) from a node to CephFS, it takes many hours...
  • when I create a 1 GiB file on CephFS:
dd if=/dev/zero of=test.img bs=1M count=1024
I get warnings/errors such as: MDS report slow metadata IOs, osd.(different numbers) slow ops...
The time to create the 1 GiB file varies between 2 seconds and a few minutes...
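As a side note, a plain dd like the one above can finish as soon as the data lands in the client's page cache, so the timings are misleading; adding conv=fsync makes dd wait until the data has actually been flushed toward the OSDs. A sketch (the default target below is /tmp so it runs anywhere; point TARGET at your CephFS mount instead, e.g. /mnt/pve/cephfs, which is an assumed path):

```shell
# Point this at your CephFS mount (e.g. /mnt/pve/cephfs); /tmp is just
# a safe default so the commands run as written.
TARGET=${TARGET:-/tmp}

# Buffered write: may return as soon as the data is in the page cache
dd if=/dev/zero of="$TARGET/test.img" bs=1M count=1024

# Write and flush before dd exits -- this number reflects the real
# write latency of the underlying filesystem
dd if=/dev/zero of="$TARGET/test.img" bs=1M count=1024 conv=fsync

rm -f "$TARGET/test.img"
```

If the fsync variant is consistently slow while the buffered one is fast, the bottleneck is in Ceph (or the network below it), not on the client.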

Any suggestions or tips are welcome.

Regards
 
My issue was related to CephFS: probably the Ceph upgrade I did back then, combined with a huge number of small files on CephFS. The solution for me was to throw away CephFS, and since then everything works like a charm. So, if you are not using CephFS, there must be some other issue in your setup.
Yep, I read that many of the issues are caused by CephFS, so my issue was kinda isolated. I fixed it by replacing my botched 10 GbE Cisco DAC module.

It worked like a charm after that.
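For reference, a flaky DAC or transceiver usually shows up as growing link-layer error counters long before the link drops entirely. Something like this makes it visible (the interface name is a placeholder for your 10 GbE Ceph-network NIC):

```shell
IFACE=enp1s0f0   # placeholder -- your 10 GbE Ceph-network interface

# Negotiated speed and link state
ethtool "$IFACE" | grep -E 'Speed|Link detected'

# NIC statistics; nonzero or growing CRC/error/drop counters
# point at a bad cable, DAC module, or transceiver
ethtool -S "$IFACE" | grep -iE 'err|crc|drop'
```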